HashMap Internals
The Big Picture
A HashMap (called unordered_map in C++) is the workhorse data structure behind constant-time
key-value lookups. It provides O(1) average-case insertion, lookup, and deletion. But how does
it actually achieve this? Understanding the internals transforms a HashMap from a magical black box
into a tool you can reason about, debug, and even exploit in competitive programming.
At its core, a HashMap works in two stages:
- Hash the key to produce an integer index.
- Store the value at (or near) that index in an underlying array.
The entire design challenge revolves around one question: what happens when two different keys hash to the same index? That is the collision problem, and everything in this document flows from it.
Hash Functions
What Is a Hash Function?
A hash function is a mapping:
h : KeySpace --> {0, 1, 2, ..., m-1}
where m is the size of the underlying array (the "table"). It takes an arbitrary key (an integer,
a string, a struct) and produces an integer that serves as an index into the array.
Essential Properties
| Property | Why It Matters |
|---|---|
| Deterministic | Same key must always produce the same hash |
| Uniform distribution | Keys should spread evenly across buckets to minimize collisions |
| Fast to compute | Hashing happens on every operation; it must be O(1) or close to it |
| Avalanche effect | Small change in key should produce a very different hash value |
Hash Functions for Integers
The simplest approach: division method.
h(k) = k mod m
If m = 10 and k = 37, then h(37) = 7.
Why use a prime number for m?
Consider what happens if m is a power of 2, say m = 8:
h(k) = k mod 8 (equivalent to taking last 3 bits of k)
k = 0 (000) --> bucket 0
k = 8 (1000) --> bucket 0
k = 16 (10000) --> bucket 0
k = 24 (11000) --> bucket 0
If your keys share common factors with m, they will all cluster into the same buckets.
A prime m has no common factors with typical key patterns, so the distribution is more uniform.
Multiplication method (Knuth's suggestion):
h(k) = floor(m * frac(k * A))
where frac(x) is the fractional part of x and A is an irrational constant (a popular choice is the golden-ratio conjugate A = (sqrt(5) - 1) / 2 ~ 0.6180).
This works well regardless of m -- it need not be prime.
Hash Functions for Strings
The standard approach is the polynomial rolling hash:
h(s) = ( s[0]*p^0 + s[1]*p^1 + s[2]*p^2 + ... + s[n-1]*p^(n-1) ) mod m
where p is a prime base (commonly 31 or 37) and m is a large prime modulus.
long long stringHash(const string& s, int tableSize) {
    const long long p = 31;           // prime base
    const long long mod = 1e9 + 9;    // large prime modulus
    long long hash = 0, p_pow = 1;
    for (char c : s) {
        hash = (hash + (c - 'a' + 1) * p_pow) % mod;
        p_pow = (p_pow * p) % mod;
    }
    return hash % tableSize;
}
Why does this work? Each character contributes to a different power of p, so "abc" and "bac"
produce different hash values. The polynomial structure ensures that even a single character
change creates a completely different hash.
Good vs Bad Hash Functions
Terrible hash function #1: Always return 0
int badHash1(int key) { return 0; }
Every key goes to bucket 0. The HashMap degenerates into a linked list. All operations become O(n).
Terrible hash function #2: Return the key itself (without mod)
int badHash2(int key) { return key; } // no modular reduction!
This causes out-of-bounds access or requires an impossibly large array.
Terrible hash function #3: Ignoring part of the key
int badHash3(string key) { return key[0] % tableSize; } // only uses first char
All strings starting with 'a' go to the same bucket. "apple", "algorithm", "array" all collide.
Good hash function: Considers all bits, spreads uniformly
// C++ std::hash for integers uses identity or splitmix-like functions
// For strings, it uses FNV-1a or similar
size_t goodHash(int key) {
    // splitmix64 -- excellent distribution
    uint64_t x = key;
    x += 0x9e3779b97f4a7c15;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
    x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
    return x ^ (x >> 31);
}
Visualizing Good vs Bad Distribution
With a good hash, n keys spread across m buckets so each chain holds roughly n/m entries; with a bad one, a few buckets hold long chains while most sit empty.
Collision Resolution
No matter how good your hash function is, collisions are inevitable. By the Birthday Paradox,
with a table of size m, you expect a collision after inserting only about sqrt(m) elements.
For a table of size 1,000,000, expect a collision after roughly 1,000 insertions.
There are two fundamental approaches to handling collisions.
Method 1: Chaining (Separate Chaining)
Idea: Each bucket in the array holds a pointer to a linked list (or dynamic array). When multiple keys hash to the same bucket, they are all stored in that bucket's list.
Operations with chaining:
- Insert(key, value): Compute idx = hash(key). Walk the chain at buckets[idx] to check whether the key already exists (if so, update it); otherwise, prepend a new node. O(1) amortized.
- Lookup(key): Compute idx = hash(key). Walk the chain at buckets[idx], comparing each node's key. Return the value when found, or null on reaching the end.
- Delete(key): Same walk as lookup, but unlink the node from the list.
Load Factor:
alpha = n / m (number of elements / number of buckets)
The load factor directly determines performance:
- alpha = 0.5: On average, each chain has 0.5 elements. Very fast.
- alpha = 1.0: On average, each chain has 1 element. Still fast.
- alpha = 5.0: On average, each chain has 5 elements. Getting slow.
A successful search requires traversing, on average, 1 + alpha/2 nodes.
An unsuccessful search requires traversing, on average, alpha nodes.
When alpha exceeds a threshold (typically 0.75), we rehash.
Method 2: Open Addressing
Idea: All elements are stored directly in the table array (no linked lists). When a collision occurs, we "probe" for the next available slot using a deterministic sequence.
The table must have m >= n (at least as many buckets as elements); in practice it is kept strictly larger, so the load factor alpha stays below 1.
Linear Probing
The simplest probing strategy: if slot h(k) is occupied, try h(k)+1, then h(k)+2, etc.
h(k, i) = (h(k) + i) mod m for i = 0, 1, 2, ...
Example: Insert 10, 22, 31, 4, 15, 28, 17, 88 into table of size 11 with h(k) = k mod 11
Insert 10: h(10)=10 Table: [ _ _ _ _ _ _ _ _ _ _ 10]
Insert 22: h(22)=0 Table: [22 _ _ _ _ _ _ _ _ _ 10]
Insert 31: h(31)=9 Table: [22 _ _ _ _ _ _ _ _ 31 10]
Insert 4: h(4)=4 Table: [22 _ _ _ 4 _ _ _ _ 31 10]
Insert 15: h(15)=4 CLASH! Table: [22 _ _ _ 4 15 _ _ _ 31 10]
try 5, empty ^--^
Insert 28: h(28)=6 Table: [22 _ _ _ 4 15 28 _ _ 31 10]
Insert 17: h(17)=6 CLASH!
try 7, empty Table: [22 _ _ _ 4 15 28 17 _ 31 10]
^------^
Insert 88: h(88)=0 CLASH!
try 1, empty Table: [22 88 _ _ 4 15 28 17 _ 31 10]
^--^
The Clustering Problem:
Linear probing suffers from primary clustering. Once a contiguous block of occupied cells forms, any new key that hashes anywhere into that block will extend it by one more cell. The bigger the cluster, the more likely it is to grow even bigger. This creates a vicious cycle.
Clustering visualization (occupied slots marked with #):
Before: [ _ _ # # # _ _ # _ _ ]
^
key hashes here, walks right
After: [ _ _ # # # # _ # _ _ ]
^---^
cluster grew! Now even more keys will hit it.
Much later: [ # # # # # # # # _ _ ]
^-----huge cluster-----^
Every insertion probes many slots!
Quadratic Probing
Instead of stepping linearly, use quadratic steps:
h(k, i) = (h(k) + c1*i + c2*i^2) mod m for i = 0, 1, 2, ...
Common choice: c1 = 0, c2 = 1 giving h(k, i) = (h(k) + i^2) mod m.
Probing sequence for h(k) = 5, m = 11:
i=0: slot 5
i=1: slot 6 (5 + 1)
i=2: slot 9 (5 + 4)
i=3: slot 3 (5 + 9) mod 11
i=4: slot 10 (5 + 16) mod 11
The jumps spread out quickly, avoiding primary clustering. However, keys that hash to the same initial slot still follow the same probe sequence -- this is called secondary clustering.
Caveat: Quadratic probing may not visit every slot. It is guaranteed to find an empty slot if alpha < 0.5 and m is prime, because the first (m+1)/2 probe positions are then all distinct.
Double Hashing
Use a second hash function to determine the step size:
h(k, i) = (h1(k) + i * h2(k)) mod m
where h2(k) must never be 0 (otherwise we never advance) and should be relatively prime to m so the probe sequence visits every slot -- automatic when m is prime. Common choice: h2(k) = 1 + (k mod (m-1)).
Example: h1(k)=k%11, h2(k)=1+(k%10), m=11
Insert key 22: h1=0, h2=1+(22%10)=3
Probing: slot 0, slot 3, slot 6, slot 9, slot 1, ...
Insert key 33: h1=0, h2=1+(33%10)=4
Probing: slot 0, slot 4, slot 8, slot 1, slot 5, ...
Two keys with the same h1 value will have different h2 values and thus different probe
sequences. This eliminates both primary and secondary clustering.
Comparison of open addressing methods:
| Method | Probe sequence | Clustering | Cost per probe |
|---|---|---|---|
| Linear probing | h(k) + i | Primary | Cheapest, cache-friendly |
| Quadratic probing | h(k) + i^2 | Secondary | Cheap |
| Double hashing | h1(k) + i*h2(k) | None | Requires a second hash computation |
Visual Comparison: Same Insertions, Different Strategies
Insert keys: 10, 22, 31, 4, 15 into a table of size 7.
Chaining (h(k) = k % 7):
Bucket 1: 15 -> 22 (22 and 15 both hash to 1; new nodes are prepended)
Bucket 3: 31 -> 10 (10 and 31 both hash to 3)
Bucket 4: 4
Every insert is O(1); a lookup walks a chain of at most 2 nodes here.
Linear Probing (h(k) = k % 7):
Insert 10: h=3 [ _ _ _ 10 _ _ _]
Insert 22: h=1 [ _ 22 _ 10 _ _ _]
Insert 31: h=3 -> 4 [ _ 22 _ 10 31 _ _] (3 occupied, try 4)
Insert 4: h=4 -> 5 [ _ 22 _ 10 31 4 _] (4 occupied, try 5)
Insert 15: h=1 -> 2 [ _ 22 15 10 31 4 _] (1 occupied, try 2)
Notice the cluster forming at indices 3-5!
Lookup for 4: check slot 4 (31, no), slot 5 (4, yes). 2 probes.
Quadratic Probing (h(k) = k % 7):
Insert 10: h=3 [ _ _ _ 10 _ _ _]
Insert 22: h=1 [ _ 22 _ 10 _ _ _]
Insert 31: h=3 -> 4 [ _ 22 _ 10 31 _ _] (3+1^2=4)
Insert 4: h=4 -> 5 [ _ 22 _ 10 31 4 _] (4+1^2=5)
Insert 15: h=1 -> 2 [ _ 22 15 10 31 4 _] (1+1^2=2)
Same result here, but in general quadratic avoids the growing cluster.
Rehashing
When the load factor exceeds a threshold, the table is resized (typically doubled) and all elements are reinserted into the new table. This is called rehashing.
Why reinsert everything? Because an element's bucket index depends on the table size:
h(k) = k mod m. If m changes, the element may belong in a completely different bucket.
Step-by-Step Rehashing Example
Before rehash (m = 4), keys {3, 5, 7}:
bucket 0: (empty)
bucket 1: 5 (5 % 4 = 1)
bucket 2: (empty)
bucket 3: 3 -> 7 (3 % 4 = 3 and 7 % 4 = 3 -- a collision chain)
After rehash (m = 8), every element reinserted:
bucket 3: 3 (3 % 8 = 3)
bucket 5: 5 (5 % 8 = 5)
bucket 7: 7 (7 % 8 = 7)
All other buckets are empty; the old collision chain is gone.
Cost Analysis:
Rehashing is O(n) because we must reinsert every element. But consider the total cost of n insertions:
Insertions 1..m: m insertions, each O(1) = O(m)
Rehash at m: reinsert m elements = O(m)
Insertions m+1..2m: m insertions, each O(1) = O(m)
Rehash at 2m: reinsert 2m elements = O(2m)
...
Total cost for n insertions ≈ n + (n/2 + n/4 + n/8 + ... + 1) < n + n = 2n = O(n)
Amortized cost per insertion = O(n) / n = O(1)
This is the classic amortized analysis (same idea as vector::push_back doubling).
Deletion in Open Addressing: The Tombstone Problem
In chaining, deletion is straightforward -- just remove a node from a linked list. But in open addressing, deletion is tricky.
If you simply clear a slot, you break the probe chain for other elements:
Slots: [22, 88, _, _, 4, 15, 28, 17, _, 31, 10]
Indices: 0-10
Suppose 88 was placed at index 1 because 0 was occupied (linear probing).
If we delete 22 (index 0):
Slots: [ _, 88, _, _, 4, 15, 28, 17, _, 31, 10]
Now search for 88: h(88)=0, slot 0 is empty --> "88 not found!" WRONG!
Solution: Tombstones (lazy deletion)
Instead of clearing the slot, mark it as "deleted" (a tombstone). During lookup, skip over tombstones. During insertion, tombstones can be reused.
Slots: [DEL, 88, _, _, 4, 15, 28, 17, _, 31, 10]
Search for 88: h(88)=0, slot 0 is DEL (skip), slot 1 has 88 --> found!
Downside: too many tombstones degrade performance. Periodic rehashing cleans them up.
C++ Implementation: HashMap from Scratch
#include <bits/stdc++.h>
using namespace std;

template<typename K, typename V>
class HashMap {
    struct Node {
        K key;
        V value;
        Node* next;
        Node(K k, V v) : key(k), value(v), next(nullptr) {}
    };

    vector<Node*> buckets;
    int sz;
    int capacity;
    float maxLoadFactor;

    int getBucket(const K& key) const {
        return std::hash<K>{}(key) % capacity;
    }

    void rehash() {
        int oldCap = capacity;
        capacity *= 2;
        vector<Node*> oldBuckets = move(buckets);
        buckets.assign(capacity, nullptr);
        sz = 0;
        for (int i = 0; i < oldCap; i++) {
            Node* curr = oldBuckets[i];
            while (curr) {
                put(curr->key, curr->value);
                Node* temp = curr;
                curr = curr->next;
                delete temp;
            }
        }
    }

public:
    HashMap(int cap = 16) : capacity(cap), sz(0), maxLoadFactor(0.75f) {
        buckets.assign(capacity, nullptr);
    }

    ~HashMap() {
        for (int i = 0; i < capacity; i++) {
            Node* curr = buckets[i];
            while (curr) {
                Node* temp = curr;
                curr = curr->next;
                delete temp;
            }
        }
    }

    // Insert or update
    void put(const K& key, const V& value) {
        if ((float)sz / capacity >= maxLoadFactor) {
            rehash();
        }
        int idx = getBucket(key);
        Node* curr = buckets[idx];
        // Check if key already exists -- update
        while (curr) {
            if (curr->key == key) {
                curr->value = value;
                return;
            }
            curr = curr->next;
        }
        // Key not found -- prepend new node (O(1) insertion)
        Node* newNode = new Node(key, value);
        newNode->next = buckets[idx];
        buckets[idx] = newNode;
        sz++;
    }

    // Lookup -- returns pointer to value, or nullptr if not found
    V* get(const K& key) {
        int idx = getBucket(key);
        Node* curr = buckets[idx];
        while (curr) {
            if (curr->key == key) return &curr->value;
            curr = curr->next;
        }
        return nullptr;
    }

    // Delete -- returns true if key was found and removed
    bool remove(const K& key) {
        int idx = getBucket(key);
        Node* curr = buckets[idx];
        Node* prev = nullptr;
        while (curr) {
            if (curr->key == key) {
                if (prev) prev->next = curr->next;
                else buckets[idx] = curr->next;
                delete curr;
                sz--;
                return true;
            }
            prev = curr;
            curr = curr->next;
        }
        return false;
    }

    bool contains(const K& key) {
        return get(key) != nullptr;
    }

    int size() const { return sz; }
    bool empty() const { return sz == 0; }

    // Operator [] for convenient access (inserts default if missing)
    V& operator[](const K& key) {
        V* val = get(key);
        if (val) return *val;
        put(key, V{});
        return *get(key);
    }
};

// Usage example
int main() {
    HashMap<string, int> map;
    map.put("apple", 5);
    map.put("banana", 3);
    map.put("cherry", 7);

    cout << *map.get("apple") << endl;       // 5
    cout << map["banana"] << endl;           // 3

    map.remove("cherry");
    cout << map.contains("cherry") << endl;  // 0 (false)

    // Overwrite
    map.put("apple", 10);
    cout << *map.get("apple") << endl;       // 10
    return 0;
}
Open Addressing Implementation
For completeness, here is a linear-probing based implementation:
#include <bits/stdc++.h>
using namespace std;

template<typename K, typename V>
class OpenAddressMap {
    enum SlotState { EMPTY, OCCUPIED, DELETED };

    struct Slot {
        K key;
        V value;
        SlotState state = EMPTY;
    };

    vector<Slot> table;
    int sz;
    int capacity;

    int getBucket(const K& key) const {
        return std::hash<K>{}(key) % capacity;
    }

    void rehash() {
        vector<Slot> oldTable = move(table);
        capacity *= 2;
        table.assign(capacity, Slot());
        sz = 0;
        for (auto& slot : oldTable) {
            if (slot.state == OCCUPIED)
                put(slot.key, slot.value);
        }
    }

public:
    OpenAddressMap(int cap = 16) : capacity(cap), sz(0) {
        table.assign(capacity, Slot());
    }

    void put(const K& key, const V& value) {
        if (sz * 2 >= capacity) rehash(); // keep alpha < 0.5
        int idx = getBucket(key);
        int firstDeleted = -1;
        for (int i = 0; i < capacity; i++) {
            int probe = (idx + i) % capacity;
            if (table[probe].state == EMPTY) {
                // Reuse the earliest tombstone if we passed one
                int target = (firstDeleted != -1) ? firstDeleted : probe;
                table[target] = {key, value, OCCUPIED};
                sz++;
                return;
            }
            if (table[probe].state == DELETED && firstDeleted == -1) {
                firstDeleted = probe;
            }
            if (table[probe].state == OCCUPIED && table[probe].key == key) {
                table[probe].value = value; // update existing
                return;
            }
        }
        // Every slot is OCCUPIED or DELETED (shouldn't happen with proper rehashing)
        if (firstDeleted != -1) {
            table[firstDeleted] = {key, value, OCCUPIED};
            sz++;
        }
    }

    V* get(const K& key) {
        int idx = getBucket(key);
        for (int i = 0; i < capacity; i++) {
            int probe = (idx + i) % capacity;
            if (table[probe].state == EMPTY) return nullptr;
            if (table[probe].state == OCCUPIED && table[probe].key == key)
                return &table[probe].value;
            // If DELETED, keep probing
        }
        return nullptr;
    }

    bool remove(const K& key) {
        int idx = getBucket(key);
        for (int i = 0; i < capacity; i++) {
            int probe = (idx + i) % capacity;
            if (table[probe].state == EMPTY) return false;
            if (table[probe].state == OCCUPIED && table[probe].key == key) {
                table[probe].state = DELETED; // tombstone
                sz--;
                return true;
            }
        }
        return false;
    }
};
Complexity Analysis
When does worst case happen?
All n keys hash to the same bucket. The HashMap degenerates into a linked list (for chaining)
or a full scan of the table (for open addressing). Every operation becomes O(n).
This is not just theoretical -- in competitive programming, adversaries craft inputs specifically to trigger worst-case behavior. See the CP tips section below.
Space overhead:
- Chaining: m pointers (buckets) + n nodes (each holding key, value, and a next pointer). Typical overhead is ~32-48 bytes per entry due to pointer and allocation overhead.
- Open addressing: m slots, each holding a key-value pair. No pointer overhead, but it requires m > n (load factor < 1), so some slots are always wasted.
unordered_map vs map in C++
When to use which:
- unordered_map: Default choice when you need fast lookups and don't care about order.
- map: When you need ordered iteration, range queries (lower_bound, upper_bound), or a guaranteed O(log n) worst case.
- Tip: If n < 100, a sorted vector with binary search often beats both due to cache locality.
Competitive Programming Tips
Anti-Hash Tests and Custom Hash Functions
Some competitive programming problems have test cases specifically designed to cause hash
collisions and make unordered_map run in O(n^2). This is possible because GCC's
unordered_map uses a simple modular hash, and the bucket count is predictable (powers of
2 or specific primes).
The attack:
The attacker knows the implementation's bucket-count sequence -- GCC draws bucket counts from a fixed, publicly known prime table, while other implementations use powers of 2 -- and crafts keys that all hash to the same bucket.
The defense: Use a randomized custom hash.
struct custom_hash {
    static uint64_t splitmix64(uint64_t x) {
        // http://xorshift.di.unimi.it/splitmix64.c
        x += 0x9e3779b97f4a7c15;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
        x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
        return x ^ (x >> 31);
    }
    size_t operator()(uint64_t x) const {
        // Mix in a per-run random seed so the hash is unpredictable
        static const uint64_t FIXED_RANDOM =
            chrono::steady_clock::now().time_since_epoch().count();
        return splitmix64(x + FIXED_RANDOM);
    }
};

// Usage:
unordered_map<int, int, custom_hash> safe_map;
Why this works: The FIXED_RANDOM seed changes every execution, making it impossible for
an attacker to predict which keys will collide.
For Pairs as Keys
struct pair_hash {
    size_t operator()(const pair<int,int>& p) const {
        auto h1 = custom_hash{}(p.first);
        auto h2 = custom_hash{}(p.second);
        return h1 ^ (h2 << 32); // combine hashes (assumes 64-bit size_t)
    }
};

unordered_map<pair<int,int>, int, pair_hash> pairMap;
Performance Tips
// Reserve capacity upfront to avoid rehashing
unordered_map<int, int, custom_hash> mp;
mp.max_load_factor(0.25); // lower load factor = fewer collisions
mp.reserve(1 << 20);      // space for 2^20 elements; set max_load_factor
                          // FIRST so reserve allocates enough buckets

// For small key ranges, use a plain array instead!
int freq[1000001] = {};   // much faster than unordered_map
Common Pitfall: Iterating While Modifying
// WRONG -- undefined behavior!
for (auto it = mp.begin(); it != mp.end(); ++it) {
    if (it->second == 0) mp.erase(it); // invalidates it; the ++it that follows is UB
}

// CORRECT
for (auto it = mp.begin(); it != mp.end(); ) {
    if (it->second == 0) it = mp.erase(it); // erase returns the next valid iterator
    else ++it;
}
Interview Questions
1. Design a HashMap from Scratch
This is a classic system design / OOD question. The implementation above covers it completely. Key points to discuss:
- Choice of hash function
- Collision resolution strategy
- Rehashing policy and amortized analysis
- Thread safety (mention ConcurrentHashMap for bonus points)
2. Two Sum (LeetCode 1)
Problem: Given array nums and target, find two indices whose values sum to target.
Intuition: For each number x, we need target - x. A HashMap lets us check this in O(1).
vector<int> twoSum(vector<int>& nums, int target) {
    unordered_map<int, int> seen; // value -> index
    for (int i = 0; i < (int)nums.size(); i++) {
        int complement = target - nums[i];
        if (seen.count(complement)) {
            return {seen[complement], i};
        }
        seen[nums[i]] = i;
    }
    return {}; // no solution
}
// Time: O(n), Space: O(n)
3. Group Anagrams (LeetCode 49)
Problem: Group strings that are anagrams of each other.
Intuition: Two strings are anagrams if they have the same sorted form. Use sorted string as the HashMap key.
vector<vector<string>> groupAnagrams(vector<string>& strs) {
    unordered_map<string, vector<string>> groups;
    for (auto& s : strs) {
        string key = s;
        sort(key.begin(), key.end());
        groups[key].push_back(s);
    }
    vector<vector<string>> result;
    for (auto& [key, group] : groups) {
        result.push_back(move(group));
    }
    return result;
}
// Time: O(n * k log k) where k is max string length
// Space: O(n * k)
Optimization: Instead of sorting, use character frequency as the key:
string getKey(const string& s) {
    int count[26] = {};
    for (char c : s) count[c - 'a']++;
    string key;
    for (int i = 0; i < 26; i++) {
        key += '#';
        key += to_string(count[i]);
    }
    return key;
}
// Now grouping is O(n * k) instead of O(n * k log k)
4. LRU Cache (LeetCode 146)
Problem: Design a cache with capacity k. On get(key), return value and mark as recently
used. On put(key, value), insert/update and evict the least recently used item if at capacity.
Intuition: HashMap for O(1) lookup + Doubly Linked List for O(1) ordering.
class LRUCache {
    int capacity;
    list<pair<int,int>> dll;                               // {key, value}, most recent at front
    unordered_map<int, list<pair<int,int>>::iterator> map; // key -> node in the list

    void moveToFront(int key, int value) {
        if (map.count(key)) dll.erase(map[key]);
        dll.push_front({key, value});
        map[key] = dll.begin();
    }

public:
    LRUCache(int capacity) : capacity(capacity) {}

    int get(int key) {
        if (!map.count(key)) return -1;
        int val = map[key]->second;
        moveToFront(key, val);
        return val;
    }

    void put(int key, int value) {
        moveToFront(key, value);
        if ((int)map.size() > capacity) {
            int lruKey = dll.back().first;
            dll.pop_back();
            map.erase(lruKey);
        }
    }
};
// All operations O(1)
5. Subarray Sum Equals K (LeetCode 560)
Problem: Count subarrays whose elements sum to k.
Intuition: Use prefix sums. If prefix[j] - prefix[i] = k, then subarray [i+1..j] has
sum k. We need to count how many previous prefix sums equal prefix[j] - k.
int subarraySum(vector<int>& nums, int k) {
    unordered_map<int, int> prefixCount;
    prefixCount[0] = 1; // empty prefix has sum 0
    int sum = 0, count = 0;
    for (int x : nums) {
        sum += x;
        if (prefixCount.count(sum - k))
            count += prefixCount[sum - k];
        prefixCount[sum]++;
    }
    return count;
}
// Time: O(n), Space: O(n)
Advanced: How C++ unordered_map Actually Works (GCC Implementation)
Under the hood, GCC's unordered_map uses separate chaining, but with a single singly-linked list threading through all elements; the bucket array just stores pointers into that list.
This design means:
- Iteration is O(n), not O(n + m), because we just walk the single list.
- Rehashing only needs to rearrange pointers in the list and update the bucket array.
- But each node requires heap allocation, which is slow (cache-unfriendly).
For high-performance scenarios, consider alternatives:
- absl::flat_hash_map (Google's open-addressing map with SIMD probing)
- robin_hood::unordered_map (Robin Hood hashing, very cache-friendly)
- ska::flat_hash_map (similar to absl)
Summary
Key takeaways:
- Hash functions map keys to indices. Good hash functions distribute uniformly.
- Collisions are inevitable. Handle them with chaining or open addressing.
- Rehashing keeps the load factor low, ensuring O(1) amortized operations.
- In competitive programming, use custom hash functions to avoid adversarial inputs.
- unordered_map is O(1) average but O(n) worst case; map is O(log n) guaranteed.