Collision Handling
What Is a Collision and Why Is It Inevitable?
A collision occurs when two different keys produce the same index in a hash table. Given a table of size n and any hash function, the birthday paradox guarantees a collision appears after inserting roughly √n elements — not n elements. For a table of size 100, expect a collision after roughly 10 insertions.
Table size = 7, hash function: h(k) = k % 7
Insert 3: h(3) = 3 → slot 3 is empty → store at 3 ✓
Insert 10: h(10) = 3 → slot 3 is OCCUPIED → COLLISION!
Insert "alice": h("alice") = 4 → store at 4 ✓
Insert "charlie": h("charlie") = 4 → COLLISION!
Collisions cannot be prevented in a general-purpose hash table — only managed. The two foundational strategies are chaining (each slot holds a list) and open addressing (probe for the next empty slot). Every production hash table uses one of these two approaches.
Strategy 1: Separate Chaining
In separate chaining, each slot in the hash table array holds a pointer to a linked list (or another dynamic structure). When two keys collide, both entries go into the list at that slot.
Table with chaining (size 7):
Index Chain
0 → null
1 → [("bob", 72)]
2 → null
3 → [("alice", 95)] → [("charlie", 61)] ← two keys hashed to 3
4 → [("dave", 88)]
5 → null
6 → null
Lookup "charlie":
h("charlie") = 3
Scan chain at 3: ("alice", 95) — not a match
Next: ("charlie", 61) — match! Return 61
Cost: O(1) hash + O(k) chain scan where k = chain length
Chaining Implementation from Scratch
1public class ChainingHashTable {
2
3 private static class Entry {
4 String key;
5 int value;
6 Entry next;
7
8 Entry(String key, int value) {
9 this.key = key;
10 this.value = value;
11 this.next = null;
12 }
13 }
14
15 private Entry[] table;
16 private int capacity;
17 private int size;
18
19 public ChainingHashTable(int capacity) {
20 this.capacity = capacity;
21 this.table = new Entry[capacity];
22 this.size = 0;
23 }
24
25 private int hash(String key) {
26 int h = 0;
27 for (char c : key.toCharArray()) h = 31 * h + c;
28 return Math.abs(h % capacity);
29 }
30
31 // Insert or update — O(k) where k = chain length at slot
32 public void put(String key, int value) {
33 int idx = hash(key);
34 Entry curr = table[idx];
35
36 // Search chain for existing key
37 while (curr != null) {
38 if (curr.key.equals(key)) {
39 curr.value = value; // Update existing
40 return;
41 }
42 curr = curr.next;
43 }
44
45 // Prepend new entry to chain — O(1)
46 Entry newEntry = new Entry(key, value);
47 newEntry.next = table[idx];
48 table[idx] = newEntry;
49 size++;
50 }
51
52 // Lookup — O(k) where k = chain length
53 public Integer get(String key) {
54 int idx = hash(key);
55 Entry curr = table[idx];
56
57 while (curr != null) {
58 if (curr.key.equals(key)) return curr.value;
59 curr = curr.next;
60 }
61
62 return null; // Not found
63 }
64
65 // Delete — O(k) where k = chain length
66 public boolean remove(String key) {
67 int idx = hash(key);
68 Entry curr = table[idx];
69 Entry prev = null;
70
71 while (curr != null) {
72 if (curr.key.equals(key)) {
73 if (prev == null) table[idx] = curr.next; // Remove head
74 else prev.next = curr.next; // Remove middle/tail
75 size--;
76 return true;
77 }
78 prev = curr;
79 curr = curr.next;
80 }
81
82 return false; // Not found
83 }
84
85 public static void main(String[] args) {
86 ChainingHashTable ht = new ChainingHashTable(7);
87
88 ht.put("alice", 95);
89 ht.put("bob", 72);
90 ht.put("charlie", 61);
91 ht.put("dave", 88);
92
93 System.out.println("alice: " + ht.get("alice")); // 95
94 System.out.println("charlie: " + ht.get("charlie")); // 61
95 System.out.println("eve: " + ht.get("eve")); // null
96
97 ht.put("alice", 99); // Update
98 System.out.println("alice updated: " + ht.get("alice")); // 99
99
100 ht.remove("bob");
101 System.out.println("bob after remove: " + ht.get("bob")); // null
102 }
103}Output:
alice: 95
charlie: 61
eve: null
alice updated: 99
bob after remove: null
Dry Run: Insert and Lookup with Chaining
ChainingHashTable, capacity=7, hash(key) = polynomial % 7
Insert ("alice", 95):
h("alice") = 4
table[4] is empty → prepend → table[4] = [("alice", 95)]
Insert ("charlie", 61):
h("charlie") = 4 ← COLLISION with "alice"!
table[4] is not empty → scan chain: "alice" ≠ "charlie"
No match → prepend → table[4] = [("charlie", 61) → ("alice", 95)]
Lookup "alice":
h("alice") = 4
Scan table[4]:
("charlie", 61) → key "charlie" ≠ "alice" → continue
("alice", 95) → key "alice" = "alice" → return 95 ✓
Lookup "eve":
h("eve") = 2
table[2] is empty → return null
Chain length at slot 4: 2 entries
Expected chain length with load factor α: O(α) = O(n/capacity)
Strategy 2: Open Addressing
In open addressing, there are no linked lists. All entries are stored directly in the table array. When a collision occurs, the table is probed — a sequence of alternative slots is tried until an empty slot is found.
The probe sequence defines the "open addressing" variant.
Probe Sequence Types
Linear Probing: probe(i) = (h(key) + i) % capacity (i = 0, 1, 2, 3, ...) Try slot h(key), then h(key)+1, then h(key)+2, ... Simple but causes PRIMARY CLUSTERING — long runs form Quadratic Probing: probe(i) = (h(key) + i²) % capacity (i = 0, 1, 2, 3, ...) Try slot h(key), then h(key)+1, then h(key)+4, then h(key)+9, ... Reduces primary clustering but causes SECONDARY CLUSTERING Double Hashing: probe(i) = (h1(key) + i × h2(key)) % capacity Two independent hash functions — step size varies per key Best distribution, no clustering Requirement: h2(key) must never be 0 and must be coprime to capacity
Linear Probing Implementation
1public class LinearProbingHashTable {
2
3 private static final String DELETED = "__DELETED__"; // Tombstone sentinel
4
5 private String[] keys;
6 private int[] values;
7 private int capacity;
8 private int size;
9
10 public LinearProbingHashTable(int capacity) {
11 this.capacity = capacity;
12 this.keys = new String[capacity];
13 this.values = new int[capacity];
14 this.size = 0;
15 }
16
17 private int hash(String key) {
18 int h = 0;
19 for (char c : key.toCharArray()) h = 31 * h + c;
20 return Math.abs(h % capacity);
21 }
22
23 // Insert — O(1) average with low load factor
24 public void put(String key, int value) {
25 int idx = hash(key);
26
27 // Probe forward until empty or same key
28 while (keys[idx] != null && !keys[idx].equals(DELETED)) {
29 if (keys[idx].equals(key)) {
30 values[idx] = value; // Update existing
31 return;
32 }
33 idx = (idx + 1) % capacity; // Linear probe: next slot
34 }
35
36 keys[idx] = key;
37 values[idx] = value;
38 size++;
39 }
40
41 // Lookup — O(1) average
42 public Integer get(String key) {
43 int idx = hash(key);
44 int start = idx;
45
46 while (keys[idx] != null) {
47 if (!keys[idx].equals(DELETED) && keys[idx].equals(key)) {
48 return values[idx];
49 }
50 idx = (idx + 1) % capacity;
51 if (idx == start) break; // Full circle — not found
52 }
53
54 return null;
55 }
56
57 // Delete — must use tombstone, not null (would break probe chains)
58 public boolean remove(String key) {
59 int idx = hash(key);
60 int start = idx;
61
62 while (keys[idx] != null) {
63 if (!keys[idx].equals(DELETED) && keys[idx].equals(key)) {
64 keys[idx] = DELETED; // Tombstone — preserves probe chain
65 size--;
66 return true;
67 }
68 idx = (idx + 1) % capacity;
69 if (idx == start) break;
70 }
71
72 return false;
73 }
74
75 public static void main(String[] args) {
76 LinearProbingHashTable ht = new LinearProbingHashTable(11);
77
78 ht.put("alice", 95);
79 ht.put("bob", 72);
80 ht.put("charlie", 61);
81 ht.put("dave", 88);
82
83 System.out.println("alice: " + ht.get("alice")); // 95
84 System.out.println("charlie: " + ht.get("charlie")); // 61
85 System.out.println("eve: " + ht.get("eve")); // null
86
87 ht.remove("bob");
88 System.out.println("bob removed: " + ht.get("bob")); // null
89 // alice/charlie/dave still findable — tombstone preserves probe chain
90 System.out.println("dave after bob removed: " + ht.get("dave")); // 88
91 }
92}Output:
alice: 95
charlie: 61
eve: null
bob removed: null
dave still found: 88
Dry Run: Linear Probing — Insert, Collision, Lookup
Table capacity = 7, h("alice") = 3, h("charlie") = 3
Step 1 — Insert ("alice", 95):
idx = h("alice") = 3
table[3] is EMPTY → store here
table: [null, null, null, (alice,95), null, null, null]
Step 2 — Insert ("charlie", 61):
idx = h("charlie") = 3
table[3] is OCCUPIED by "alice" → COLLISION
Probe: idx = (3 + 1) % 7 = 4
table[4] is EMPTY → store here
table: [null, null, null, (alice,95), (charlie,61), null, null]
Lookup "charlie":
idx = h("charlie") = 3
table[3].key = "alice" ≠ "charlie" → probe
idx = 4
table[4].key = "charlie" = "charlie" → return 61 ✓
Why null cannot be used as tombstone:
Delete "alice" (at slot 3) — if we set table[3] = null:
Lookup "charlie": idx=3 → table[3] is null → STOP
Returns null — but "charlie" is at slot 4!
The null terminates the probe early, hiding entries behind it.
Tombstone (DELETED marker) allows probe to continue past deleted slots.
The Tombstone Problem — Why Deletion Is Tricky in Open Addressing
This is the most subtle aspect of open addressing. When you delete an entry, you cannot simply set the slot to null — doing so would break the probe chains of other entries.
Scenario showing the problem:
h("A") = h("B") = h("C") = 2 (all collide at slot 2)
After inserting A, B, C with linear probing:
slot 2 → A
slot 3 → B (A occupied slot 2 — B probed to slot 3)
slot 4 → C (A and B occupied — C probed to slot 4)
Delete B (slot 3) — set slot 3 to null:
table: [null, null, A, null, C, null, null]
Lookup C:
idx = h("C") = 2
table[2] = A ≠ C → probe
idx = 3
table[3] = null → STOP (probe terminates at null)
Returns null — C is at slot 4 but is now unreachable!
Correct fix — tombstone deletion:
Set slot 3 to DELETED (not null)
table: [null, null, A, DELETED, C, null, null]
Lookup C now:
idx = h("C") = 2
table[2] = A ≠ C → probe
idx = 3
table[3] = DELETED → continue probing (not null, keep going)
idx = 4
table[4] = C = C → return value ✓
Chaining vs Open Addressing: Performance Comparison
The right strategy depends on load factor, cache behavior, and memory constraints.
CHAINING: Advantages: - Works well at any load factor (even > 1.0) - Deletion is simple — remove from linked list, no tombstones - Performance degrades gracefully as load increases Disadvantages: - Extra memory for linked list pointers (16+ bytes per node in Java) - Cache unfriendly — list nodes are scattered in memory - Pointer chasing on every chain traversal Performance: - Average lookup: O(1 + α) where α = load factor = n/capacity - α = 0.75: expected 1.75 comparisons per lookup - α = 2.0: expected 3 comparisons per lookup (still usable) OPEN ADDRESSING (linear probing): Advantages: - No extra memory for pointers — all data in one array - Excellent cache performance — probing accesses adjacent memory - Faster in practice at low load factors (α < 0.7) Disadvantages: - Cannot exceed load factor of 1.0 (no room for another entry) - Primary clustering — long runs degrade performance rapidly - Deletion requires tombstones — accumulate over time Performance: - Average successful lookup: O(1/(1-α)) with linear probing - α = 0.5: ~1.5 probes - α = 0.7: ~2.5 probes ← acceptable - α = 0.9: ~5.5 probes ← slow - α → 1.0: → infinity ← table full, must resize
Side-by-Side Example: Same Keys, Same Collisions
Keys to insert: "alice"→h=2, "bob"→h=5, "carl"→h=2, "diana"→h=5
Table capacity: 7
CHAINING:
slot 2 → [("carl",_) → ("alice",_)] (both hash to 2)
slot 5 → [("diana",_) → ("bob",_)] (both hash to 5)
Other slots: null
Memory: 7 array slots + 4 linked list nodes
OPEN ADDRESSING (linear probing):
slot 2 → ("alice",_) (first to arrive)
slot 3 → ("carl",_) (probed from slot 2)
slot 5 → ("bob",_) (first to arrive)
slot 6 → ("diana",_) (probed from slot 5)
Memory: 7 array slots, no extra nodes — all inline
Clustering: The Performance Enemy of Linear Probing
Primary clustering is the main weakness of linear probing. Once a long run of occupied slots forms, it attracts more keys — because any key that lands anywhere in the run gets appended to its end.
Initial: [_, _, _, A, B, C, D, _, E, _, _] 0 1 2 3 4 5 6 7 8 9 10 Keys A,B,C,D,E form a cluster from slots 3-8. Insert F where h(F)=3: Slot 3 occupied → probe 4 occupied → probe 5 occupied → probe 6 occupied → probe 7 EMPTY → store F at slot 7 [_, _, _, A, B, C, D, F, E, _, _] ← cluster now spans 3-8 Insert G where h(G)=6: Slot 6 occupied → probe 7 occupied → probe 8 occupied → probe 9 EMPTY → store G at slot 9 Primary clustering: the cluster grows at both ends, attracting more keys. Lookup cost for any key in the cluster is O(cluster_length) in the worst case. Solutions: Quadratic probing: spread probes further (i², reduces primary clustering) Double hashing: per-key step size (nearly eliminates clustering) Keep load factor below 0.7 for linear probing
How Production Hash Tables Handle Collisions
You rarely implement collision handling from scratch. Understanding what the standard library uses explains the performance you observe.
Java HashMap (Java 8+):
Strategy: Separate chaining
Data structure in each bucket: linked list for chains ≤ 8 nodes,
red-black tree for chains > 8 nodes
Why tree at 8?: O(log n) per lookup even in worst case — prevents
denial-of-service attacks from adversarial keys
Load factor threshold: 0.75
Growth: doubles capacity on resize
Python dict (CPython 3.6+):
Strategy: Open addressing with pseudo-random probing
Probe sequence uses a combination of linear and perturbation
(not pure linear probing — avoids primary clustering)
Load factor threshold: ~0.66
Growth: ~4× for small dicts, 2× for larger
Compact representation: keys, values, and hash codes stored in
parallel arrays for cache efficiency
C++ unordered_map:
Strategy: Separate chaining (implementation-defined, but all major
compilers use chaining)
Load factor threshold: 1.0 by default (configurable via max_load_factor())
Note: C++ unordered_map is known to be slower in practice than
custom open-addressing implementations due to cache misses
JavaScript Map (V8 engine):
Strategy: Implementation-specific — V8 uses hash tables with
open addressing internally for its hidden classes and maps
Load factor management is automatic and opaque to the user
Interview Questions
Q: What is a hash collision and how does chaining resolve it?
A collision occurs when two different keys produce the same array index after applying the hash function and modulo. Separate chaining resolves it by allowing multiple entries to coexist at the same slot. Each slot holds a linked list (or similar structure). On insert, the new entry is prepended to the list at the colliding slot. On lookup, the entire list at that slot is scanned linearly. The average chain length equals the load factor α = n/capacity, so O(1 + α) per operation. With α < 0.75, chains stay short and performance remains near O(1).
Q: Why can't you simply set a slot to null when deleting in open addressing?
In open addressing, lookup stops probing when it hits a null slot, interpreting null as "this key does not exist." If an entry was deleted by setting its slot to null, any entry that was pushed past the deleted slot during a previous collision would become unreachable. The probe sequence that originally bypassed that slot would now stop at the null before reaching the target. The fix is to use a tombstone marker — a special sentinel that tells the probe "keep going, something was here but was deleted."
Q: What is primary clustering in linear probing and why does it hurt performance?
Primary clustering occurs when adjacent occupied slots form long contiguous runs. Any key that hashes to any position in the run gets appended to the end of the run during insertion. This means the run grows larger with every collision, attracting more keys exponentially. Lookup cost for keys in or near the cluster increases linearly with cluster length. Quadratic probing reduces primary clustering by spreading probes quadratically, and double hashing nearly eliminates it by using a per-key step size.
Q: Why does Java 8 use a tree instead of a list in long chains?
With a linked list, a bucket with n entries has O(n) lookup. An adversary who knows the hash function can force all keys to the same bucket — turning O(1) into O(n) for every operation. By switching to a red-black tree when a chain exceeds 8 entries, Java caps the worst-case per-bucket lookup at O(log n). This prevents hash-flooding denial-of-service attacks without sacrificing O(1) average-case performance for normal inputs.
FAQs
Which is better in practice — chaining or open addressing?
For general-purpose use, both perform comparably at low load factors. Open addressing (specifically linear probing or Robin Hood hashing) tends to be faster in practice for read-heavy workloads because all data is stored inline in a single array — better cache locality means faster memory access. Chaining is better when the load factor exceeds 0.7, when deletion is frequent (no tombstone accumulation), or when entries vary wildly in size. Most modern high-performance hash tables (Google's absl::flat_hash_map, Rust's HashMap) use open addressing with quadratic or Robin Hood probing.
What is Robin Hood hashing?
Robin Hood hashing is a variant of linear probing that steals slots from "rich" entries (those that are close to their ideal slot) and gives them to "poor" entries (those far from their ideal slot). During insertion, if a new entry's probe distance exceeds the current occupant's probe distance, they swap — the new entry takes the slot and the displaced entry continues probing. This minimizes variance in probe distances and gives much better worst-case performance than standard linear probing.
Does the load factor apply equally to chaining and open addressing?
No. Chaining works at any load factor — even α > 1.0 (more entries than slots). The cost grows linearly as α increases, but the table never "runs out of space." Open addressing requires α < 1.0 — the table is physically full at α = 1.0. In practice, open addressing tables resize much earlier (at α = 0.5 to 0.7) to prevent the severe performance degradation that occurs near full capacity. This is why open addressing typically wastes more memory per entry than chaining.
Quick Quiz
Question 1: With linear probing and table size 7, keys "A" and "B" both hash to slot 3. "A" is inserted first. Where does "B" end up?
- ›A) Slot 3 — both can share the slot
- ›B) Slot 4 — linear probe finds the next empty slot
- ›C) Slot 0 — wraps around from the end
- ›D) The insert fails — slot 3 is taken
Answer: B) Slot 4. Linear probing tries (h(k) + i) % capacity for i = 0, 1, 2, ... Slot 3 is occupied by "A", so probe tries slot (3 + 1) % 7 = 4. If slot 4 is empty, "B" is stored there. Open addressing does not allow sharing slots — each slot holds at most one entry.
Question 2: You delete an entry from a linear probing table by setting its slot to null. What goes wrong?
- ›A) Nothing — null is a valid deletion marker
- ›B) The slot is permanently unusable
- ›C) Entries inserted past this slot during a prior collision become unreachable
- ›D) The hash function stops working for that index
Answer: C) Entries inserted past this slot become unreachable. Lookup stops at null, interpreting it as "key not present." If a colliding entry was probed past this slot and stored at a later position, the lookup terminates before reaching it. The fix is a tombstone marker that signals "deleted but keep probing."
Question 3: Java 8 switches from a linked list to a red-black tree in a hash bucket when the chain length exceeds:
- ›A) 4 entries
- ›B) 8 entries
- ›C) 16 entries
- ›D) When load factor exceeds 0.75
Answer: B) 8 entries. Java switches to a red-black tree at 8 entries per bucket, capping worst-case lookup at O(log n) instead of O(n). This prevents hash-flooding attacks where an adversary forces all keys to the same bucket. The switch back from tree to list occurs at 6 entries (hysteresis prevents thrashing).
Question 4: A chaining hash table has capacity 10, load factor α = 0.9. What is the expected number of comparisons to find a key?
- ›A) O(1) — load factor does not affect chaining
- ›B) O(0.9) — directly proportional to load factor
- ›C) O(1 + α) = O(1.9) — one hash computation plus expected chain length
- ›D) O(n) — near-full table means full scan
Answer: C) O(1 + α) = O(1.9). The expected chain length in chaining equals α (by the simple uniform hashing assumption). Lookup costs one hash computation plus scanning the chain — O(1 + α). At α = 0.9, that is 1.9 comparisons on average. Chaining degrades gracefully as α increases, unlike open addressing which degrades catastrophically near α = 1.0.
Summary
Every hash table collision is resolved by one of two fundamental strategies: separate chaining (each slot holds a list) or open addressing (probe for the next empty slot).
The key ideas to carry forward:
- ›Collisions are guaranteed — the birthday paradox ensures them well before the table is full
- ›Chaining: each slot is a linked list — simple deletion, works at any load factor, but cache-unfriendly and uses extra memory for pointers
- ›Open addressing: all entries inline in the array — cache-friendly and memory-efficient, but requires load factor < 1.0 and tombstone deletion
- ›Tombstone: the correct way to delete in open addressing — null termination breaks probe chains
- ›Primary clustering (linear probing's weakness): long runs of occupied slots form and grow, degrading lookup performance
- ›Quadratic probing and double hashing reduce clustering at the cost of implementation complexity
- ›Java HashMap uses chaining with red-black trees for long chains (> 8) to cap worst-case performance at O(log n)
- ›Python dict uses open addressing with pseudo-random probing and a ~0.66 load factor threshold
- ›Keep load factor below 0.75 (chaining) or 0.7 (open addressing) for O(1) average performance
In the next topic, you will explore Frequency Counting — using hash maps to count occurrences, build histograms, and solve the broad class of problems that require knowing how many times each element appears.