DSA Tutorial
🔍

Collision Handling

What Is a Collision and Why Is It Inevitable?

A collision occurs when two different keys produce the same index in a hash table. Given a table of size n and any hash function, the birthday paradox guarantees a collision appears after inserting roughly √n elements — not n elements. For a table of size 100, expect a collision after roughly 10 insertions.

Table size = 7, hash function: h(k) = k % 7

Insert 3:  h(3)  = 3 → slot 3 is empty → store at 3 ✓
Insert 10: h(10) = 3 → slot 3 is OCCUPIED → COLLISION!

Insert "alice":   h("alice")  = 4 → store at 4 ✓
Insert "charlie": h("charlie") = 4 → COLLISION!

Collisions cannot be prevented in a general-purpose hash table — only managed. The two foundational strategies are chaining (each slot holds a list) and open addressing (probe for the next empty slot). Every production hash table uses one of these two approaches.

Strategy 1: Separate Chaining

In separate chaining, each slot in the hash table array holds a pointer to a linked list (or another dynamic structure). When two keys collide, both entries go into the list at that slot.

Table with chaining (size 7):

Index  Chain
  0  → null
  1  → [("bob", 72)]
  2  → null
  3  → [("alice", 95)] → [("charlie", 61)]  ← two keys hashed to 3
  4  → [("dave", 88)]
  5  → null
  6  → null

Lookup "charlie":
  h("charlie") = 3
  Scan chain at 3: ("alice", 95) — not a match
  Next:           ("charlie", 61) — match! Return 61
  Cost: O(1) hash + O(k) chain scan where k = chain length

Chaining Implementation from Scratch

Java
1public class ChainingHashTable { 2 3 private static class Entry { 4 String key; 5 int value; 6 Entry next; 7 8 Entry(String key, int value) { 9 this.key = key; 10 this.value = value; 11 this.next = null; 12 } 13 } 14 15 private Entry[] table; 16 private int capacity; 17 private int size; 18 19 public ChainingHashTable(int capacity) { 20 this.capacity = capacity; 21 this.table = new Entry[capacity]; 22 this.size = 0; 23 } 24 25 private int hash(String key) { 26 int h = 0; 27 for (char c : key.toCharArray()) h = 31 * h + c; 28 return Math.abs(h % capacity); 29 } 30 31 // Insert or update — O(k) where k = chain length at slot 32 public void put(String key, int value) { 33 int idx = hash(key); 34 Entry curr = table[idx]; 35 36 // Search chain for existing key 37 while (curr != null) { 38 if (curr.key.equals(key)) { 39 curr.value = value; // Update existing 40 return; 41 } 42 curr = curr.next; 43 } 44 45 // Prepend new entry to chain — O(1) 46 Entry newEntry = new Entry(key, value); 47 newEntry.next = table[idx]; 48 table[idx] = newEntry; 49 size++; 50 } 51 52 // Lookup — O(k) where k = chain length 53 public Integer get(String key) { 54 int idx = hash(key); 55 Entry curr = table[idx]; 56 57 while (curr != null) { 58 if (curr.key.equals(key)) return curr.value; 59 curr = curr.next; 60 } 61 62 return null; // Not found 63 } 64 65 // Delete — O(k) where k = chain length 66 public boolean remove(String key) { 67 int idx = hash(key); 68 Entry curr = table[idx]; 69 Entry prev = null; 70 71 while (curr != null) { 72 if (curr.key.equals(key)) { 73 if (prev == null) table[idx] = curr.next; // Remove head 74 else prev.next = curr.next; // Remove middle/tail 75 size--; 76 return true; 77 } 78 prev = curr; 79 curr = curr.next; 80 } 81 82 return false; // Not found 83 } 84 85 public static void main(String[] args) { 86 ChainingHashTable ht = new ChainingHashTable(7); 87 88 ht.put("alice", 95); 89 ht.put("bob", 72); 90 ht.put("charlie", 61); 91 ht.put("dave", 88); 92 93 System.out.println("alice: " + ht.get("alice")); // 95 94 System.out.println("charlie: " + ht.get("charlie")); // 61 95 System.out.println("eve: " + ht.get("eve")); // null 96 97 ht.put("alice", 99); // Update 98 System.out.println("alice updated: " + ht.get("alice")); // 99 99 100 ht.remove("bob"); 101 System.out.println("bob after remove: " + ht.get("bob")); // null 102 } 103}
Output:
alice:   95
charlie: 61
eve:     null
alice updated: 99
bob after remove: null

Dry Run: Insert and Lookup with Chaining

ChainingHashTable, capacity=7, hash(key) = polynomial % 7

Insert ("alice", 95):
  h("alice") = 4
  table[4] is empty → prepend → table[4] = [("alice", 95)]

Insert ("charlie", 61):
  h("charlie") = 4   ← COLLISION with "alice"!
  table[4] is not empty → scan chain: "alice" ≠ "charlie"
  No match → prepend → table[4] = [("charlie", 61) → ("alice", 95)]

Lookup "alice":
  h("alice") = 4
  Scan table[4]:
    ("charlie", 61) → key "charlie" ≠ "alice" → continue
    ("alice", 95)   → key "alice" = "alice"   → return 95 ✓

Lookup "eve":
  h("eve") = 2
  table[2] is empty → return null

Chain length at slot 4: 2 entries
Expected chain length with load factor α: O(α) = O(n/capacity)

Strategy 2: Open Addressing

In open addressing, there are no linked lists. All entries are stored directly in the table array. When a collision occurs, the table is probed — a sequence of alternative slots is tried until an empty slot is found.

The probe sequence defines the "open addressing" variant.

Probe Sequence Types

Linear Probing:
  probe(i) = (h(key) + i) % capacity    (i = 0, 1, 2, 3, ...)
  Try slot h(key), then h(key)+1, then h(key)+2, ...
  Simple but causes PRIMARY CLUSTERING — long runs form

Quadratic Probing:
  probe(i) = (h(key) + i²) % capacity   (i = 0, 1, 2, 3, ...)
  Try slot h(key), then h(key)+1, then h(key)+4, then h(key)+9, ...
  Reduces primary clustering but causes SECONDARY CLUSTERING

Double Hashing:
  probe(i) = (h1(key) + i × h2(key)) % capacity
  Two independent hash functions — step size varies per key
  Best distribution, no clustering
  Requirement: h2(key) must never be 0 and must be coprime to capacity

Linear Probing Implementation

Java
1public class LinearProbingHashTable { 2 3 private static final String DELETED = "__DELETED__"; // Tombstone sentinel 4 5 private String[] keys; 6 private int[] values; 7 private int capacity; 8 private int size; 9 10 public LinearProbingHashTable(int capacity) { 11 this.capacity = capacity; 12 this.keys = new String[capacity]; 13 this.values = new int[capacity]; 14 this.size = 0; 15 } 16 17 private int hash(String key) { 18 int h = 0; 19 for (char c : key.toCharArray()) h = 31 * h + c; 20 return Math.abs(h % capacity); 21 } 22 23 // Insert — O(1) average with low load factor 24 public void put(String key, int value) { 25 int idx = hash(key); 26 27 // Probe forward until empty or same key 28 while (keys[idx] != null && !keys[idx].equals(DELETED)) { 29 if (keys[idx].equals(key)) { 30 values[idx] = value; // Update existing 31 return; 32 } 33 idx = (idx + 1) % capacity; // Linear probe: next slot 34 } 35 36 keys[idx] = key; 37 values[idx] = value; 38 size++; 39 } 40 41 // Lookup — O(1) average 42 public Integer get(String key) { 43 int idx = hash(key); 44 int start = idx; 45 46 while (keys[idx] != null) { 47 if (!keys[idx].equals(DELETED) && keys[idx].equals(key)) { 48 return values[idx]; 49 } 50 idx = (idx + 1) % capacity; 51 if (idx == start) break; // Full circle — not found 52 } 53 54 return null; 55 } 56 57 // Delete — must use tombstone, not null (would break probe chains) 58 public boolean remove(String key) { 59 int idx = hash(key); 60 int start = idx; 61 62 while (keys[idx] != null) { 63 if (!keys[idx].equals(DELETED) && keys[idx].equals(key)) { 64 keys[idx] = DELETED; // Tombstone — preserves probe chain 65 size--; 66 return true; 67 } 68 idx = (idx + 1) % capacity; 69 if (idx == start) break; 70 } 71 72 return false; 73 } 74 75 public static void main(String[] args) { 76 LinearProbingHashTable ht = new LinearProbingHashTable(11); 77 78 ht.put("alice", 95); 79 ht.put("bob", 72); 80 ht.put("charlie", 61); 81 ht.put("dave", 88); 82 83 System.out.println("alice: " + ht.get("alice")); // 95 84 System.out.println("charlie: " + ht.get("charlie")); // 61 85 System.out.println("eve: " + ht.get("eve")); // null 86 87 ht.remove("bob"); 88 System.out.println("bob removed: " + ht.get("bob")); // null 89 // alice/charlie/dave still findable — tombstone preserves probe chain 90 System.out.println("dave after bob removed: " + ht.get("dave")); // 88 91 } 92}
Output:
alice:   95
charlie: 61
eve:     null
bob removed: null
dave still found: 88

Dry Run: Linear Probing — Insert, Collision, Lookup

Table capacity = 7, h("alice") = 3, h("charlie") = 3

Step 1 — Insert ("alice", 95):
  idx = h("alice") = 3
  table[3] is EMPTY → store here
  table: [null, null, null, (alice,95), null, null, null]

Step 2 — Insert ("charlie", 61):
  idx = h("charlie") = 3
  table[3] is OCCUPIED by "alice" → COLLISION
  Probe: idx = (3 + 1) % 7 = 4
  table[4] is EMPTY → store here
  table: [null, null, null, (alice,95), (charlie,61), null, null]

Lookup "charlie":
  idx = h("charlie") = 3
  table[3].key = "alice" ≠ "charlie" → probe
  idx = 4
  table[4].key = "charlie" = "charlie" → return 61 ✓

Why null cannot be used as tombstone:
  Delete "alice" (at slot 3) — if we set table[3] = null:
    Lookup "charlie": idx=3 → table[3] is null → STOP
    Returns null — but "charlie" is at slot 4!
  The null terminates the probe early, hiding entries behind it.
  Tombstone (DELETED marker) allows probe to continue past deleted slots.

The Tombstone Problem — Why Deletion Is Tricky in Open Addressing

This is the most subtle aspect of open addressing. When you delete an entry, you cannot simply set the slot to null — doing so would break the probe chains of other entries.

Scenario showing the problem:

  h("A") = h("B") = h("C") = 2  (all collide at slot 2)

  After inserting A, B, C with linear probing:
    slot 2 → A
    slot 3 → B   (A occupied slot 2 — B probed to slot 3)
    slot 4 → C   (A and B occupied — C probed to slot 4)

  Delete B (slot 3) — set slot 3 to null:
    table: [null, null, A, null, C, null, null]

  Lookup C:
    idx = h("C") = 2
    table[2] = A ≠ C → probe
    idx = 3
    table[3] = null → STOP (probe terminates at null)
    Returns null — C is at slot 4 but is now unreachable!

  Correct fix — tombstone deletion:
    Set slot 3 to DELETED (not null)
    table: [null, null, A, DELETED, C, null, null]

  Lookup C now:
    idx = h("C") = 2
    table[2] = A ≠ C → probe
    idx = 3
    table[3] = DELETED → continue probing (not null, keep going)
    idx = 4
    table[4] = C = C → return value ✓

Chaining vs Open Addressing: Performance Comparison

The right strategy depends on load factor, cache behavior, and memory constraints.

CHAINING:

  Advantages:
  - Works well at any load factor (even > 1.0)
  - Deletion is simple — remove from linked list, no tombstones
  - Performance degrades gracefully as load increases

  Disadvantages:
  - Extra memory for linked list pointers (16+ bytes per node in Java)
  - Cache unfriendly — list nodes are scattered in memory
  - Pointer chasing on every chain traversal

  Performance:
  - Average lookup: O(1 + α)  where α = load factor = n/capacity
  - α = 0.75: expected 1.75 comparisons per lookup
  - α = 2.0:  expected 3 comparisons per lookup (still usable)

OPEN ADDRESSING (linear probing):

  Advantages:
  - No extra memory for pointers — all data in one array
  - Excellent cache performance — probing accesses adjacent memory
  - Faster in practice at low load factors (α < 0.7)

  Disadvantages:
  - Cannot exceed load factor of 1.0 (no room for another entry)
  - Primary clustering — long runs degrade performance rapidly
  - Deletion requires tombstones — accumulate over time

  Performance:
  - Average successful lookup:   O(1/(1-α)) with linear probing
  - α = 0.5: ~1.5 probes
  - α = 0.7: ~2.5 probes  ← acceptable
  - α = 0.9: ~5.5 probes  ← slow
  - α → 1.0: → infinity   ← table full, must resize

Side-by-Side Example: Same Keys, Same Collisions

Keys to insert: "alice"→h=2, "bob"→h=5, "carl"→h=2, "diana"→h=5
Table capacity: 7

CHAINING:
  slot 2 → [("carl",_) → ("alice",_)]   (both hash to 2)
  slot 5 → [("diana",_) → ("bob",_)]    (both hash to 5)
  Other slots: null
  Memory: 7 array slots + 4 linked list nodes

OPEN ADDRESSING (linear probing):
  slot 2 → ("alice",_)   (first to arrive)
  slot 3 → ("carl",_)    (probed from slot 2)
  slot 5 → ("bob",_)     (first to arrive)
  slot 6 → ("diana",_)   (probed from slot 5)
  Memory: 7 array slots, no extra nodes — all inline

Clustering: The Performance Enemy of Linear Probing

Primary clustering is the main weakness of linear probing. Once a long run of occupied slots forms, it attracts more keys — because any key that lands anywhere in the run gets appended to its end.

Initial:
  [_, _, _, A, B, C, D, _, E, _, _]
   0  1  2  3  4  5  6  7  8  9  10

Keys A,B,C,D,E form a cluster from slots 3-8.

Insert F where h(F)=3:
  Slot 3 occupied → probe 4 occupied → probe 5 occupied →
  probe 6 occupied → probe 7 EMPTY → store F at slot 7

  [_, _, _, A, B, C, D, F, E, _, _]  ← cluster now spans 3-8

Insert G where h(G)=6:
  Slot 6 occupied → probe 7 occupied → probe 8 occupied →
  probe 9 EMPTY → store G at slot 9

Primary clustering: the cluster grows at both ends, attracting more keys.
Lookup cost for any key in the cluster is O(cluster_length) in the worst case.

Solutions:
  Quadratic probing: spread probes further (i², reduces primary clustering)
  Double hashing: per-key step size (nearly eliminates clustering)
  Keep load factor below 0.7 for linear probing

How Production Hash Tables Handle Collisions

You rarely implement collision handling from scratch. Understanding what the standard library uses explains the performance you observe.

Java HashMap (Java 8+):
  Strategy: Separate chaining
  Data structure in each bucket: linked list for chains ≤ 8 nodes,
                                  red-black tree for chains > 8 nodes
  Why tree at 8?: O(log n) per lookup even in worst case — prevents
                  denial-of-service attacks from adversarial keys
  Load factor threshold: 0.75
  Growth: doubles capacity on resize

Python dict (CPython 3.6+):
  Strategy: Open addressing with pseudo-random probing
  Probe sequence uses a combination of linear and perturbation
  (not pure linear probing — avoids primary clustering)
  Load factor threshold: ~0.66
  Growth: ~4× for small dicts, 2× for larger
  Compact representation: keys, values, and hash codes stored in
  parallel arrays for cache efficiency

C++ unordered_map:
  Strategy: Separate chaining (implementation-defined, but all major
           compilers use chaining)
  Load factor threshold: 1.0 by default (configurable via max_load_factor())
  Note: C++ unordered_map is known to be slower in practice than
        custom open-addressing implementations due to cache misses

JavaScript Map (V8 engine):
  Strategy: Implementation-specific — V8 uses hash tables with
           open addressing internally for its hidden classes and maps
  Load factor management is automatic and opaque to the user

Interview Questions

Q: What is a hash collision and how does chaining resolve it?

A collision occurs when two different keys produce the same array index after applying the hash function and modulo. Separate chaining resolves it by allowing multiple entries to coexist at the same slot. Each slot holds a linked list (or similar structure). On insert, the new entry is prepended to the list at the colliding slot. On lookup, the entire list at that slot is scanned linearly. The average chain length equals the load factor α = n/capacity, so O(1 + α) per operation. With α < 0.75, chains stay short and performance remains near O(1).

Q: Why can't you simply set a slot to null when deleting in open addressing?

In open addressing, lookup stops probing when it hits a null slot, interpreting null as "this key does not exist." If an entry was deleted by setting its slot to null, any entry that was pushed past the deleted slot during a previous collision would become unreachable. The probe sequence that originally bypassed that slot would now stop at the null before reaching the target. The fix is to use a tombstone marker — a special sentinel that tells the probe "keep going, something was here but was deleted."

Q: What is primary clustering in linear probing and why does it hurt performance?

Primary clustering occurs when adjacent occupied slots form long contiguous runs. Any key that hashes to any position in the run gets appended to the end of the run during insertion. This means the run grows larger with every collision, attracting more keys exponentially. Lookup cost for keys in or near the cluster increases linearly with cluster length. Quadratic probing reduces primary clustering by spreading probes quadratically, and double hashing nearly eliminates it by using a per-key step size.

Q: Why does Java 8 use a tree instead of a list in long chains?

With a linked list, a bucket with n entries has O(n) lookup. An adversary who knows the hash function can force all keys to the same bucket — turning O(1) into O(n) for every operation. By switching to a red-black tree when a chain exceeds 8 entries, Java caps the worst-case per-bucket lookup at O(log n). This prevents hash-flooding denial-of-service attacks without sacrificing O(1) average-case performance for normal inputs.

FAQs

Which is better in practice — chaining or open addressing?

For general-purpose use, both perform comparably at low load factors. Open addressing (specifically linear probing or Robin Hood hashing) tends to be faster in practice for read-heavy workloads because all data is stored inline in a single array — better cache locality means faster memory access. Chaining is better when the load factor exceeds 0.7, when deletion is frequent (no tombstone accumulation), or when entries vary wildly in size. Most modern high-performance hash tables (Google's absl::flat_hash_map, Rust's HashMap) use open addressing with quadratic or Robin Hood probing.

What is Robin Hood hashing?

Robin Hood hashing is a variant of linear probing that steals slots from "rich" entries (those that are close to their ideal slot) and gives them to "poor" entries (those far from their ideal slot). During insertion, if a new entry's probe distance exceeds the current occupant's probe distance, they swap — the new entry takes the slot and the displaced entry continues probing. This minimizes variance in probe distances and gives much better worst-case performance than standard linear probing.

Does the load factor apply equally to chaining and open addressing?

No. Chaining works at any load factor — even α > 1.0 (more entries than slots). The cost grows linearly as α increases, but the table never "runs out of space." Open addressing requires α < 1.0 — the table is physically full at α = 1.0. In practice, open addressing tables resize much earlier (at α = 0.5 to 0.7) to prevent the severe performance degradation that occurs near full capacity. This is why open addressing typically wastes more memory per entry than chaining.

Quick Quiz

Question 1: With linear probing and table size 7, keys "A" and "B" both hash to slot 3. "A" is inserted first. Where does "B" end up?

  • A) Slot 3 — both can share the slot
  • B) Slot 4 — linear probe finds the next empty slot
  • C) Slot 0 — wraps around from the end
  • D) The insert fails — slot 3 is taken

Answer: B) Slot 4. Linear probing tries (h(k) + i) % capacity for i = 0, 1, 2, ... Slot 3 is occupied by "A", so probe tries slot (3 + 1) % 7 = 4. If slot 4 is empty, "B" is stored there. Open addressing does not allow sharing slots — each slot holds at most one entry.

Question 2: You delete an entry from a linear probing table by setting its slot to null. What goes wrong?

  • A) Nothing — null is a valid deletion marker
  • B) The slot is permanently unusable
  • C) Entries inserted past this slot during a prior collision become unreachable
  • D) The hash function stops working for that index

Answer: C) Entries inserted past this slot become unreachable. Lookup stops at null, interpreting it as "key not present." If a colliding entry was probed past this slot and stored at a later position, the lookup terminates before reaching it. The fix is a tombstone marker that signals "deleted but keep probing."

Question 3: Java 8 switches from a linked list to a red-black tree in a hash bucket when the chain length exceeds:

  • A) 4 entries
  • B) 8 entries
  • C) 16 entries
  • D) When load factor exceeds 0.75

Answer: B) 8 entries. Java switches to a red-black tree at 8 entries per bucket, capping worst-case lookup at O(log n) instead of O(n). This prevents hash-flooding attacks where an adversary forces all keys to the same bucket. The switch back from tree to list occurs at 6 entries (hysteresis prevents thrashing).

Question 4: A chaining hash table has capacity 10, load factor α = 0.9. What is the expected number of comparisons to find a key?

  • A) O(1) — load factor does not affect chaining
  • B) O(0.9) — directly proportional to load factor
  • C) O(1 + α) = O(1.9) — one hash computation plus expected chain length
  • D) O(n) — near-full table means full scan

Answer: C) O(1 + α) = O(1.9). The expected chain length in chaining equals α (by the simple uniform hashing assumption). Lookup costs one hash computation plus scanning the chain — O(1 + α). At α = 0.9, that is 1.9 comparisons on average. Chaining degrades gracefully as α increases, unlike open addressing which degrades catastrophically near α = 1.0.

Summary

Every hash table collision is resolved by one of two fundamental strategies: separate chaining (each slot holds a list) or open addressing (probe for the next empty slot).

The key ideas to carry forward:

  • Collisions are guaranteed — the birthday paradox ensures them well before the table is full
  • Chaining: each slot is a linked list — simple deletion, works at any load factor, but cache-unfriendly and uses extra memory for pointers
  • Open addressing: all entries inline in the array — cache-friendly and memory-efficient, but requires load factor < 1.0 and tombstone deletion
  • Tombstone: the correct way to delete in open addressing — null termination breaks probe chains
  • Primary clustering (linear probing's weakness): long runs of occupied slots form and grow, degrading lookup performance
  • Quadratic probing and double hashing reduce clustering at the cost of implementation complexity
  • Java HashMap uses chaining with red-black trees for long chains (> 8) to cap worst-case performance at O(log n)
  • Python dict uses open addressing with pseudo-random probing and a ~0.66 load factor threshold
  • Keep load factor below 0.75 (chaining) or 0.7 (open addressing) for O(1) average performance

In the next topic, you will explore Frequency Counting — using hash maps to count occurrences, build histograms, and solve the broad class of problems that require knowing how many times each element appears.

Collision Handling