Hash Functions and Collisions¶
Hash Functions¶
A hash function is the heart of any hash table. It takes a key as input and produces an integer (hash code) that determines where the key-value pair should be stored in the underlying array.
Properties of Good Hash Functions¶
1. Deterministic¶
- Same input always produces same output
- Essential for consistent lookups
2. Uniform Distribution¶
- Keys should be spread evenly across the hash table
- Minimizes clustering and collisions
3. Fast Computation¶
- Should be O(1) time complexity
- Avoid expensive operations
4. Avalanche Effect¶
- Small changes in input cause large changes in output
- Helps distribute similar keys
Common Hash Function Techniques¶
1. Division Method¶
def hash_function(key, table_size):
return key % table_size
2. Multiplication Method¶
def hash_function(key, table_size):
A = 0.6180339887 # Golden ratio - 1
return int(table_size * ((key * A) % 1))
3. String Hashing¶
def hash_string(s, table_size):
hash_value = 0
for char in s:
hash_value = (hash_value * 31 + ord(char)) % table_size
return hash_value
Collisions¶
A collision occurs when two different keys hash to the same index. Since hash tables have finite size, collisions are inevitable (pigeonhole principle). When a collision occurs, we are faced with a dilemma - where do we store the new key-value pair since that index is already occupied?
Here are two common strategies to handle such scenarios:
-
Chaining: In this method, each index (orΒ
bucket) in the array hosts a linked list of all key-value pairs that hash to the same index. When a collision occurs, we simply go to the collided index and append the new key-value pair to the existing linked list. -
Open Addressing: Upon encountering a collision, the hash table searches for another free slot or index in the table (possibly the next available empty slot) and assigns that location to the new key-value pair. This approach requires a suitable probing strategy to ensure efficient use of table space.
Collision Resolution Strategies¶
1. Separate Chaining (Open Hashing)¶
Store multiple key-value pairs at each array index using a secondary data structure.
class HashTableChaining:
def __init__(self, size):
self.size = size
self.table = [[] for _ in range(size)]
def _hash(self, key):
return hash(key) % self.size
def insert(self, key, value):
index = self._hash(key)
bucket = self.table[index]
# Check if key already exists
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value) # Update existing
return
bucket.append((key, value)) # Add new
def get(self, key):
index = self._hash(key)
bucket = self.table[index]
for k, v in bucket:
if k == key:
return v
raise KeyError(key)
Advantages: - Simple to implement - Never runs out of space (can always add to chain) - Good performance with good hash function
Disadvantages: - Extra memory overhead for pointers/lists - Cache performance can be poor - Worst case: O(n) if all keys hash to same bucket
2. Open Addressing (Closed Hashing)¶
Store all key-value pairs directly in the hash table array. When collision occurs, probe for next available slot.
Linear Probing¶
class HashTableLinearProbing:
def __init__(self, size):
self.size = size
self.keys = [None] * size
self.values = [None] * size
def _hash(self, key):
return hash(key) % self.size
def insert(self, key, value):
index = self._hash(key)
# Linear probing
while self.keys[index] is not None:
if self.keys[index] == key:
self.values[index] = value # Update existing
return
index = (index + 1) % self.size
# Insert at empty slot
self.keys[index] = key
self.values[index] = value
def get(self, key):
index = self._hash(key)
# Linear probing for search
while self.keys[index] is not None:
if self.keys[index] == key:
return self.values[index]
index = (index + 1) % self.size
raise KeyError(key)
Advantages: - Better cache performance (data locality) - No extra memory for pointers - Simple implementation
Disadvantages: - Primary clustering (consecutive occupied slots) - Requires careful deletion handling - Performance degrades as load factor increases
Quadratic Probing¶
def quadratic_probe(self, key):
index = self._hash(key)
i = 0
while self.keys[index] is not None:
if self.keys[index] == key:
return index
i += 1
index = (self._hash(key) + i*i) % self.size
return index
Advantages: - Reduces primary clustering - Better distribution than linear probing
Disadvantages: - Secondary clustering - May not probe all slots - More complex deletion
Double Hashing¶
def double_hash_probe(self, key):
hash1 = hash(key) % self.size
hash2 = 7 - (hash(key) % 7) # Second hash function
index = hash1
i = 0
while self.keys[index] is not None:
if self.keys[index] == key:
return index
i += 1
index = (hash1 + i * hash2) % self.size
return index
Advantages: - Eliminates clustering - Good distribution with proper second hash function
Disadvantages: - More complex implementation - Requires two hash functions
Load Factor and Resizing¶
Load Factor (Ξ±)¶
Ξ± = n / m
where n = number of elements, m = table size
Performance Impact¶
- Separate Chaining: Performance degrades linearly with load factor
- Open Addressing: Performance degrades exponentially after Ξ± > 0.7
Dynamic Resizing¶
def resize(self):
old_keys = self.keys
old_values = self.values
# Double the size
self.size *= 2
self.keys = [None] * self.size
self.values = [None] * self.size
# Rehash all existing elements
for i in range(len(old_keys)):
if old_keys[i] is not None:
self.insert(old_keys[i], old_values[i])
Python's Hash Implementation¶
Python dict¶
- Uses open addressing with random probing
- Maintains insertion order (since Python 3.7)
- Automatically resizes when load factor exceeds 2/3
- Uses sophisticated hash functions for different types
Python set¶
- Similar to dict but stores only keys
- Uses dummy values internally
- Same collision resolution strategy
Common Hash Function Pitfalls¶
- Poor Distribution: Using simple modulo with non-prime table sizes
- Predictable Patterns: Hash functions that don't handle similar keys well
- Integer Overflow: Not handling large hash values properly
- Security Issues: Hash functions vulnerable to collision attacks
Best Practices¶
- Choose Prime Table Sizes: Better distribution with division method
- Monitor Load Factor: Resize before performance degrades
- Use Quality Hash Functions: Consider built-in functions for complex types
- Handle Collisions Appropriately: Choose strategy based on use case
- Consider Security: Use cryptographic hash functions for security-critical applications
Next Steps¶
- Hash Tables Overview - Basic concepts
- Python Dictionaries - Implementation details
- Hash Table Problems - Practice problems