ML Wiki

Extensible Hashing

Main ideas:

Variables we use:

suppose $i = 1$
for key $k:$ $h(k) = 0010$ and first $i$ bits are 0: look it up in the directory
follow the pointer
there is enough room so add in there

suppose $i = 1$
given hash $h(k) = 1010$ we extract first 1 bit which is 1
we look 1 up in the directory and follow the pointer
but there's no room in this block
- so we split it into 2 parts
- keep ones that start with 10 in one
- and move ones that start with 11 to another
- that means that both blocks use 2 bits to assign a key to a bucket (and we indicate that value on top of the buckets)
but to address these new buckets now we need 2 bits, and still i=1
- i.e. $i$ in the directory becomes less than at least one $i$ from buckets
- that means we need to create a new directory
- if it wasn't the case, we just would re-wire pointers to the dict
note that bucket for (0) is still the same - so both 00 and 01 in the new directory point to this bucket
if now we insert 0000 and 0100, we will reorganize the first bucket, but will not rebuild the directory

So the rule is:

if $i$ for one of the bucket grows more than $i$ of the directory
we need to rebuild the prefix directory
all other blocks are kept untouched (otherwise we would have to re-organize the whole thing)

Just the opposite of #Insertion

Pros

Cons

another level of indirection. Not that bad if we can store the entire directory in memory, but it's not often the case
size of directory growth is exponential to i - quickly becomes the bottleneck and may introduce a lot of latency
- it fits into memory and then all of a sudden it doesn't

Linear Hashing is another alternative that can handle growing files better