Consistent Hashing

To scale incrementally, Distributed Databases need a mechanism to dynamically partition data over a set of nodes. Consistent Hashing is one such mechanism: it distributes load across several nodes.


"Regular" hashing

  • we need to assign M data keys to N servers
  • assign each key k to server number hash(k) mod N

What happens if we increase the number of servers from N to 2N? On average, half of the existing keys will have to be remapped, and for a change like N to N+1 almost all of them will.
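A quick way to see the problem (a minimal Python sketch; the key set, the MD5 hash, and the server counts are made up for illustration):

    import hashlib

    def server_for(key, n_servers):
        # hash the key and take the result modulo the number of servers
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % n_servers

    keys = ["key-%d" % i for i in range(1000)]
    before = {k: server_for(k, 4) for k in keys}
    after = {k: server_for(k, 5) for k in keys}

    moved = sum(1 for k in keys if before[k] != after[k])
    print(moved, "of", len(keys), "keys remapped")  # typically ~80% for 4 -> 5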


Consistent Hashing approach

  • In consistent hashing the output range of the hash function is treated as a ring: the largest hash value wraps around to the smallest
  • The ring is divided into N regions (N = the number of servers)
  • Each server has its own key region (its "position" on the ring)
  • adding or removing a node affects only its direct neighbors
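
A minimal sketch of placing servers on the ring (Python; MD5 and the 2^32 ring size are assumptions, any hash function and range would do):

    import hashlib

    RING_SIZE = 2 ** 32  # hash values wrap around at this point

    def ring_position(name):
        # map a server name (or a data key) to a point on the ring
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return h % RING_SIZE

    servers = ["server-1", "server-2", "server-3"]
    ring = sorted((ring_position(s), s) for s in servers)
    # each server owns the arc from its predecessor's position up to its own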


consistent-hashing-1.png


For example:

  • the key region for the 2nd server is the area between server 1 and server 2
  • and only server 2 is responsible for keys in that region

Suppose we want to add a new server (demonstrated in the sketch after the routing list below):

  • we pick some region
  • divide it into two parts
  • assign one of the two parts to the new server
  • the keys that fall into that part are moved to the new server


consistent-hashing-2.png


So routing is simple in this scheme:

  • each server knows the key range it manages
  • so we can route a request to the server closest to the key on the ring (the first one clockwise)
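
Putting it together, a hedged sketch of a ring that supports routing and adding a node (Python; bisect does the binary search over sorted positions, and a key belongs to the first server clockwise from it):

    import bisect
    import hashlib

    RING_SIZE = 2 ** 32

    def ring_position(name):
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return h % RING_SIZE

    class HashRing:
        def __init__(self, servers):
            self.ring = sorted((ring_position(s), s) for s in servers)

        def route(self, key):
            # first server clockwise from the key's position,
            # wrapping around to the start of the ring if needed
            positions = [p for p, _ in self.ring]
            i = bisect.bisect_right(positions, ring_position(key)) % len(self.ring)
            return self.ring[i][1]

        def add(self, server):
            # splits one existing region: only the keys in the arc
            # between the new server and its predecessor move to it
            bisect.insort(self.ring, (ring_position(server), server))

    ring = HashRing(["server-1", "server-2", "server-3"])
    print(ring.route("user:42"))   # routed to whichever server owns that arc
    ring.add("server-4")           # only keys in one arc are affected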


Virtual Nodes

There are some challenges with this basic approach:

  • random position assignment may lead to non-uniform data and load distribution
  • homogeneous performance is assumed (that is, we assume all the servers have the same capacity), which real clusters rarely have

A variant of the Consistent Hashing algorithm addresses these issues:

  • instead of mapping a node to a single point on the ring,
  • each node gets multiple points there
  • so each physical node is represented by several virtual nodes

A virtual node looks like a single node on the ring, but it refers back to a real physical node.


Advantages

  • if a node becomes unavailable, its load is distributed uniformly across the remaining nodes (not just to its closest neighbor)
  • when a new node is added, it takes a roughly equal amount of load from each of the existing nodes
  • the number of virtual nodes can be chosen based on the capacity of a node (more capable nodes get more virtual nodes)
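
A sketch of the virtual-node variant (Python; the per-server vnode count and the "name#i" labeling are illustrative assumptions):

    import bisect
    import hashlib

    RING_SIZE = 2 ** 32

    def ring_position(name):
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return h % RING_SIZE

    class VirtualNodeRing:
        def __init__(self, servers, vnodes=100):
            # place each physical server at many points on the ring;
            # a more capable server could be given a larger vnode count
            self.ring = sorted(
                (ring_position("%s#%d" % (s, i)), s)
                for s in servers
                for i in range(vnodes)
            )

        def route(self, key):
            positions = [p for p, _ in self.ring]
            i = bisect.bisect_right(positions, ring_position(key)) % len(self.ring)
            return self.ring[i][1]

    ring = VirtualNodeRing(["server-1", "server-2", "server-3"])
    print(ring.route("user:42"))

With many virtual nodes per server, removing one server hands each of its arcs to a different successor, which is why the load spreads across all remaining nodes rather than landing on a single neighbor.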


Sources