Optimistic Cuckoo Hashing for concurrent, read-intensive applications

Our FAWN team has been spending a lot of time looking at memory-efficient data and algorithms structures for various things (with a lot of emphasis on key-value stores, as per our first and second papers on FAWN-KV and SILT, respectively).

Coming up at NSDI'13, Bin Fan has a new paper in which we substantially improve the throughput of the memcached distributed DRAM cache.  One of the core techniques we use is a new multiple-reader, single-writer concurrent variant of Cuckoo hashing that we call optimistic cuckoo hashing.  It combines the refinement of a technique we introduced in SILT ("partial-key cuckoo hashing"), a new way of moving items during cuckoo insertion, and an optimistic variant of lock-striping to create a hash table that is extremely compact and supports extremely high read throughput, while still allowing one thread to update it at high speed (about 2M updates/second in our tests).

We've released the code on github to this and you should check it out for your concurrent hash table needs.  The usual caveats apply - it's a very basic, fixed-size C implementation.  We're still polishing it up, adding auto-resizing and C++ friendliness, etc.

Benefits and trade-offs of optimistic cuckoo hashing

Benefits:
  • Fast concurrent read throughput (no mutex required)
  • Decent write throughput (each update requires acquiring a mutex)
  • Very memory efficient, particularly for small key/value pairs
  • Predictable and fast read performance:  Every read takes exactly two memory references.
Limitations:
  • No dynamic resizing of the hash table.  If it fills up, you have to destroy it, create a new, bigger one, and re-populate it.
  • Slower than some other techniques for write-heavy (>50%) workloads.  Construction with a single thread is about 1/2 the speed of, e.g., chaining-based approaches that can simply stuff a new entry on the head of a list.
  • I suspect there may be some high-contention write workloads for which other techniques might be better.  I'd be curious to know what they are.
We've successfully applied this to speeding up and improving the memory use of memcached, and are looking at it for things like forwarding tables (see poster at NSDI), all of which are read-heavy and the max size of the hash table is easy to know in advance.

Background to understand the algorithm:  Cuckoo Hashing

Cuckoo Hashing [Pagh & Rodler 2001] is a very nice O(1) hashing scheme that, with suitable tweaking, can achieve very high table utilization.  To understand what I mean by that, consider two conventional approaches to hashing:

Linear Probing: Puts entries in a large array.  The slot for item x is determined by computing s = h(x).  If array[s] is full, check array[s+1], and so on, until you find an open slot.  To search for an item, check at h(key), and keep moving forward until either you find an item whose key matches what you're looking for, or you find an empty slot.  In this scheme, the table is limited to about 50% occupancy, or else both inserts and queries will take unacceptably long.


Chaining:  The slot for item x is s = h(x).  Each slot is a linked list of items.  When multiple items hash to the same slot, add them to the list.  In this scheme, slot occupancy can be high, but:  (a)  There's overhead from storing the linked list pointers; and (b) the time to search for an item can be longer when several items hash to the same slot.



There are several more "modern" approaches that avoid the drawbacks of these schemes.  Cuckoo hashing is one of them.  In cuckoo hashing, every object can hash to k different slots, determined by k different hash functions.  A typical refinement is that every slot consists of b buckets (each slot can contain up to b different items).  One popular choice, and that which we typically use in our applications, is "2,4 cuckoo hashing" (k=2 slots, b=4 buckets).  This works particularly well when you can arrange your data layout such that 4 buckets fit exactly in one cache line, for reasons we'll see in a second.

To search a cuckoo hash table, compute the two slots s1 = h1(key), s2=h2(key).  Examine every bucket in each of these two slots to see if it contains the item you want.  If it doesn't, return failure.  In loosey-goosey pseudocode:

find(key):
   foreach slot s in (s1, s2):
      foreach bucket b in s:
          if b.key == key:
              return true, b.value
   return false, nil

Inserting requires a little more work.  If there's an available bucket in one of the two slots, insert the new item there.  But if all of them are full, pick one of the existing items and kick it out to its own alternate bucket.  This is shown in the figure to the right, where an insert of x requires displacing item b, which in turn requires displacing item h.

This basic 2,4-cuckoo hash can achieve roughly 95% table occupancy before insertion will fail by taking too long (we use 500 displacements as a measure of "too long", as suggested by Michael Mitzenmacher).  However, it's not concurrent.

The only previous work we've been able to find on concurrent cuckoo hashing, by Shavit and Herlihy, (here's an implementation) supports concurrent writers, but it has two drawbacks for our purposes:  It gets less than 50% table occupancy, and it requires mutex operations even for read.  So, we came up with something that we like better for our memory-efficient, read-intensive workloads.

Optimistic Cuckoo Hashing

Let's fix three problems with the basic cuckoo scheme described above, in order:
  1. For variable-length keys, it has a lot of pointer dereferences that I didn't show;
  2. It's not concurrent
  3. Let's fix #2 without grabbing a mutex on read.

Handling variable length keys with partial-key cuckoo hashing

If the keys are variable length, you don't (usually) want to store them in the hashtable buckets themselves, because you have to size the buckets to the largest possible key they could store.  Instead, you take the standard computer science solution and add a level of indirection:  Store a pointer to the key instead.

With the 2,4-cuckoo, this means that looking for a particular key may have to dereference eight pointers.  While it can do those in parallel, that still stinks.  Further, we have to dereference the pointer when moving the contents of a bucket, because we have to find the key's other hash.  Partial-key cuckoo hashing fixes both of these problems by adding a small (e.g., 8 bit) hash, or tag, to the bucket, and lets that hash be used for cuckoo movement as well.  tag(key) = hash2(key)[bits 0...7].

Searching is straightforward:  For each bucket, check whether the tag matches the tag of the key you're looking for.  If and only if it does, dereference to compare the full key.  Using an 8-bit (one byte) tag, you only dereference incorrectly 1 in 256 times -- not bad.

To move an item using the tag, we compute:
   bucket1 = hash(key)
   bucket2 = hash(key) XOR hash(tag(key))

Because the XOR is reversible, to move the key to its alternate location (whichever slot it's in), you just xor it's current slot ID with the hash of the tag.  We re-hash here because the tag is only 8 bits, so rehashing lets the new location move globally throughout the hash table. We have a different paper under submission that shows empirically that this works well enough, and provides a bit of incomplete theoretical intuition for why.

The consequences are twofold:  Most searches never make unnecessary pointer dereferences;  and we can cuckoo entries around the table without pointer dereferences.  Being able to do this lets us manipulate individual entries atomically, which turns out to be very important for...

Supporting concurrent readers in cuckoo hashing

Supporting concurrent readers is really all about the writers:  If you never change the table, concurrent reading is easy...

step 1:  Only allow one writer, by forcing them to grab a global lock, or by allowing only one thread to do inserts.

But this isn't enough.  Two problems arise for reading while another thread can be writing to the table:
  1. The item you're looking for might be "in transit" - being cuckooed.
  2. The pointer to the key/value might point to the wrong thing after an update
To fix #1, we'll reverse the way we cuckoo.  This turns out to improve insertion performance in general.  Instead of picking an existing item up and moving it around until we find a place for it to land, we'll search forward first to find an empty location (and remember the series of displacements that it would take to get there), and then move the "hole" back to where we want to insert the item.

By doing this, instead of having a particular entry disappear from the hash table during motion, we instead duplicate it, letting the "hole" disappear briefly.  Which is perfectly OK - a reader will still find one of them.  It does have consequences if you want to iterate through every item in the hash table, but we're not doing that.

This has a second benefit:  You can easily do two such searches in parallel, using a little more memory bandwidth but taking less time overall, because you find a hole faster.  Using two parallel searches speeds up inserts into our table by about 20%.

Solving #2 requires a bit more work.  We need to coordinate between the writer and the readers, but we'd like to do so without using a lock.  Instead, we use an optimistic variant of the "lock-striping" technique.

Basic lock striping:  Instead of having one lock for every entry in the table, share a set of locks among all keys by taking hash(key)%N_LOCKS.  8192 "locks"  (ahem - we don't really use locks, as I'll explain below) works well in our experiments.  By doing this, we add very little space to the table for locks - 8192 for a hash table of millions.

Optimistic versioning using "counter striping":  We use the well-known optimistic technique of storing a version counter (striped, as above).  When the insert process wishes to modify a key, it first increments the counter.  When it has completed modifying the key, it increments it again.  This means that a key whose value is odd is "in progress".

All readers perform their operations by:

retry:
   counter_start = atomic_read(counter for key)
   if counter_start is odd goto retry;

   check bucket ... grab key/value data if it matches ...
   counter_end = atomic_read(counter for key)
   if (counter_end != counter_start) goto retry;

While this still uses atomic memory operations, it has a huge advantage over mutexes for read-mostly workloads:  In the common case, the counters are read-shared between all threads and are only infrequently modified/invalidated.

The bottom line

On a dual CPU Xeon L5640 (2.27GHz, 6 cores each), for very small entries in the hash table (enough to store a tag and a 64 bit pointer to where the real key/value pair would be stored), with 225 (about 10 million) entries, the table performs as:

(In this graph, chaining uses a global lock.)  Optimistic cuckoo hashing scales roughly to the same number of threads as 1/write %.  In other words, a 10% write workload should scale decently up to about 10 total threads, and achieve perhaps 20 million total requests per second.  A 99% read workload scales extremely well.  This design is particularly optimized for small key-value pairs where the hash table overhead itself is substantial;  It's probably less useful if your objects are 1KB.  The OCH is not the best choice for everything - the lack of auto-resizing is a big drawback compared to, e.g., hash_map.  But for read-intensive workloads where there's a size limit known in advance (or where you don't mind implementing re-sizing on your own), it handily outperforms things like TBB's concurrent_hash_map.

Popular posts from this blog

Stealing Google's Coding Practices for Academia

Reflecting on CS Graduate Admissions

Minting Money with Monero ... and CPU vector intrinsics