Optimistic Cuckoo Hashing for concurrent, read-intensive applications
Our FAWN team has been spending a lot of time looking at memory-efficient data structures and algorithms for various things (with a lot of emphasis on key-value stores, as per our first and second papers on FAWN-KV and SILT, respectively).
Coming up at NSDI'13, Bin Fan has a new paper in which we substantially improve the throughput of the memcached distributed DRAM cache. One of the core techniques we use is a new multiple-reader, single-writer concurrent variant of Cuckoo hashing that we call optimistic cuckoo hashing. It combines a refinement of a technique we introduced in SILT ("partial-key cuckoo hashing"), a new way of moving items during cuckoo insertion, and an optimistic variant of lock-striping to create a hash table that is extremely compact and supports extremely high read throughput, while still allowing one thread to update it at high speed (about 2M updates/second in our tests).
We've released the code for this on GitHub, and you should check it out for your concurrent hash table needs. The usual caveats apply - it's a very basic, fixed-size C implementation. We're still polishing it up, adding auto-resizing and C++ friendliness, etc.
Benefits and trade-offs of optimistic cuckoo hashing
Benefits:
- Fast concurrent read throughput (no mutex required)
- Decent write throughput (each update requires acquiring a mutex)
- Very memory efficient, particularly for small key/value pairs
- Predictable and fast read performance: Every read takes exactly two memory references.
Trade-offs:
- No dynamic resizing of the hash table. If it fills up, you have to destroy it, create a new, bigger one, and re-populate it.
- Slower than some other techniques for write-heavy (>50%) workloads. Construction with a single thread is about 1/2 the speed of, e.g., chaining-based approaches that can simply stuff a new entry on the head of a list.
- I suspect there may be some high-contention write workloads for which other techniques might be better. I'd be curious to know what they are.
Background to understand the algorithm: Cuckoo Hashing
Cuckoo Hashing [Pagh & Rodler 2001] is a very nice O(1) hashing scheme that, with suitable tweaking, can achieve very high table utilization. To understand what I mean by that, consider two conventional approaches to hashing:
Linear Probing: Puts entries in a large array. The slot for item x is determined by computing s = h(x). If array[s] is full, check array[s+1], and so on, until you find an open slot. To search for an item, check at h(key), and keep moving forward until either you find an item whose key matches what you're looking for, or you find an empty slot. In this scheme, the table is limited to about 50% occupancy, or else both inserts and queries will take unacceptably long.
Chaining: The slot for item x is s = h(x). Each slot is a linked list of items. When multiple items hash to the same slot, add them to the list. In this scheme, slot occupancy can be high, but: (a) There's overhead from storing the linked list pointers; and (b) the time to search for an item can be longer when several items hash to the same slot.
There are several more "modern" approaches that avoid the drawbacks of these schemes. Cuckoo hashing is one of them. In cuckoo hashing, every object can hash to k different slots, determined by k different hash functions. A typical refinement is that every slot consists of b buckets (each slot can contain up to b different items). One popular choice, and that which we typically use in our applications, is "2,4 cuckoo hashing" (k=2 slots, b=4 buckets). This works particularly well when you can arrange your data layout such that 4 buckets fit exactly in one cache line, for reasons we'll see in a second.
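To make the cache-line point concrete, here's one way such a slot might be laid out in C. This is a sketch with made-up field names and sizes, not the layout of our released code; the aligned attribute is GCC/Clang syntax.

#include <stdint.h>

/* Illustrative "2,4" layout: each slot holds 4 buckets, and with 16-byte
 * buckets a slot is exactly one 64-byte cache line, so probing a slot
 * costs a single cache-line fetch. */
#define BUCKETS_PER_SLOT 4

typedef struct {
    uint64_t key;       /* or, as described below, a 1-byte tag plus a pointer */
    uint64_t value;
} bucket_t;             /* 16 bytes */

typedef struct {
    bucket_t buckets[BUCKETS_PER_SLOT];   /* 4 * 16 = 64 bytes */
} __attribute__((aligned(64))) slot_t;

_Static_assert(sizeof(slot_t) == 64, "one slot per cache line");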
To search a cuckoo hash table, compute the two slots s1 = h1(key), s2=h2(key). Examine every bucket in each of these two slots to see if it contains the item you want. If it doesn't, return failure. In loosey-goosey pseudocode:
find(key):
  foreach slot s in (s1, s2):
    foreach bucket b in s:
      if b.key == key:
        return true, b.value
  return false, nil
Inserting requires a little more work. If there's an available bucket in one of the two slots, insert the new item there. But if all of them are full, pick one of the existing items and kick it out to its alternate slot. This is shown in the figure to the right, where inserting x requires displacing item b, which in turn requires displacing item h.
This basic 2,4-cuckoo hash can achieve roughly 95% table occupancy before insertion will fail by taking too long (we use 500 displacements as a measure of "too long", as suggested by Michael Mitzenmacher). However, it's not concurrent.
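For concreteness, here's a rough sketch in C of that basic, non-concurrent insert loop with the 500-displacement cap. The struct names, hash functions, and random-victim policy are placeholders rather than the released implementation (which, as described later, also reorders the moves):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BUCKETS_PER_SLOT  4
#define MAX_DISPLACEMENTS 500   /* beyond this, declare the insert "too long" */

typedef struct { uint64_t key, value; bool used; } bucket_t;
typedef struct { bucket_t b[BUCKETS_PER_SLOT]; } slot_t;

/* Two independent hash functions (simple placeholder mixes). */
static uint64_t h1(uint64_t k) { return k * 0x9E3779B97F4A7C15ULL; }
static uint64_t h2(uint64_t k) { return (k ^ (k >> 31)) * 0xC2B2AE3D27D4EB4FULL; }

static bool try_place(slot_t *s, uint64_t key, uint64_t value) {
    for (int i = 0; i < BUCKETS_PER_SLOT; i++) {
        if (!s->b[i].used) {
            s->b[i] = (bucket_t){ key, value, true };
            return true;
        }
    }
    return false;
}

bool cuckoo_insert(slot_t *table, size_t num_slots, uint64_t key, uint64_t value) {
    for (int n = 0; n < MAX_DISPLACEMENTS; n++) {
        slot_t *s1 = &table[h1(key) % num_slots];
        slot_t *s2 = &table[h2(key) % num_slots];
        if (try_place(s1, key, value) || try_place(s2, key, value))
            return true;

        /* Both candidate slots are full: evict a randomly chosen victim,
         * take its bucket, then continue by re-inserting the victim. */
        slot_t  *s = (rand() & 1) ? s1 : s2;
        int      i = rand() % BUCKETS_PER_SLOT;
        bucket_t victim = s->b[i];
        s->b[i] = (bucket_t){ key, value, true };
        key   = victim.key;
        value = victim.value;
    }
    return false;   /* "too long": caller should rebuild a larger table */
}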
The only previous work we've been able to find on concurrent cuckoo hashing, by Shavit and Herlihy (here's an implementation), supports concurrent writers, but it has two drawbacks for our purposes: It gets less than 50% table occupancy, and it requires mutex operations even for reads. So, we came up with something that we like better for our memory-efficient, read-intensive workloads.
Optimistic Cuckoo Hashing
Let's fix three problems with the basic cuckoo scheme described above, in order:
1. For variable-length keys, it has a lot of pointer dereferences that I didn't show;
2. It's not concurrent;
3. Fixing #2 shouldn't require grabbing a mutex on read.
Handling variable length keys with partial-key cuckoo hashing
If the keys are variable length, you don't (usually) want to store them in the hashtable buckets themselves, because you have to size the buckets to the largest possible key they could store. Instead, you take the standard computer science solution and add a level of indirection: Store a pointer to the key instead.
With the 2,4-cuckoo, this means that looking for a particular key may have to dereference eight pointers. While it can do those in parallel, that still stinks. Further, we have to dereference the pointer when moving the contents of a bucket, because we have to find the key's other hash. Partial-key cuckoo hashing fixes both of these problems by adding a small (e.g., 8-bit) hash, or tag, to the bucket, and letting that hash be used for cuckoo movement as well: tag(key) = hash2(key)[bits 0...7].
Searching is straightforward: For each bucket, check whether the tag matches the tag of the key you're looking for. If and only if it does, dereference to compare the full key. Using an 8-bit (one byte) tag, you only dereference incorrectly 1 in 256 times -- not bad.
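In code, the per-bucket check might look like this; the bucket layout and names are illustrative assumptions, not the released code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t     tag;    /* tag(key) = low 8 bits of hash2(key)                     */
    const char *key;    /* pointer to the full, variable-length key; NULL = empty  */
    void       *value;
} bucket_t;

static bool bucket_matches(const bucket_t *b, uint8_t tag, const char *key) {
    if (b->key == NULL || b->tag != tag)
        return false;                    /* the common case: no dereference      */
    return strcmp(b->key, key) == 0;     /* ~1-in-256 tag collisions end up here */
}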
To move an item using the tag, we compute:
bucket1 = hash(key)
bucket2 = hash(key) XOR hash(tag(key))
Because the XOR is reversible, to move the key to its alternate location (whichever slot it's in), you just XOR its current slot ID with the hash of the tag. We re-hash here because the tag is only 8 bits, so rehashing lets the new location move globally throughout the hash table. We have a different paper under submission that shows empirically that this works well enough, and provides a bit of incomplete theoretical intuition for why.
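Here's a small self-contained sketch of those slot computations. The hash functions and table size are placeholders; the point is just that, because XOR is its own inverse, alt_slot() recovers either slot from the other one plus the tag alone:

#include <stdint.h>
#include <stdio.h>

#define NUM_SLOTS (1u << 20)    /* power of two, so we can mask */

static uint64_t hash1(uint64_t key) { return key * 0x9E3779B97F4A7C15ULL; }
static uint64_t hash2(uint64_t key) { return (key ^ (key >> 33)) * 0xC2B2AE3D27D4EB4FULL; }
static uint64_t hash_tag(uint8_t tag) { return (tag + 1ULL) * 0xFF51AFD7ED558CCDULL; }

static uint8_t  tag_of(uint64_t key) { return (uint8_t)(hash2(key) & 0xFF); }

static uint32_t slot1(uint64_t key) { return (uint32_t)(hash1(key) & (NUM_SLOTS - 1)); }
static uint32_t slot2(uint64_t key) {
    return (uint32_t)((slot1(key) ^ hash_tag(tag_of(key))) & (NUM_SLOTS - 1));
}

/* Given whichever slot an item currently occupies, plus its tag, find the other. */
static uint32_t alt_slot(uint32_t cur_slot, uint8_t tag) {
    return (uint32_t)((cur_slot ^ hash_tag(tag)) & (NUM_SLOTS - 1));
}

int main(void) {
    uint64_t key = 42;
    uint8_t  t   = tag_of(key);
    uint32_t s1 = slot1(key), s2 = slot2(key);
    /* alt_slot takes s1 to s2 and s2 back to s1 */
    printf("s1=%u s2=%u alt(s1)=%u alt(s2)=%u\n",
           (unsigned)s1, (unsigned)s2,
           (unsigned)alt_slot(s1, t), (unsigned)alt_slot(s2, t));
    return 0;
}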
The consequences are twofold: Most searches never make unnecessary pointer dereferences; and we can cuckoo entries around the table without pointer dereferences. Being able to do this lets us manipulate individual entries atomically, which turns out to be very important for...
Supporting concurrent readers in cuckoo hashing
Supporting concurrent readers is really all about the writers: If you never change the table, concurrent reading is easy...
Step 1: Only allow one writer, by forcing writers to grab a global lock, or by allowing only one thread to do inserts.
But this isn't enough. Two problems arise for reading while another thread can be writing to the table:
- The item you're looking for might be "in transit" - being cuckooed.
- The pointer to the key/value might point to the wrong thing after an update
We solve #1 by changing the order of the basic cuckoo insert: rather than kicking an item out and then looking for a new home for it (and so on), we first search for a path of displacements that ends in an empty bucket, and then move items backwards along that path, starting from the hole. By doing this, instead of having a particular entry disappear from the hash table during motion, we instead duplicate it, letting the "hole" disappear briefly. Which is perfectly OK - a reader will still find one of them. It does have consequences if you want to iterate through every item in the hash table, but we're not doing that.
This has a second benefit: You can easily do two such searches in parallel, using a little more memory bandwidth but taking less time overall, because you find a hole faster. Using two parallel searches speeds up inserts into our table by about 20%.
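As a conceptual sketch (with made-up names, not the released code), suppose the search phase has already produced a list of bucket locations ending in an empty one; executing the path backwards then looks like this:

#include <stdbool.h>
#include <stdint.h>

#define BUCKETS_PER_SLOT 4

typedef struct { uint64_t key, value; bool used; } bucket_t;
typedef struct { uint32_t slot; int bucket; } loc_t;

/* path[0..len-1]: path[len-1] is currently an empty bucket, and the item
 * sitting at path[i] hashes alternately to path[i+1].  Copying toward the
 * front duplicates each item before its old copy is retired, so a
 * concurrent reader always sees at least one copy.  (The single writer
 * also wraps each move in the version-counter increments described below.) */
static void execute_path_backwards(bucket_t table[][BUCKETS_PER_SLOT],
                                   const loc_t *path, int len) {
    for (int i = len - 2; i >= 0; i--) {
        bucket_t *from = &table[path[i].slot][path[i].bucket];
        bucket_t *to   = &table[path[i + 1].slot][path[i + 1].bucket];
        *to = *from;            /* duplicate into the empty bucket          */
        from->used = false;     /* then retire the old copy: the hole moves */
    }
    /* path[0] is now free for the item being inserted. */
}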
Solving #2 requires a bit more work. We need to coordinate between the writer and the readers, but we'd like to do so without using a lock. Instead, we use an optimistic variant of the "lock-striping" technique.
Basic lock striping: Instead of having one lock for every entry in the table, share a set of locks among all keys by taking hash(key) % N_LOCKS. 8192 "locks" (ahem - we don't really use locks, as I'll explain below) work well in our experiments. By doing this, we add very little space to the table for locks - 8192 of them for a hash table with millions of entries.
Optimistic versioning using "counter striping": We use the well-known optimistic technique of storing a version counter (striped, as above). When the insert process wishes to modify a key, it first increments the counter. When it has completed modifying the key, it increments it again. This means that a key whose counter value is odd is "in progress".
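A minimal sketch of the striped counters and the writer's paired increments, using plain C11 atomics (the names are illustrative, and the real code can tune the memory ordering more carefully than this):

#include <stdatomic.h>
#include <stdint.h>

#define N_COUNTERS 8192    /* 8192 stripes, as mentioned above */

static atomic_uint_fast64_t counters[N_COUNTERS];

static atomic_uint_fast64_t *counter_for(uint64_t key_hash) {
    return &counters[key_hash % N_COUNTERS];
}

/* Writer side (single writer assumed): bracket every modification of a
 * key's bucket with two increments, so an odd value means "in progress". */
static void write_begin(uint64_t key_hash) {
    atomic_fetch_add(counter_for(key_hash), 1);   /* counter becomes odd  */
}
static void write_end(uint64_t key_hash) {
    atomic_fetch_add(counter_for(key_hash), 1);   /* counter becomes even */
}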
All readers perform their operations by:
retry:
  counter_start = atomic_read(counter for key)
  if counter_start is odd, goto retry
  check bucket ... grab key/value data if it matches ...
  counter_end = atomic_read(counter for key)
  if counter_end != counter_start, goto retry
While this still uses atomic memory operations, it has a huge advantage over mutexes for read-mostly workloads: In the common case, the counters are read-shared between all threads and are only infrequently modified/invalidated.
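For concreteness, here's roughly what that reader loop looks like with C11 atomics. read_matching_bucket() is an assumed stand-in for the two-slot probe shown earlier (not part of any real API), and the sketch deliberately uses conservative sequentially consistent atomics; the bucket reads themselves also need care under a strict reading of the C memory model:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t key, value; } entry_copy_t;

/* Assumed helper: scans the key's two slots and copies out a matching entry. */
extern bool read_matching_bucket(uint64_t key, entry_copy_t *out);

bool optimistic_find(atomic_uint_fast64_t *ctr, uint64_t key, uint64_t *value_out) {
    for (;;) {
        uint64_t start = atomic_load(ctr);
        if (start & 1)
            continue;                 /* writer mid-update: retry            */
        entry_copy_t copy;
        bool found = read_matching_bucket(key, &copy);
        if (atomic_load(ctr) != start)
            continue;                 /* counter moved: our copy may be torn */
        if (found)
            *value_out = copy.value;
        return found;
    }
}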
The bottom line
On a dual CPU Xeon L5640 (2.27GHz, 6 cores each), for very small entries in the hash table (enough to store a tag and a 64-bit pointer to where the real key/value pair would be stored), with 2^25 (about 33 million) entries, the table performs as follows:
(In this graph, chaining uses a global lock.) Optimistic cuckoo hashing scales roughly up to a number of threads equal to 1/(write fraction). In other words, a 10% write workload should scale decently up to about 10 total threads, and achieve perhaps 20 million total requests per second. A 99% read workload scales extremely well. This design is particularly optimized for small key-value pairs where the hash table overhead itself is substantial; it's probably less useful if your objects are 1KB. The OCH is not the best choice for everything - the lack of auto-resizing is a big drawback compared to, e.g., hash_map. But for read-intensive workloads where there's a size limit known in advance (or where you don't mind implementing re-sizing on your own), it handily outperforms things like TBB's concurrent_hash_map.
Comments
When moving the "hole", don't you still need a barrier of some sort to ensure other threads see the writes in the way you expect?
Yes... kinda. The optimistic part has a barrier around the reads of the version counters. So as long as you do read_version, read_bucket, read_and_check_version before using the value you found in read_bucket, you're ok.
The slightly longer answer is that if the read_bucket contains a pointer, you have no guarantees from the hash table itself that the pointer still points to something valid afterwards. In MemC3, we guarantee this by virtue of Memcached's slab allocator (the pointer will always go to somewhere within the slab) and re-checking the version counter after we read the pointed-to data. We then avoid having to double-read the version counter by taking advantage of the atomicity of 64-bit writes under x86 to ensure that the pointer is always *a* valid pointer, if not to the item we thought it was. In other words:
check_version
dereference and read the dereferenced block but don't return data (safe because it's guaranteed by the surrounding app semantics to be a valid pointer)
check_version
do things that modify data or return data based upon your copy of the dereferenced data.
(p.s., snirk: When trying to reply, Blogger gave me the error "Input error: Memcache value is null for FormRestoration". :-)