Coming up at ALENEX 2013

Coming up at ALENEX 2013, in which Dave discovers that perfect hashing and friends can be a really intellectually engaging problem. This is the algorithm engineering paper about the most memory-efficient index we developed for SILT, with more analysis and depth about the index itself vs. focusing on the big system.
Practical Batch-Updatable External Hashing with Sorting
Hyeontaek Lim, David G. Andersen, Michael Kaminsky
The key thing about this paper is that we've achieved the best-yet overhead for static external perfect hashing. That's a mouthful: Concretely, we use 2.5 bits of DRAM to tell you exactly where to look on your SSD to find a stored key-value pair. The data is stored densely on the SSD, wasting no space on the storage. The prior best approaches were Botelho's EPH scheme, which used about 3.8 bits per key, and the "EM" variant of EPH, which used about 3.1--3.3 bits/key. As a systems geek, I also really like that Hyeontaek's algorithm uses only sorting as its data manipulation primitive:
  1. It means you can use an off-the-shelf, highly engineered sort program such as nsort. In fact, we modified EPH to use nsort and made it run faster, too.
  2. If necessary, you can drop the in-memory index completely and binary search the on-disk representation. This enables you to degrade performance gracefully if your data grows beyond what you have memory capacity for -- and gives a nice, easily-implemented, high-confidence way of verifying that the index is working properly.

But the rest of the details are in the paper. What I also find amusing about this is the backstory. We started the project a bit ignorant of the theory progress in this area recently, and so we ended up inventing a different scheme of our own. When I was in grad school, Hari once mentioned that sometimes he deliberately wouldn't do a related work scan until a bit after he'd spent time thinking about the problem, and to me, this was a good example of how that can make sense: I don't think we'd have stumbled upon the compressed index method we used if we'd started out by trying to improve upon the technique used by Bothelo or others.
The source code to the indexing is released as part of the source code to the SILT system, but we're happy to release it individually if that makes it easier for people to run with it.
Hopefully coming up in a few months: Our scheme still uses more memory than the best internal minimal perfect hashing function, the CHD algorithm. The difference is that for internal, they can read a few keys as part of the lookup (they're stored in the fast memory), whereas in external, access to the key itself is slow. We have a few ideas for how to make progress on this problem, although I think that if we do beat them, it will be by a small fraction of a bit per key.


Popular posts from this blog

Reflecting on CS Graduate Admissions

Minting Money with Monero ... and CPU vector intrinsics

Finding Bugs in TensorFlow with LibFuzzer