NSDI Paper - SplitScreen Virus Scanning

(work-related post ahead)
The camera-ready version of our NSDI paper, SplitScreen: Enabling Efficient, Distributed Malware Detection is now online. I'm mentioning it because it has (to me) a fun story behind it: The project started out with absolutely nothing to do with virus or malware scanning. I like this one as an example of happenstance in research, as I'll explain. And it has a cute trick involving Bloom filters, so what more could you want?
Two years ago, I was chatting with Tom Mitchell about his Read the Web project. Part of RTW involves searching the web to count how many times a phrase occurs, for millions of phrases. In other words, fgrep -f BigPhraseFile web/*. The BigPhraseFile is big. The web is really big.
We wanted to run this search on our FAWN cluster of tiny computers, to see if we could do it better, faster, cheaper, lower-power. But then a problem arose: grep built a 4 gigabyte search trie (using a variant of Aho-Corasick). Our first-generation wimpy nodes had 256MB of memory. We promptly locked up the cluster and had to reboot everything.
Iulian Moraru spent the next while creating a memory-optimized algorithm to solve the problem, and succeeded. The technique, feed-forward bloom filters splits the processing into two steps. The first process imprecisely filters the two inputs (the phrases and the web) to produce two outputs: A smaller set of phrases, which includes all potentially matching phrases and some extras, and a smaller version of the web, which includes all potentially matching lines from the web corpus. We then run a conventional exact-match grep on those smaller files. The result is that it uses a twentieth of the memory. This, in turn, means that the problem can fit in cache better on normal processors. We then applied further techniques to cache-optimize the data structure.
The result works for Tom's read the web problem -- and then David Brumley suggested that we might try applying it to virus scanning. David's student Sang Kil Cha and Iulian integrated the FFBF processing into the open-source ClamAV virus scanner, and managed to (at least) double the speed that it can scan for viruses.
I find this one fun because it illustrates in one paper the very diverse set of inputs it took to reach the end: We had a bunch of memory-constrained nodes, so we were thinking about a particular set of challenges. Tom's RTW project got us hooked on the idea of making grep work on those nodes. David suggested the application to ClamAV that put us on the path towards the NSDI paper. There's something to be said for surrounding yourself with smart people. :)
p.s., we now have a very fast hammer - feed forward bloom filters - if you think you have an appropriate nail. :) Code available upon request. We're also still searching for the right venue in which to publish a paper about the FFBFs themselves - some place that likes algorithm implementation details, minimizing TLB misses, cache lines, and bloom filter math... thoughts?

Comments from Old Blog

  1. Anon: possible venues: alenex?
  2. When you decide on a venue for FFBF, please post about it.
    I have a similar work and also can't find a good place to publish it.
  3. Ephermata replied: Maybe ask Michael Mitzenmacher about the right venue? That seems like the kind of algorithm he enjoys pushing.
  4. I replied:
  5. Alenex has come up before. I'll have to dig on that one - I'm not familiar with it. (Thank you!).
    I'll post where we submit. :)
    Re Mitzenmacher: Also a good suggestion. Now that he's no longer our shepherd, I can do that. :)

  • Anon: That would be right up SPAA's alley, though the deadline timing is bad.
  • dorra11 (deleted blog) noted: Looking forward to get in touch with that paper, please keep us updated. I didn't even knew the split screen scanning option exists before finding your post, it looks like you did a really good job guys! Got any tips on regcure applications?
  • Comments

    Popular posts from this blog

    Reflecting on CS Graduate Admissions

    Minting Money with Monero ... and CPU vector intrinsics

    Finding Bugs in TensorFlow with LibFuzzer