Finding Bugs in TensorFlow with LibFuzzer

[Comic: xkcd #1137]
Over the past year, I've spent some time working on improving the robustness of TensorFlow.  As I mentioned earlier, one of my goals for my time at Google was to dive into industry best-practices for writing good code.  At Google, writing good code starts with careful programmers, requires good tests that get run on a fantastic internal testing infrastructure, is improved through code review, and makes use of several code quality tools and linters.

One part of that testing that's been gaining more visibility recently is fuzzing - throwing random inputs at programs or libraries to try to cause them to crash.  John Regehr has been fuzzing compilers for a while now - very effectively.  (That link has a nice taxonomy of the types of fuzzers.)  Google's Project Zero has been fuzzing the FreeType library for the last 4 years, and has found a tremendous number of security vulnerabilities in all sorts of programs.   (This isn't to suggest fuzzing is new - it's been used for years by security researchers, and was invented in 1988 by Barton Miller.  Back in 1997, amusingly, before I knew what the term fuzzing was, a much younger me used a random input generator to find a bug in the terminal servers I was deploying at my ISP.  I remember being very pleased with myself at the time -- probably excessively so.)

Modern, production fuzzers such as AFL and libFuzzer aren't purely random, they can be guided:  They start with an input "corpus" of examples, and then mutate them.  They use compiler and binary rewriting support to guide the exploration of mutations to maximize code coverage.  The combination of a good starting corpus and coverage-guided mutation makes for impressive results.

Midway through my year at Google, Chris Olah, Dario Amodei, and Dan Mané started writing their Concrete Problems in AI Safety paper.  While many of the problems and solutions they discuss are in the realm of machine learning, I was interested in the question of the systems side, provoked in part by their example of a machine intelligence-based system exploiting a buffer overflow in its reward function (by accident, but producing undesirable results).  A malfunctioning reward function isn't the kind of thing that traditional isolation approaches such as sandboxing can prevent - it manifests as a logic bug, not an escape-into-the-rest-of-the-system.  Not that it's a complete solution, but it made me curious whether there was value in trying to fuzz TensorFlow.  And so, encouraged by Kostya Serebryany, I decided to write some adapters from libFuzzer to some of the TensorFlow kernels.

(For those who note that some of these bugs wouldn't have existed in more strongly-typed or safe languages, you're probably right.  But high-performance machine learning software faces very strong demands for performance, correctness, and flexibility, and I'm very sympathetic to the challenge of achieving all three of these goals simultaneously.  I'm hopeful that the new XLA compiler framework might make it easier to achieve them, but since it's still a work-in-progress, that remains to be seen!)

How to Fuzz

All libFuzzer tests start by writing a single function that tests a library call or calls, like this:

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  DoSomethingInterestingWithMyAPI(Data, Size);
  return 0;  // Non-zero return values are reserved for future use.
}

When the compiled fuzzing binary is executed, it repeatedly invokes this function with "random" (mutated) data supplied in Data.  It's your job as the fuzzing adapter writer to figure out how to map that blob of random data to calls to your application.
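As a sketch of that mapping, you might peel off a byte or two of the blob to choose parameters and pass the rest through as the payload.  (This is a hedged example: ParseKeyValue is a made-up stand-in for whatever API you're actually testing.)

```cpp
#include <cstdint>
#include <string>

// Hypothetical library call under test; stands in for a real API.
static bool ParseKeyValue(const std::string& input) {
  size_t eq = input.find('=');
  return eq != std::string::npos && eq > 0;
}

// Carve the raw byte blob into structured arguments: here, the first
// byte selects a "mode" and the remainder becomes the string input.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* Data, size_t Size) {
  if (Size < 1) return 0;      // Not enough bytes to do anything with.
  uint8_t mode = Data[0] % 2;  // Derive a small enum-like value.
  std::string payload(reinterpret_cast<const char*>(Data + 1), Size - 1);
  if (mode == 0) {
    ParseKeyValue(payload);
  } else {
    ParseKeyValue(payload + "=1");
  }
  return 0;
}
```

The key property is determinism: the same input bytes should drive the same API calls, so that a crashing input reproduces reliably.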

When fuzzing, the binary is typically compiled with LLVM's address sanitizer, which detects several common memory errors, such as out-of-bounds array accesses, and turns them into a crash.  The libFuzzer driver detects that crash and saves the example that caused it.  An output might look something like this:

dga@instance-1:~/fuzz$ ./fuzz 
INFO: Seed: 44456222
INFO: Loaded 0 modules (0 guards): 
INFO: -max_len is not provided, using 64
INFO: A corpus is not provided, starting from an empty corpus
#0 READ units: 1
#1 INITED cov: 8 units: 1 exec/s: 0
#2 NEW    cov: 9 units: 2 exec/s: 0 L: 64 MS: 0 
==1310==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fff56ccc6e8 at pc 0x0000004f69e1 bp 0x7fff56ccc690 sp 0x7fff56ccc688

WRITE of size 4 at 0x7fff56ccc6e8 thread T0

libFuzzer leaves behind a file that can be used to reproduce the crash.  After you fix the bug, you would typically add this file to the fuzz seed set to prevent regressions.

What to Fuzz in TensorFlow?

Much of what TensorFlow does is numeric, and finding bugs in those operations is important - it prevents massive headaches when trying to debug why a model isn't working, for example.  But testing the numerical results requires a reference or spec that's (believed to be) correct.  Some of the tests for the XLA compiler for TensorFlow do this, for example, by using the CPU version as a reference and ensuring that the XLA version produces the same answer.  This is an important set of tests for ensuring the correctness of XLA, but when I started writing my version, I didn't have a reference, and didn't want to try to write a spec for every operation.

Instead, I focused on the thing fuzzing tends to do best, which is finding crashes.  I started with the lowest-hanging fruit, and the place I thought bugs were most likely to linger: complex input parsers.  TensorFlow includes functions to encode and decode several image formats, such as tf.image.decode_png.

Starting with the image decoders made things easy - you can find existing corpora of example inputs to start with, and the functions effectively take only a single input, the string to decode into a Tensor.
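For a single-string-input function like a decoder, the adapter is about as simple as it gets: hand the whole blob to the function.  Here's a hedged sketch with a toy decoder standing in for the real kernel (DecodeToyImage is made up; the real target would be the decode_png kernel):

```cpp
#include <cstdint>
#include <string>

// Toy stand-in for an image decoder: two magic bytes, a one-byte width
// and height, then width*height bytes of "pixels".
static bool DecodeToyImage(const std::string& input) {
  if (input.size() < 4) return false;
  if (input.compare(0, 2, "IM") != 0) return false;  // Check magic bytes.
  unsigned w = static_cast<uint8_t>(input[2]);
  unsigned h = static_cast<uint8_t>(input[3]);
  return input.size() == 4 + static_cast<size_t>(w) * h;
}

// The decoder takes a single string, so the whole input blob maps
// directly onto it -- no byte-carving needed.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* Data, size_t Size) {
  DecodeToyImage(std::string(reinterpret_cast<const char*>(Data), Size));
  return 0;
}
```

Seeding the corpus with a handful of real images of the target format gives the mutator valid structures to start from, which is what makes the coverage-guided search effective.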

Many bugs were found and fixed.

This approach found bugs in the PNG decoder and the JPEG decoder.  Extra fuzzing done using automatic fuzzing infrastructure then turned up more, which Brennan Saeta kindly fixed.

After bashing against the image decoders, I turned to some of the string parsing functions, again finding a subtle bug in the strtonum function.  And then again in the TensorProto parser.

Interestingly, fuzzing didn't just find the expected buffer overflows and other failures -- it also pointed out places where the code was unnecessarily fragile in its error handling.

A general design principle in TensorFlow is that errors in a kernel should be returned to the caller in a friendly way, so that they can deal with it appropriately.  A common pattern for handling this is to write code like this, in which immediately after entering the kernel, the programmer writes checks for as many error conditions as can be caught up-front:

explicit AsStringOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
  OP_REQUIRES_OK(ctx, ctx->GetAttr("T", &dtype));
  OP_REQUIRES_OK(ctx, ctx->GetAttr("precision", &precision));
  // ... do the work here after making sure the above worked ...
}

This is good design for any library, since it doesn't impose your idea of failure handling on your users.  By throwing the fuzzer at TensorFlow, I was able to find a few places where errors resulted in program termination instead of having a friendly return.
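As a toy illustration of the difference (not TensorFlow code -- HeaderLength is made up), compare a function that blows up on bad input with one that validates up-front and reports the failure to its caller:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Fragile: an empty buffer means an out-of-bounds read, which under a
// sanitizer (or in production) takes down the whole process.
int HeaderLengthFragile(const std::vector<uint8_t>& buf) {
  return buf[0];
}

// Friendly: check the error condition up-front and return failure to
// the caller, mirroring the OP_REQUIRES-style pattern above.
std::optional<int> HeaderLength(const std::vector<uint8_t>& buf) {
  if (buf.empty()) return std::nullopt;  // Error returned, not a crash.
  return buf[0];
}
```

A fuzzer surfaces exactly this distinction: inputs that should produce a friendly error but instead terminate the program show up as crashes.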

I've Become a Fan of Fuzzing

Before this, I'd primarily thought of fuzzing as something you did to find security holes in other people's code.   Now I've changed my tune, and I'm convinced that using a modern fuzzer is a great part of normal software development and testing.  They won't find all of your bugs, and you have to write the adapters to let them get inside your functions, but they'll teach you new things about your code and help you find problems before they manifest in the real world.  I'm sold.

If you're interested in trying it out on your own code, a good next step would be the libFuzzer tutorial.

Happy bug squishing!

(For more of my posts, see the archive.)

