Cleaning the ImageNet Dataset, collected notes

As part of my sabbatical at Google, I spent the last month working on processing images from the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012) dataset using TensorFlow.  (Note that I've linked to the '14 dataset because it contains the image blacklist I discuss below, but it has the same classification images as the '12 dataset.)

As is well-known enough that there's an entire subreddit dedicated to it, cleaning data before feeding it into a machine learning system is both time-consuming and somewhat annoying.  Despite being a curated "challenge" dataset, it turns out that ILSVRC'12 needs cleaning as well.  Much of this is known already among people who use the dataset, but with the recent explosion in popularity of machine and deep learning, I figured I'd put my collected notes here to save others the time.

Without further ado, the ILSVRC 2014 Data Gotchas:

Images in the wrong format:

(1)  Unlike each of its ~million peers, Image n02105855_2933.JPEG is not a JPEG.  It's a PNG.  Oh, it's named '.JPEG', but it's not.  If you feed it to tf.image.decode_jpeg, you'll end up with an exception.  In other contexts, you may end up with workers silently dying.

(2)  22 of the images are CMYK JPEGs, for which there's no real standard.  Many decoders, TensorFlow and MATLAB included, don't like these.

Solution:  Convert these before doing anything else.

If converting to a record of tf.Examples, the most robust approach is to use an image processing library to check the type of each image as you load it.  An easy hack with ImageNet is to click the above link to a GitHub repository that lists the bad ones, and munge them using the ImageMagick convert utility.
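If you'd rather detect the bad files yourself than rely on a list, here is a minimal sketch of the check using only the Python standard library.  It looks at the file's leading bytes for the PNG signature and, for JPEGs, walks the header segments to read the component count from the start-of-frame marker.  The function name and the returned labels are my own for illustration, and treating "4 components" as "CMYK-style" is an assumption (it also covers YCCK files).

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Start-of-frame markers: 0xC0-0xCF except DHT (0xC4), JPG (0xC8), DAC (0xCC).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def sniff_image(data: bytes) -> str:
    """Classify raw file bytes as 'png', 'jpeg', 'jpeg-cmyk', or 'unknown'.

    'jpeg-cmyk' means a JPEG whose frame header declares 4 components
    (assumed here to be CMYK or YCCK).
    """
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if not data.startswith(b"\xff\xd8"):  # JPEG start-of-image marker
        return "unknown"
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return "unknown"  # lost sync with the segment structure
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:
            pos += 2  # standalone markers carry no length field
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            # Frame header: precision(1), height(2), width(2), components(1).
            idx = pos + 4 + 5
            if idx >= len(data):
                return "unknown"  # truncated file
            return "jpeg-cmyk" if data[idx] == 4 else "jpeg"
        pos += 2 + length  # skip marker plus the segment it introduces
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the whole file: large ICC/EXIF segments can precede the frame header."""
    with open(path, "rb") as f:
        return sniff_image(f.read())
```

Anything that comes back as something other than 'jpeg' is a candidate for conversion (with ImageMagick convert, as above, or the image library of your choice) before it goes anywhere near the JPEG decoder.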

Two sets of bounding boxes:

If you download the bounding boxes from the easy-to-navigate-to download bounding boxes page, you will get one set.  But there is a more comprehensive set of bounding boxes available from a much harder-to-find bboxes page once you've registered.  (It is not public, so I'm not linking it here.  Go register, and you can dig it up on the signed-up page.)  The latter is the one used by, e.g., Google's 2014 winning entry, Inception.  Don't miss out on those extra bboxes if you're using the info.

Invalid bounding boxes:

Once you have all of the bounding boxes, life still isn't optimal:  Some of the bounding boxes are defined to occur completely outside (or partially outside) of the image they describe!

I chose to clean these up in preprocessing rather than complicate my model with a bunch of corner-case conditionals.
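One way such a cleanup pass could look is sketched below.  This is not the exact code I ran; it assumes boxes are given as (xmin, ymin, xmax, ymax) in pixel coordinates, clips each box to the image, and drops boxes that end up with no area.  The function name is hypothetical.

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax), in pixels


def clip_box(box: Box, width: int, height: int) -> Optional[Box]:
    """Clip a box to [0, width] x [0, height].

    Returns None when the box lies entirely outside the image (or is
    degenerate), so the caller can simply discard it.
    """
    xmin, ymin, xmax, ymax = box
    xmin = max(0, min(xmin, width))
    xmax = max(0, min(xmax, width))
    ymin = max(0, min(ymin, height))
    ymax = max(0, min(ymax, height))
    if xmax <= xmin or ymax <= ymin:
        return None  # nothing left inside the image
    return (xmin, ymin, xmax, ymax)


# A partially outside box is trimmed; a fully outside box is dropped.
print(clip_box((-10, 5, 50, 120), width=100, height=100))
print(clip_box((150, 10, 200, 50), width=100, height=100))
```

Whether to clip or to throw away a partially outside box is a judgment call; clipping keeps more training signal, while dropping is the safer choice if you suspect the annotation is simply wrong.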

Blacklisted or non-blacklisted validation set:

Be sure you're comparing apples-to-apples.  The validation set consists of 50,000 images.  In the 2014 devkit, there is a list of 1762 "blacklisted" files for which the bboxes (or perhaps the labels) are too ambiguous or wrong.  Make sure you're comparing the same thing.  For a while, I was accidentally comparing results using the full 50,000 against the published numbers from papers that omit the blacklisted images.  It made something like a 0.3% overall difference in top-1 accuracy (ouch).

Note that this blacklist is new, and should most likely be applied only for an object detection task, not an image classification task, even though the file is named in a way that makes it seem like it might apply for both.  If you're doing whole-image classification, use the 50,000 -- but check to make sure that the things you're comparing against also did so!
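To make the comparison explicit, it can help to report accuracy both ways.  The sketch below assumes the blacklist is a text file with one 1-based validation image index per line, and that predictions and labels are ordered by validation image index; check both assumptions against your copy of the devkit.  The function names are my own.

```python
from typing import AbstractSet, Iterable, Sequence, Set


def load_blacklist(lines: Iterable[str]) -> Set[int]:
    """Parse blacklist lines (assumed: one 1-based image index per line)."""
    return {int(line) for line in lines if line.strip()}


def top1_accuracy(
    preds: Sequence[int],
    labels: Sequence[int],
    skip: AbstractSet[int] = frozenset(),
) -> float:
    """Top-1 accuracy over images whose 1-based index is not in `skip`."""
    kept = [
        (p, y)
        for i, (p, y) in enumerate(zip(preds, labels), start=1)
        if i not in skip
    ]
    if not kept:
        raise ValueError("no images left to evaluate")
    return sum(p == y for p, y in kept) / len(kept)


# Toy example: four validation images, the third one blacklisted.
preds = [1, 2, 3, 4]
labels = [1, 2, 0, 4]
blacklist = load_blacklist(["3\n"])
print("full set:      ", top1_accuracy(preds, labels))
print("minus blacklist:", top1_accuracy(preds, labels, blacklist))
```

Printing both numbers side by side makes it much harder to accidentally compare a full-set result against a blacklist-filtered one.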

Hope this saves someone a bit of hassle, and happy learning! 
