Showing posts from February, 2016

Cleaning the Imagenet Dataset, collected notes

As part of my sabbatical at Google, I spent the last month working on processing images from the Imagenet Large Scale Visual Recognition Challenge (ILSVRC 2012) dataset using Tensorflow. (Note that I've linked to the '14 dataset because it contains the image blacklist I discuss below, but it has the same classification images as the '12 dataset.) Cleaning data before feeding it into a machine learning system is both time-consuming and somewhat annoying, a fact well-known enough that there's an entire subreddit dedicated to it. Despite being a curated "challenge" dataset, it turns out that ILSVRC'12 needs cleaning as well. Much of this is already known among people who use the dataset, but with the recent explosion in popularity of machine and deep learning, I figured I'd put my collected notes here to save others the time. Without further ado, the ILSVRC 2014 Data Gotchas: Images in the wrong format: (1) Unlike each of its ~million