Overlap Between ImageNet and CUB

Primer

In machine learning, the training set and the test set should never overlap. An analogy is GRE preparation: you take practice tests before the real one to strengthen your skills at solving graduate-level problems, but the real test should never contain a problem you already met in practice.

CUB ImageNet Overlap

A few days ago, while browsing the website of the Caltech-UCSD Birds-200-2011 dataset, I noticed the following warning at the top of the page:

Warning: Images in this dataset overlap with images in ImageNet. Exercise caution when using networks pretrained with ImageNet (or any network pretrained with images from Flickr) as the test set of CUB may overlap with the training set of the original network.

It is common these days to take a CNN model pre-trained on ImageNet and fine-tune it on the CUB dataset. This is problematic if the ImageNet training set overlaps with the CUB test set.

Since the ImageNet and CUB images were both downloaded from the internet, it is possible that the overlapping image files are byte-for-byte identical. The naive way to compare two images is to compare their pixel values using an image processing package like Pillow or OpenCV, but this takes too much time. Another solution is to compare the MD5 sums of the files. MD5 is known to be unreliable against deliberately constructed collisions, but accidental collisions are extremely unlikely: the chance that two given different files share an MD5 sum is about $1/2^{128}$, and even across all pairs among ~1.3 million images the birthday bound gives at most about $2^{-88}$. It’s therefore safe to assume that if two MD5 sums match, the original image files are identical.
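The MD5 comparison described above can be sketched roughly as follows. This is a minimal illustration and not the exact script I ran: the function names (`md5_of_file`, `index_by_md5`, `find_overlaps`) and the assumption that each dataset lives under a single root directory are made up for the example. It only uses Python’s standard `hashlib` and `pathlib` modules.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_md5(root):
    """Map each MD5 digest to the list of files under `root` that have it."""
    index = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            index[md5_of_file(path)].append(path)
    return index


def find_overlaps(imagenet_root, cub_root):
    """Return (cub_path, imagenet_path) pairs whose file contents hash identically.

    Both root directories are hypothetical; point them at wherever the
    datasets were extracted.
    """
    imagenet_index = index_by_md5(imagenet_root)
    pairs = []
    for path in sorted(Path(cub_root).rglob("*")):
        if path.is_file():
            for match in imagenet_index.get(md5_of_file(path), []):
                pairs.append((path, match))
    return pairs
```

Hashing the larger dataset once and looking up each CUB file in the resulting dictionary avoids comparing every pair of files. Note that this only catches byte-identical files; a re-encoded or resized copy of the same photo will have a different MD5 sum.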

I found 23 overlapping pairs, which are listed below. Text file for download.

I emailed Max Jaderberg, the author of “Spatial Transformer Networks”, who kindly gave me the list of overlaps they found and used in the paper. They found 22/5794 overlaps, all of which I confirmed with the MD5 sum comparison. They missed one pair, but its influence on the final result is tiny and negligible.

Overall, the overlap found by exact file matching is small (23 pairs against 5,794 CUB test images, roughly 0.4%) and has a negligible influence on the accuracy of the model.

Update

Thanks to Arun Mallya, who kindly referred me to his GitHub Gist containing a full list of overlaps between CUB and ImageNet, found by perceptual hashing and visually verified to be the same. Perceptual hashing can find visually identical pairs even when the underlying pixel values are perturbed. Big applause!
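To give a feel for why perceptual hashing tolerates small pixel changes, here is a minimal sketch of one simple variant, the average hash. This is not necessarily the algorithm used in the Gist. To stay dependency-free it assumes the image has already been decoded into a grayscale grid (a list of rows of brightness values); the names `average_hash` and `hamming_distance` are chosen just for this example.

```python
def average_hash(pixels, hash_size=8):
    """Compute an average hash of a grayscale image given as a list of rows.

    Assumes the image is at least hash_size pixels in each dimension.
    Returns an integer with hash_size * hash_size bits.
    """
    height, width = len(pixels), len(pixels[0])
    # Downsample to hash_size x hash_size by averaging rectangular blocks.
    cells = []
    for i in range(hash_size):
        r0, r1 = i * height // hash_size, (i + 1) * height // hash_size
        for j in range(hash_size):
            c0, c1 = j * width // hash_size, (j + 1) * width // hash_size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: set if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (value > mean)
    return bits


def hamming_distance(a, b):
    """Count the bits in which two hashes differ."""
    return bin(a ^ b).count("1")
```

Two images would then be flagged as candidate duplicates when the Hamming distance between their hashes falls under a small threshold, chosen by hand, and confirmed by eye. Because each bit depends on a block average rather than on individual pixels, small perturbations that change the MD5 sum completely tend to leave the hash unchanged.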