CNN Baseline for Fine-Grained Recognition

Fine-grained recognition, or subordinate categorization as it is called in psychology, refers to the visual task of classifying between very similar categories. Differentiating between categories like cars, bikes, and persons is quite an easy task for humans, and that is what psychologists refer to as basic-level categorization. In reality, some categories differ from each other only in very minor ways, and amateurs with little experience or domain knowledge often make mistakes in this task. For example, there are more than 900 bird species in North America alone; experts who are familiar with birds can identify the species of a bird without difficulty, but for me, the total number of bird species I can name is less than twenty. Another popular dataset in fine-grained recognition is the car dataset. I myself am very interested in cars and have trained myself on thousands of cars on the street. By the way, in the compact car market I like the 2016 Mazda 3 and Honda Civic, as they both have very sporty exteriors.


CUB birds
Can you identify which species the bird on the left belongs to?

Computer vision conferences are flooded with convolutional neural networks nowadays, dating back to the prize-winning AlexNet in the 2012 NIPS paper by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. CNNs are conquering a large variety of vision tasks such as classification, localization, and segmentation. I heard the professor in my computer vision class say: “when you see someone who you would never imagine doing research on deep learning is just doing it, you will understand how hot this topic is right now.”


How CNN sees Tesla Model S

A very natural question in my mind is what makes fine-grained recognition so different from general classification. In fine-grained recognition, the features that differentiate two similar categories are very subtle and hard to capture with traditional computer vision algorithms. However, deep learning models now seem to serve as a strong baseline even for this task, as the paper “CNN Features off-the-shelf” suggested. It is therefore worth knowing how the models that perform well on ImageNet do on fine-grained recognition.

The goal of this blog post is to give baseline results on a range of popular fine-grained datasets using models fine-tuned from ILSVRC (ImageNet Large Scale Visual Recognition Challenge) networks. To be specific, I will explore two bird datasets and one car dataset:

| Name | Year | Total images | Total classes | Category |
|---|---|---|---|---|
| CUB-200-2011 | 2011 | 11,788 | 200 | Bird |
| NABirds | 2015 | 70,000 | 555 | Bird |
| Stanford Cars | 2013 | 16,185 | 196 | Car |

I will use four popular CNN models from ILSVRC:

| Name | Year | Top-1 err (%) | Top-5 err (%) | Place |
|---|---|---|---|---|
| AlexNet | 2012 | 38.1 | 16.4 | 1st |
| VGG Net | 2014 | 24.7 | 7.32 | 2nd |
| GoogLeNet | 2014 | - | 6.67 | 1st |
| ResNet | 2015 | - | 3.57 | 1st |

I use the 16-layer version of VGG Net and run all experiments in Caffe. For AlexNet and GoogLeNet I use the models provided by Caffe; I downloaded the VGG Net model from here and the ResNet model from here.

It is convenient that all these datasets come with a default train/test split I can make use of. Another good thing is that they have bounding box annotations: feeding the cropped region instead of the whole image yields a roughly 10% boost in accuracy. I found that setting the stepsize to 2,000 (decreasing the learning rate by a factor of 0.1 every 2,000 iterations) keeps the training loss decreasing. The accuracy seems to plateau after roughly 5,000 iterations, and I use the results at 20,000 iterations as the final numbers (for VGG Net, I found that a stepsize of 1,000 and 10,000 total iterations yields a better result). The base_lr is set to 0.001. The test result is taken from the test phase of the training process, with test_iter set to 100: if the batch size is 50, roughly 50 * 100 = 5,000 test images are evaluated, which is different from evaluating the model on the whole test set.
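
For concreteness, here is a minimal sketch of how such a fine-tuning run could be launched with pycaffe. The file names solver.prototxt and bvlc_alexnet.caffemodel are placeholders for whichever network is being fine-tuned, and the solver settings listed in the comments simply mirror the parameters described above.

```python
import caffe

# Minimal fine-tuning sketch with pycaffe (placeholder file names).
# The solver.prototxt is assumed to contain the settings described above, e.g.:
#   base_lr: 0.001
#   lr_policy: "step"
#   gamma: 0.1
#   stepsize: 2000
#   max_iter: 20000
#   test_iter: 100
caffe.set_mode_gpu()
solver = caffe.SGDSolver('solver.prototxt')

# Initialize from the ILSVRC-pretrained weights instead of training from scratch.
solver.net.copy_from('bvlc_alexnet.caffemodel')

# Run the full schedule; test-phase accuracy is reported during training.
solver.solve()
```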

| Accuracy (%) | CUB-200-2011 | NABirds | Stanford Cars |
|---|---|---|---|
| AlexNet | 63.2 | 57.8 | 75.6 |
| VGG Net | 74.7 | 74.9 | 85.2 |
| GoogLeNet | 74.4 | 71.0 | 81.8 |
| ResNet | 76.3 | 76.0 | 85.0 |

Note that these are only baseline results from fine-tuning CNN models; I am by no means trying to set the state of the art. They serve only as an indicator of how popular CNN models perform on popular fine-grained datasets. You may obtain different results if you use different training parameters than mine.

Another interesting baseline algorithm is to extract features using the fine-tuned CNN models and train SVM classifiers to get the final results. I expect better results from this method, because an SVM normally performs better than the softmax classifier used as the final layer of the CNN.
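
As a rough illustration, a pipeline like the following could compute this CNN-features-plus-SVM baseline. The deploy prototxt, caffemodel, and the fc7 layer name are placeholders for this sketch (fc7 applies to AlexNet and VGG; GoogLeNet and ResNet would use their final pooling layer instead), and train_images/test_images are assumed to be lists of image paths from the dataset's default split.

```python
import numpy as np
import caffe
from sklearn.svm import LinearSVC

# Placeholder model files for this sketch; mean subtraction is omitted for brevity.
net = caffe.Net('deploy.prototxt', 'finetuned.caffemodel', caffe.TEST)

# Standard pycaffe preprocessing: HWC RGB in [0, 1] -> CHW BGR in [0, 255].
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_raw_scale('data', 255)

def extract_feature(image_path, layer='fc7'):
    # Forward one image and return the activations of the chosen layer.
    image = caffe.io.load_image(image_path)
    net.blobs['data'].data[0] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs[layer].data[0].flatten().copy()

# train_images/train_labels and test_images/test_labels are assumed to come
# from the dataset's default train/test split (image paths and class ids).
X_train = np.array([extract_feature(p) for p in train_images])
X_test = np.array([extract_feature(p) for p in test_images])

# A linear SVM on top of the CNN features, in place of the softmax layer.
clf = LinearSVC(C=1.0)
clf.fit(X_train, train_labels)
print('SVM accuracy: %.1f%%' % (100 * clf.score(X_test, test_labels)))
```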

AlexNet's performance is not as good as that of the other three, more recent models, but it benefits from its simplicity and shorter training time. VGG Net, GoogLeNet, and ResNet are comparable in their final results. Previously, VGG Net was often preferred for fine-tuning because it was said to consistently outperform GoogLeNet after fine-tuning; here, VGG wins one entry and is better than GoogLeNet on all three datasets. The deep residual network proves its strength in this experiment by winning the other two entries. It should be no surprise if more and more papers use deep residual networks for fine-tuning in the future.

The fact that the fine-tuned CNN models produce accuracies of around 70-80% implies that there is still room for researchers to explore. Take the CUB dataset for example: the state-of-the-art results I have read about recently went from 84.1% to 84.6% to 92.8% in just six months. The last paper sets a really high bar for upcoming papers, so let's try to beat it!