Neural Codes for Image Retrieval

An example of the retrieval with neural codes. Each row shows the query on the left and the images from the INRIA dataset with the most similar neural codes to the right. The three rows correspond to the three layers in the convolutional neural network.

In this project, we investigate the best ways to obtain global (holistic) image descriptors from deep neural networks. We aim for relatively low-dimensional descriptors (e.g. 128-256 dimensions per image).

The current “best” way that we have found is summarized in the ICCV 2015 paper. In short, the descriptors are based on the last convolutional layer of a pretrained deep network, sum-pooling, and simple post-processing. We call the resulting descriptor SPoC (for Sum-Pooling of Convolutional features).
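The pipeline described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the released implementation: it assumes the last convolutional layer's activations are already available as a (channels, height, width) array, and it assumes the "simple post-processing" consists of L2 normalization, PCA whitening and a second L2 normalization. All function names here are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def sum_pool(feature_map):
    """Sum a (C, H, W) convolutional feature map over its spatial positions."""
    return feature_map.sum(axis=(1, 2))


def fit_pca_whitening(descriptors, dim):
    """Learn a PCA-whitening transform from an (N, C) matrix of descriptors."""
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    # Rows of vt are the principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    std = s[:dim] / np.sqrt(descriptors.shape[0] - 1)
    return mean, vt[:dim], std


def spoc_like_descriptor(feature_map, mean, components, std, eps=1e-12):
    """Sum-pool, normalize, PCA-whiten and normalize again (assumed steps)."""
    pooled = l2_normalize(sum_pool(feature_map))
    whitened = (pooled - mean) @ components.T / (std + eps)
    return l2_normalize(whitened)


def rank_by_similarity(query_code, database_codes):
    """Return database indices sorted from most to least similar to the query."""
    scores = database_codes @ query_code  # dot product of unit-length codes
    return np.argsort(-scores)
```

With unit-length descriptors, ranking by dot product is equivalent to ranking by Euclidean distance, so either can be used for retrieval. The PCA transform would normally be fitted on descriptors from a held-out image set rather than on the images being queried.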

In the ICCV paper, we obtained very competitive retrieval accuracy while using a network pretrained on ImageNet and not fine-tuned for buildings/landmarks. However, in our earlier ECCV 2014 paper, we showed that such fine-tuning is beneficial and collected a special dataset suitable for it (the Landmarks dataset). Note that at the moment we recommend SPoC (and similar) descriptors over the outputs of the fully-connected layers (as used in the ECCV 2014 paper) when generic image descriptors are required. Fine-tuning on a related dataset should be used if maximal performance is desired: our preliminary experiments suggest that SPoC also benefits from fine-tuning, though to a lesser degree than fully-connected features.

Papers:
A. Babenko and V. Lempitsky. Aggregating local deep features for image retrieval. IEEE International Conference on Computer Vision (ICCV), Santiago de Chile, 2015
arXiv:1510.07493

A. Babenko, A. Slesarev, A. Chigorin and V. Lempitsky. Neural Codes for Image Retrieval. European Conference on Computer Vision (ECCV), Zurich, 2014
Paper

Code:
The code and a usage example for the SPoC descriptor: https://github.com/arbabenko/Spoc

The Landmarks Dataset:
Tab-separated file
Format of each line: url image_id class_name (in Russian)
Note: Oxford-related classes were manually removed for our experiments in the ECCV 2014 paper.
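A file in the format above can be read with the Python standard library. The sketch below assumes exactly three tab-separated fields per line and no header row. The helper names and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io
from collections import Counter
from typing import NamedTuple


class LandmarkRecord(NamedTuple):
    url: str
    image_id: str    # kept as a string; the id format is not specified
    class_name: str  # class label, in Russian


def read_landmarks(lines):
    """Parse tab-separated 'url, image_id, class_name' lines into records."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        records.append(LandmarkRecord(*row))
    return records


def class_counts(records):
    """Count how many images belong to each class."""
    return Counter(record.class_name for record in records)


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "http://example.com/a.jpg\t1\tКремль\n"
    "http://example.com/b.jpg\t2\tКремль\n"
)
for name, count in class_counts(read_landmarks(sample)).items():
    print(name, count)
```

For the real file, pass an open file object instead of the sample, e.g. `open(path, encoding="utf-8", newline="")`. The UTF-8 encoding is an assumption and may need adjusting.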

Related work:
Several other groups have been investigating similar directions recently and in parallel. Some very interesting results have been presented by the KTH Stockholm group:
http://www.csc.kth.se/cvap/cvg/DL/ots/