IMAGENET1M, A Dataset for Large Scale Content Based Image Retrieval

(Semi-supervised Approximate Nearest Neighbor Search)

Overview

This is IMAGENET1M, a benchmark dataset for (semi-supervised) approximate nearest neighbor search (ANNS) algorithms. IMAGENET1M contains 2048-dimensional real-valued features obtained from a deep neural network model on the 10%-labeled ILSVRC-15 dataset.

IMAGENET1M has 4 parts:

Base: serves as the ANNS base set. The corresponding images are those in the ILSVRC-15 training set.
Query: serves as the ANNS query set. The corresponding images are half of the ILSVRC-15 validation set: we randomly selected 50% of the images from each category.
Training: for deep neural network training. The corresponding images are 10% of the ILSVRC-15 training set: we randomly selected 10% of the images from each category.
Validation: for deep neural network validation. The corresponding images are the other 50% of the images in the ILSVRC-15 validation set.

For each part, we provide both the 2048-dimensional features and the image list.

Why this dataset?

A Content Based Image Retrieval (CBIR) system is essentially a semi-supervised approximate nearest neighbor search system. Some of the samples in the database have labels and we want to develop effective semi-supervised ANNS methods to provide efficient search.

However,

The widely used ANNS benchmark datasets (e.g., SIFT1M and GIST1M) are unsupervised, and are not appropriate for evaluating CBIR (semi-supervised ANNS) methods.
The datasets (e.g., MNIST, CIFAR10, NUS-WIDE) used in many ANNS papers (mainly hashing papers) are too small and too simple. These datasets are also not appropriate for evaluating large scale CBIR methods.

Thus, we released this large scale semi-supervised ANNS benchmark dataset.

How to use

The dataset can be used to evaluate an unsupervised ANNS method by simply ignoring all the labels, just as with SIFT1M and GIST1M. The samples have 2048-dimensional real-valued features, a much higher dimensionality than that of SIFT1M (128) or GIST1M (960). For this case, we provide a 100-NN brute-force search result for evaluation.
The dataset can be used to evaluate a semi-supervised ANNS method. In this case, only the labels of the training and validation sets may be used for learning the semi-supervised ANNS method. The following table shows the P@K of the brute-force search for K = 10, 50 and 100:

K	10	50	100
P@K	0.424504	0.39418	0.377997
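
The two evaluation modes above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation script: it assumes that P@K counts a retrieved base sample as correct when its label equals the query's label (the page does not spell out the definition), and that the unsupervised case is scored as recall against the brute-force neighbor list. The function names `precision_at_k`, `mean_precision_at_k` and `recall_at_k` are hypothetical.

```python
def precision_at_k(retrieved_labels, query_label, k):
    """Fraction of the first k retrieved labels equal to the query label.

    Assumption: a retrieved sample is "correct" when its class label
    matches the query's class label.
    """
    if len(retrieved_labels) < k:
        raise ValueError("need at least k retrieved items")
    hits = sum(1 for label in retrieved_labels[:k] if label == query_label)
    return hits / k


def mean_precision_at_k(ranked_ids, base_labels, query_labels, k):
    """Average P@K over all queries.

    ranked_ids[q] is the list of base indices returned for query q,
    ordered from nearest to farthest.
    """
    total = 0.0
    for neighbors, query_label in zip(ranked_ids, query_labels):
        labels = [base_labels[i] for i in neighbors[:k]]
        total += precision_at_k(labels, query_label, k)
    return total / len(query_labels)


def recall_at_k(found_ids, true_ids, k):
    """Share of the true k nearest neighbors that the method returned.

    true_ids would come from the brute-force result (label-free evaluation).
    """
    return len(set(found_ids[:k]) & set(true_ids[:k])) / k


# Toy data: five base samples with labels, two queries.
base_labels = [0, 0, 1, 1, 2]
ranked_ids = [[0, 1, 2], [3, 0, 2]]
query_labels = [0, 1]
print(mean_precision_at_k(ranked_ids, base_labels, query_labels, k=2))  # 0.75
```

For the toy data, the first query retrieves two samples of its own class (P@2 = 1.0) and the second retrieves one (P@2 = 0.5), giving the mean of 0.75.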

Download

In the following table, we provide the details of the dataset and the download links.

Dataset split	Data dimension	# Data	Download features	MD5	Image list	Label
Base	2048	1,281,167	Download base features (8.79GB)	21a976548a419a27ac6393e8f399f346	base_image_list	base_label, training indices (0: not in the training set, 1: in the training set)
Query	2048	25,000	Download query features (175MB)	566487cfc54650154a405efa301bb876	query_image_list	query_label
Training	2048	128,161	can be generated from the base features and the training indices		can be generated from base_image_list and the training indices	
Validation	2048	25,000	Download validation features (175MB)	ecb5eea2f20e1355590bbe670006cfbc	val_image_list	val_label
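
Two steps implied by the table can be sketched as follows: checking a downloaded file against its listed MD5, and building the Training split by keeping the base entries whose training index is 1. This is only an illustration under assumptions: the helper names `file_md5` and `select_training` are hypothetical, and the training indices are assumed to have already been read into a list of 0/1 integers, one per base sample and in the same order as the base features.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a multi-gigabyte feature file never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def select_training(items, train_flags):
    """Keep the items flagged 1 (in the training set); drop those flagged 0.

    items can be base feature vectors, image-list lines or labels,
    as long as they are aligned with train_flags.
    """
    if len(items) != len(train_flags):
        raise ValueError("items and train_flags must have the same length")
    return [item for item, flag in zip(items, train_flags) if flag == 1]


# Toy example: four base entries, two of which are in the training set.
print(select_training(["img_a", "img_b", "img_c", "img_d"], [1, 0, 0, 1]))
# ['img_a', 'img_d']
```

A download would then be accepted when `file_md5(local_path)` equals the MD5 string in the table, where `local_path` is wherever the file was saved.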

The features are in the fvecs format; the brute-force search results are in the ivecs format (more information on the fvecs and ivecs format). The image lists and labels are plain-text (TXT) files.
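
As a rough guide to reading these files, the sketch below assumes the usual fvecs/ivecs layout: each vector is stored as a 4-byte little-endian integer giving its dimension d, followed by d 4-byte components (float32 for fvecs, int32 for ivecs). The function name `iter_vecs` is hypothetical, and the reader uses only the Python standard library; it streams one vector at a time rather than loading the whole file.

```python
import struct


def iter_vecs(path, component="f"):
    """Yield one vector (as a tuple) per record of a vecs file.

    component="f" reads float32 components (fvecs),
    component="i" reads int32 components (ivecs).
    """
    if component not in ("f", "i"):
        raise ValueError("component must be 'f' (fvecs) or 'i' (ivecs)")
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                return  # clean end of file
            if len(header) < 4:
                raise ValueError("truncated dimension header")
            (dim,) = struct.unpack("<i", header)
            if dim <= 0:
                raise ValueError("invalid vector dimension: %d" % dim)
            payload = f.read(4 * dim)
            if len(payload) != 4 * dim:
                raise ValueError("truncated vector")
            # e.g. "<2048f" for one 2048-dimensional feature vector
            yield struct.unpack("<%d%s" % (dim, component), payload)
```

For example, `next(iter_vecs(path))` returns the first vector of the file at `path`, and `len()` of that tuple should be 2048 for the feature files described here.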

Reference

If you use this dataset, we would appreciate it if you cite the following paper:

Deng Cai, Xiuye Gu and Chaoqi Wang, "A Revisit of Deep Hashings for Large-scale Content Based Image Retrieval", arXiv:1711.06016. Bibtex source

Contact

If you have any questions or suggestions, please feel free to contact us!

Deng Cai (dengcai@gmail.com)

Xiuye Gu (gxy0922@zju.edu.cn)

Chaoqi Wang (cqwong@zju.edu.cn)
