IMAGENET1M, A Dataset for Large Scale Content Based Image Retrieval

(Semi-supervised Approximate Nearest Neighbor Search)


This is a benchmark dataset, IMAGENET1M, for (semi-supervised) approximate nearest neighbor search (ANNS) algorithms. IMAGENET1M contains 2048-dimensional real-valued features obtained from a deep neural network model on the 10%-labeled ILSVRC-15 dataset.

IMAGENET1M has 4 parts: Base, Query, Training, and Validation.

For each part, we provide both the 2048-dimensional features and the image list.

Why this dataset?

A Content Based Image Retrieval (CBIR) system is essentially a semi-supervised approximate nearest neighbor search system. Some of the samples in the database have labels, and we want to develop effective semi-supervised ANNS methods that provide efficient search.


We have therefore released this large-scale semi-supervised ANNS benchmark dataset.

How to use

K | 10 | 50 | 100
P@K | 0.424504 | 0.39418 | 0.377997
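
The P@K row above reports precision at K. The page does not spell out the definition, so the following is only a minimal sketch under the common label-based assumption: for each query, P@K is the fraction of its K nearest retrieved base items whose label equals the query's label, averaged over all queries. The function name and the toy data are hypothetical and not taken from the dataset.

```python
def precision_at_k(retrieved_ids, query_labels, base_labels, k):
    """Mean fraction of the top-k retrieved base items sharing the query's label.

    retrieved_ids[i] lists base indices for query i, nearest first.
    Assumption: a retrieved item counts as correct when its label equals
    the query label; this definition is not stated on the page.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    total = 0.0
    for ids, q_label in zip(retrieved_ids, query_labels):
        top = ids[:k]
        if len(top) < k:
            raise ValueError("fewer than k results for a query")
        hits = sum(1 for j in top if base_labels[j] == q_label)
        total += hits / k
    return total / len(query_labels)


# Toy example with made-up labels (not dataset values).
base_labels = [0, 0, 1, 1, 2]
query_labels = [0, 1]
retrieved_ids = [[0, 2, 1], [2, 3, 4]]
p_at_2 = precision_at_k(retrieved_ids, query_labels, base_labels, 2)
```

With the real data, retrieved_ids would presumably come from the brute-force results and the labels from base_label and query_label, if the assumed definition matches the one used for the table.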


The following table lists the details of each split and the links to download it.
Dataset split | Data dimension | # Data | Download features | MD5 | Image list | Label
Base | 2048 | 1,281,167 | Download base features (8.79GB) | 21a976548a419a27ac6393e8f399f346 | base_image_list | base_label; training indices (0 represents not in the training set, 1 represents in the training set)
Query | 2048 | 25,000 | Download query features (175MB) | 566487cfc54650154a405efa301bb876 | query_image_list | query_label
Training | 2048 | 128,161 | can be generated using base features and training indices | - | can be generated using base_image_list and training indices | -
Validation | 2048 | 25,000 | Download validation features (175MB) | ecb5eea2f20e1355590bbe670006cfbc | val_image_list | val_label
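
The Training row says that the split can be generated from the base data and the training indices. A minimal sketch of that selection follows. It assumes the training-indices file holds one 0/1 flag per line, in the same order as the base vectors and base_image_list. The file layout and all names below are assumptions, not something the page confirms.

```python
def read_flags(lines):
    """Parse one 0/1 flag per line; blank lines are skipped."""
    flags = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text not in ("0", "1"):
            raise ValueError(f"unexpected flag: {text!r}")
        flags.append(text == "1")
    return flags


def select_training(items, flags):
    """Keep items[i] where flags[i] is True (1 = in the training set)."""
    if len(items) != len(flags):
        raise ValueError("items and flags must have the same length")
    return [item for item, keep in zip(items, flags) if keep]


# Toy data standing in for base_image_list and the training-indices file.
image_list = ["a.JPEG", "b.JPEG", "c.JPEG", "d.JPEG"]
flags = read_flags(["0\n", "1\n", "1\n", "0\n"])
train_images = select_training(image_list, flags)
```

The same flags can be applied to the base feature vectors. If the layout assumption holds, the selection on the real files should keep 128,161 of the 1,281,167 base items.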

The features are in the fvecs format; the brute-force search results are in the ivecs format (more information on the fvecs and ivecs formats). The image lists and labels are plain-text (TXT) files.
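
For readers unfamiliar with these formats, here is a minimal reading sketch using only the Python standard library. It assumes the usual fvecs/ivecs layout, in which each vector is stored as a little-endian int32 dimension d followed by d 4-byte components (float32 for fvecs, int32 for ivecs). The helper names are hypothetical, and the toy round trip uses an in-memory buffer rather than the real files.

```python
import io
import struct


def iter_vecs(stream, typecode):
    """Yield vectors from a binary stream in fvecs ('f') or ivecs ('i') layout.

    Assumed layout per vector: a little-endian int32 dimension d,
    followed by d 4-byte components.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) != 4:
            raise ValueError("truncated dimension header")
        (dim,) = struct.unpack("<i", header)
        if dim <= 0:
            raise ValueError(f"invalid dimension: {dim}")
        payload = stream.read(4 * dim)
        if len(payload) != 4 * dim:
            raise ValueError("truncated vector")
        yield list(struct.unpack(f"<{dim}{typecode}", payload))


def write_vecs(stream, vectors, typecode):
    """Write vectors in the same assumed layout (used here for the toy example)."""
    for vec in vectors:
        stream.write(struct.pack("<i", len(vec)))
        stream.write(struct.pack(f"<{len(vec)}{typecode}", *vec))


# Toy round trip through an in-memory buffer.
buf = io.BytesIO()
write_vecs(buf, [[1.0, 2.5, -3.0], [0.0, 0.5, 4.0]], "f")
buf.seek(0)
vectors = list(iter_vecs(buf, "f"))
```

For the downloaded files, the stream would be a file opened in binary mode, with typecode "f" for the features and "i" for the brute-force results; each feature vector should then have 2048 components. Because the generator reads one vector at a time, it does not need to hold the 8.79GB base file in memory.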


If you use this dataset, we would appreciate it very much if you cite the following paper:


If you have any questions or suggestions, please feel free to contact us!

Deng Cai (

Xiuye Gu (

Chaoqi Wang (
