Popular Text Data Sets in Matlab Format

(MATLAB 7 or higher is required to open these files.)

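We provide the tf (term frequency) vector of each document. The following code generates the tf-idf vector, weighting each entry as [1+log(tf)]*log[N/df], where N is the number of documents and df is the document frequency of the term.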

fea = tfidf(fea);
%Now each row of fea is the normalized tf-idf vector (the length of each vector is 1).
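
For readers who want to see the weighting spelled out, the formula above can be sketched directly. This is a minimal sketch under stated assumptions, not the `tfidf` function used above: the toy matrix, the variable names, the natural logarithm, and the zero weight for terms that never occur are all assumptions.

```matlab
% Toy term-frequency matrix (assumed data): 3 documents x 4 terms.
fea = sparse([2 0 1 0; 0 3 1 0; 1 1 1 4]);
[nDoc, nTerm] = size(fea);

% Document frequency: number of documents containing each term.
df = full(sum(fea > 0, 1))';

% idf = log(N/df), natural log assumed; unused terms get weight 0.
idf = zeros(nTerm, 1);
idf(df > 0) = log(nDoc ./ df(df > 0));

% Apply [1 + log(tf)] * log(N/df) to the nonzero entries only.
[i, j, tf] = find(fea);
w = sparse(i, j, (1 + log(tf)) .* idf(j), nDoc, nTerm);

% Scale each row to unit Euclidean length (all-zero rows stay zero).
rowNorm = full(sqrt(sum(w.^2, 2)));
rowNorm(rowNorm == 0) = 1;
fea = spdiags(1 ./ rowNorm, 0, nDoc, nDoc) * w;
```

Note that a term occurring in every document receives weight log(N/N) = 0 under this formula, so it drops out of the normalized vectors.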

If you use the processed data sets on this page, we would appreciate it if you cite the following works:

Deng Cai, Xuanhui Wang and Xiaofei He, "Probabilistic Dyadic Data Analysis with Local and Global Consistency", ICML'09.
Deng Cai, Qiaozhu Mei, Jiawei Han, ChengXiang Zhai, "Modeling Hidden Topics on Document Manifold", CIKM'08.
Deng Cai, Xiaofei He, Wei Vivian Zhang, and Jiawei Han, "Regularized Locality Preserving Indexing via Spectral Regression", CIKM'07.
Deng Cai, Xiaofei He and Jiawei Han, "Document Clustering Using Locality Preserving Indexing", IEEE TKDE 2005.

Top 30 categories in TDT2
We provide a subset of the original TDT2 corpus. The TDT2 corpus (NIST Topic Detection and Tracking corpus) consists of data collected during the first half of 1998 from 6 sources: 2 newswires (APW, NYT), 2 radio programs (VOA, PRI) and 2 television programs (CNN, ABC). It contains 11,201 on-topic documents classified into 96 semantic categories. In this subset, documents appearing in two or more categories were removed and only the largest 30 categories were kept, leaving 9,394 documents in total.
Data File: contains variables 'fea' and 'gnd'. 'fea' is the document-term matrix, each row is a document; 'gnd' is the label.
Doc ID: The corresponding document name in the original TDT2 corpus.
Terms: The corresponding feature name (terms) in the original TDT2 corpus.
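
The subset construction described above (drop multi-category documents, then keep the largest categories) can be sketched as follows. This is only an illustration on toy data: the membership matrix `member`, the variable names, and `nKeep = 2` are assumptions, and the actual preprocessing scripts are not part of this page.

```matlab
% Toy membership matrix (assumed data): 7 documents x 3 categories.
% member(d, c) is 1 when document d is labelled with category c.
member = [1 0 0; 1 1 0; 0 1 0; 0 1 0; 0 0 1; 0 1 0; 1 0 0];
nKeep = 2;                      % the page keeps the largest 30

% Drop documents that appear in two or more categories.
single = find(sum(member, 2) == 1);
[~, gnd] = max(member(single, :), [], 2);   % the single category of each kept document

% Count documents per category and keep the nKeep largest.
counts = accumarray(gnd, 1, [size(member, 2), 1]);
[~, order] = sort(counts, 'descend');
keepCat = order(1:nKeep);

inTop = ismember(gnd, keepCat);
docIdx = single(inTop);         % rows of the original corpus that survive
gnd = gnd(inTop);               % their category labels
```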

Random Clusters Index Files:
2 Classes | 3 Classes | 4 Classes | 5 Classes | 6 Classes | 7 Classes | 8 Classes | 9 Classes | 10 Classes | 15 Classes | 20 Classes | 25 Classes

For each cluster number, there are 50 random cases. Each case file contains the variables 'sampleIdx' and 'zeroIdx'. The following MATLAB code generates the corresponding subset:
%===========================================
fea = fea(sampleIdx,:);
gnd = gnd(sampleIdx,:);

fea(:,zeroIdx) = [];
%===========================================
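
The page does not say how the index files were produced. One plausible reading of the variable names is that 'sampleIdx' lists the documents of the randomly chosen classes and 'zeroIdx' lists the terms that no longer occur in those documents. The sketch below builds such indices on toy data under that assumption; the data and the names `nClass`, `classes`, `perm`, and `picked` are hypothetical.

```matlab
% Toy data (assumed): 6 documents x 5 terms, 3 classes.
fea = sparse([1 0 2 0 0; 0 1 1 0 0; 0 0 0 3 0; 0 0 1 1 0; 2 0 0 0 1; 0 0 0 0 4]);
gnd = [1; 1; 2; 2; 3; 3];
nClass = 2;                               % number of classes to sample

% Pick nClass distinct classes at random.
classes = unique(gnd);
perm = randperm(numel(classes));
picked = classes(perm(1:nClass));

% Assumed meaning: documents belonging to the picked classes ...
sampleIdx = find(ismember(gnd, picked));
% ... and terms that never occur in those documents.
zeroIdx = find(sum(fea(sampleIdx, :), 1) == 0);

% Apply the indices in the same way as the snippet above.
fea = fea(sampleIdx, :);
gnd = gnd(sampleIdx, :);
fea(:, zeroIdx) = [];
```

Under this reading, removing the 'zeroIdx' columns leaves a matrix in which every remaining term occurs in at least one sampled document.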
All categories in TDT2
In this subset, those documents appearing in two or more categories are removed, thus leaving us with 10,212 documents in total.
Data File: contains variables 'fea' and 'gnd'. 'fea' is the document-term matrix, each row is a document; 'gnd' is the label.
Doc ID: The corresponding document name in the original TDT2 corpus.
Terms: The corresponding feature name (terms) in the original TDT2 corpus. The number is the df of each word.

Random Clusters Index Files in Top 56 Categories (each of these categories contains at least 10 documents):
These random splits were used in our paper: D. Cai et al., "Locally Consistent Concept Factorization for Document Clustering", IEEE TKDE 2011.
2 Classes | 3 Classes | 4 Classes | 5 Classes | 6 Classes | 7 Classes | 8 Classes | 9 Classes | 10 Classes
All categories in Reuters21578
The Reuters-21578 corpus contains 21578 documents in 135 categories. We provide the ModApte version. Documents with multiple category labels are discarded, which leaves 8293 documents in 65 categories. Under the ModApte split, there are 5946 training documents and 2347 test documents. After preprocessing, this corpus contains 18933 distinct terms.
Data File: contains variables 'fea' and 'gnd', 'trainIdx' and 'testIdx'. 'fea' is the document-term matrix, each row is a document; 'gnd' is the label; 'trainIdx' and 'testIdx' are the indexes of the train/test split.
Topics: The category names in the original Reuters corpus.
Doc ID: The corresponding document name in the original Reuters corpus.
Terms: The corresponding feature name (terms) in the original Reuters corpus. The number is the df of each word.

Random Clusters Index Files in Top 30 Categories :
2 Classes | 3 Classes | 4 Classes | 5 Classes | 6 Classes | 7 Classes | 8 Classes | 9 Classes | 10 Classes | 15 Classes | 20 Classes | 25 Classes

Random Clusters Index Files in Top 41 Categories (each of these categories contains at least 10 documents):
These random splits were used in our paper: D. Cai et al., "Locally Consistent Concept Factorization for Document Clustering", IEEE TKDE 2011.
2 Classes | 3 Classes | 4 Classes | 5 Classes | 6 Classes | 7 Classes | 8 Classes | 9 Classes | 10 Classes

For each cluster number, there are 50 random cases. Each case file contains the variables 'sampleIdx' and 'zeroIdx'.
The following MATLAB code generates the corresponding subset:
%===========================================
fea = fea(sampleIdx,:);
gnd = gnd(sampleIdx,:);

fea(:,zeroIdx) = [];
%===========================================
20 Newsgroups (version 1, provided on the homepage)
This version is processed and provided on the homepage of the 20 Newsgroups data set. We provide the MATLAB format simply for convenience.
Data File: contains variables 'fea', 'gnd', 'trainIdx' and 'testIdx'. 'fea' is the document-term matrix, each row is a document; 'gnd' is the label; 'trainIdx' and 'testIdx' are the indexes of the train/test split.
Terms: Corresponding word for each dimension.
Topics: The category name for each class.

We also provide a modified version, in which the words (features) are simply re-arranged according to their document frequency (DF) in the corpus.
Data File: contains variables 'fea', 'gnd', 'trainIdx' and 'testIdx'. 'fea' is the document-term matrix, each row is a document; 'gnd' is the label; 'trainIdx' and 'testIdx' are the indexes of the train/test split.
Terms: Corresponding word for each dimension. The number is the df of each word.
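
Re-arranging features by DF amounts to computing the document frequency of each column and permuting the columns and the term list together. The sketch below shows this on toy data. The descending sort order is an assumption (the page does not state the direction), and the term names are made up.

```matlab
% Toy data (assumed): 4 documents x 3 terms, with one name per term.
fea = sparse([1 0 0; 2 1 0; 0 1 0; 0 3 5]);
terms = {'alpha', 'beta', 'gamma'};

% Document frequency of each term.
df = full(sum(fea > 0, 1));

% Sort terms by DF; descending order is an assumption.
[dfSorted, order] = sort(df, 'descend');

% Reorder the columns and the term list consistently.
fea = fea(:, order);
terms = terms(order);
```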
20 Newsgroups (version 2)
Please find the homepage of the 20 Newsgroups data set here. We use the 20 Newsgroups sorted by date version (20news-bydate.tar.gz). The original website reports 18941 documents, which is not correct: there are only 18846 documents, with 11314 (60%) for training and 7532 (40%) for testing.
The original provider recommends this bydate version: "I recommend the "bydate" version since cross-experiment comparison is easier (no randomness in train/test set selection), newsgroup-identifying information has been removed and it's more realistic because the train and test sets are separated in time."
Data File: contains variables 'fea', 'gnd', 'trainIdx' and 'testIdx'. 'fea' is the document-term matrix, each row is a document; 'gnd' is the label; 'trainIdx' and 'testIdx' are the indexes of the train/test split.
Feature File: Corresponding word for each dimension. The number is the df of each word.
The following MATLAB code generates the training and test sets:
%===========================================
feaTrain = fea(trainIdx,:);
gndTrain = gnd(trainIdx,:);

feaTest = fea(testIdx,:);
gndTest = gnd(testIdx,:);
%===========================================

Besides the original (60%, 40%) split, we also provide other splits (5%, 10%, ... training). Unlike the original split, these splits are purely random (not separated in time).
5% Training | 10% Training | 20% Training | 30% Training | 40% Training | 50% Training
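
For reference, a purely random split of this kind can be sketched as below. This is not the procedure used to create the provided files, which the page does not describe. The toy label vector, the 10% ratio, and the unstratified shuffle are assumptions.

```matlab
% Toy label vector (assumed): 40 documents; only the count matters here.
gnd = repmat((1:4)', 10, 1);
nDoc = numel(gnd);
ratio = 0.10;                        % fraction of documents used for training

% Shuffle document indices and cut at the requested fraction.
perm = randperm(nDoc)';
nTrain = round(ratio * nDoc);
trainIdx = sort(perm(1:nTrain));
testIdx = sort(perm(nTrain+1:end));
```

The resulting 'trainIdx' and 'testIdx' can be used to index 'fea' and 'gnd' in the same way as the original split.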
Selected 4 categories in RCV1
We provide a subset of the RCV1 corpus. Please check this link for details about the RCV1 corpus.
This subset contains 9,625 documents with 29,992 distinct words, drawn from the categories "C15", "ECAT", "GCAT", and "MCAT", which contain 2,022, 2,064, 2,901, and 2,638 documents, respectively.
This data is used in our paper:
Deng Cai, Xiaofei He, "Manifold Adaptive Experimental Design for Text Categorization," IEEE Transactions on Knowledge and Data Engineering, 28 Apr. 2011.
Data File: contains variables 'fea' and 'gnd'. 'fea' is the document-term matrix, each row is a document; 'gnd' is the label.
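
A quick sanity check after loading is to count the documents per label. The sketch below uses a toy 'gnd' (an assumption). With the real file, the four counts should match the category sizes listed above and sum to 9,625 (2,022 + 2,064 + 2,901 + 2,638), although the page does not state which label value corresponds to which category.

```matlab
% Toy label vector (assumed); with the real data, gnd comes from the data file.
gnd = [1; 1; 2; 3; 3; 3; 4; 4];

% Map each label to a class index, then count documents per class.
[classes, ~, ic] = unique(gnd);
counts = accumarray(ic, 1);

for k = 1:numel(classes)
    fprintf('class %d: %d documents\n', classes(k), counts(k));
end
```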