Data for MATLAB hackers

Here are some datasets in MATLAB format. I'm working on better documentation, but if you decide to use one of these and don't have enough info, send me a note and I'll try to help. Also, if you discover something, let me know and I'll try to include it for others. There is a Matlab Tutorial here.

Handwritten Digits

  • MNIST Handwritten Digits [data/mnist_all.mat]
    [training pictures: 0 1 2 3 4 5 6 7 8 9 ]
    [testing pictures: 0 1 2 3 4 5 6 7 8 9 ]
    8-bit grayscale images of "0" through "9"; about 6K training examples and 1K test examples of each class.
  • USPS Handwritten Digits [data/usps_all.mat]
    [pictures: 0 1 2 3 4 5 6 7 8 9 ]
    8-bit grayscale images of "0" through "9"; 1100 examples of each class.
  • Binary Alphadigits [data/binaryalphadigs.mat] [picture]
    Binary 20x16 digits of "0" through "9" and capital "A" through "Z". 39 examples of each class.
    From Simon Lucas' (sml@essex.ac.uk) Algoval system.

Faces

  • If you want a real face dataset, I strongly recommend the UMass project: Labeled Faces in the Wild.

  • Frey Face [data/frey_rawface.mat] [picture]
    From Brendan Frey. Almost 2000 images of Brendan's face, taken from sequential frames of a small video. Size: 20x28.

  • Olivetti Faces [data/olivettifaces.mat] [picture]
    Grayscale faces 8 bit [0-255], a few images of several different people.
    400 total images, 64x64 size.
    From the Olivetti database at AT&T.

  • UMist Faces [data/umist_cropped.mat] [picture]
    Grayscale faces 8 bit [0-255], a few images (views) of 20 different people.
    575 total images, 112x92 size, manually cropped by Daniel Graham at UMist
    Citation: Characterizing Virtual Eigensignatures for General Purpose Face Recognition, Daniel B Graham and Nigel M Allinson. In Face Recognition: From Theory to Applications; NATO ASI Series F, Computer and Systems Sciences, Vol. 163; H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulie and T. S. Huang (eds), pp 446-456, 1998.
    [original uncropped data]

Text

  • Word Counts from Encyclopedia Articles
    Here's a tiny subset of word counts from some Grolier encyclopedia articles. Only the 15K most common words are used in the vocabulary, and only about 31K articles are represented.
    The data is represented as a sparse matrix of counts.
    In the csv file, for each article there is one line of the form:
    article_number,word_id,word_count,word_id,word_count,...
    In the matlab sparse matrix, each row is a word and each column is an article and the entries are the counts.
    [the word list | csv ascii data | matlab sparse matrix data]
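
    For anyone working from the csv file rather than the matlab matrix, here is a minimal Python sketch of reading the line format described above into a sparse structure with the same orientation (word by article). The function names and the sample lines are made up for illustration, and the sketch assumes that ids and counts are plain integers. Ids are stored exactly as they appear in the file, since the page does not say whether they start at 0 or 1.

```python
def parse_counts_line(line):
    """Split one 'article_number,word_id,word_count,...' line.

    Returns (article_number, [(word_id, word_count), ...]).
    Assumes every field is an integer (an assumption, not stated by the source).
    """
    fields = [int(f) for f in line.strip().split(",") if f.strip()]
    article, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected word_id,word_count pairs after the article number")
    # Even positions are word ids, odd positions are the matching counts.
    return article, list(zip(rest[0::2], rest[1::2]))


def build_sparse_counts(lines):
    """Collect counts into a dict keyed by (word_id, article_number).

    This mirrors the matlab layout: rows are words, columns are articles,
    and only nonzero entries are stored.
    """
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        article, pairs = parse_counts_line(line)
        for word_id, count in pairs:
            counts[(word_id, article)] = count
    return counts


# Hypothetical sample lines, not taken from the real data file.
sample = ["1,5,2,17,1", "2,5,4"]
counts = build_sparse_counts(sample)
```

    With the real file, the lines would come from iterating over the opened csv file instead of the `sample` list. A dict of (row, column) keys is used here only to keep the sketch dependency-free; the same triplets could be handed to a sparse matrix library.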

  • PNAS Titles
    The titles of every paper to appear in the Proceedings of the National Academy of Sciences until March 2005, along with the date of publication of each paper. The data was obtained by crawling the PNAS website and downloading the table of contents from every issue of every volume, which yielded about 80,000 papers over the years 1915-2005.
    [raw html ascii data matlab raw data ]

  • NIPS Conference Papers Vols 0-12 [matlab or raw data]
    A whole lot of fun! I massaged the OCR'd data from NIPS1-12 (the pre-electronic submission era) that Yann made available.
    I've included a tarball of the massaged raw data, as well as a matlab package in which the data has already been read in and pre-processed.
    See the readme file for the raw data massaging notes and the matlab notes file for explanations of the matlab data.
    There are also a couple of extra matlab files containing conference and page number info (which you can't make yourself, but which seems boring to me) and word counts by author (which is cool, but which you could easily have made yourself).
    NEW: Check out Gal's page with updated data for NIPS1-17.

  • 20 Newsgroups [data/20news_w100.mat]
    A tiny version of the 20newsgroups data, with binary occurrence data for 100 words across 16242 postings. I've also tagged the postings by the highest level domain in the array "newsgroups".

Articulatory Speech Data

  • See this page for resources relating to the University of Wisconsin X-ray Microbeam Database (UBDB). Hopefully coming soon.

Sam Roweis, Vision, Learning and Graphics Group, NYU, www.cs.nyu.edu/~roweis