New open data sets from Microsoft Research

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft has released a number of data sets produced by Microsoft Research and made them available for download at Microsoft Research Open Data.

The data sets in Microsoft Research Open Data are categorized by primary research area, such as Physics, Social Science, Environmental Science, and Information Science. Many have not previously been available to the public, and many are large and useful for research in AI and machine learning techniques. Many also include links to associated papers from Microsoft Research. For example, the 10 GB DESM Word Embeddings dataset provides the IN and OUT word2vec embeddings for 2.7M words, trained on a Bing query corpus of 600M+ queries.

Other data sets of note include:

  • A collection of 38M tweets related to the 2012 US election
  • 3-D capture data from individuals performing a variety of hand gestures
  • Infer.NET, a framework for running Bayesian inference in graphical models
  • Images for 1 million celebrities, and associated tags
  • MS MARCO, a new large-scale dataset for reading comprehension and question answering

Most data sets are provided as plain text files, suitable for importing into Python, R, or other analysis tools. In addition to downloading the data, you can also deploy the datasets to Microsoft Azure for analysis: see the FAQ for details on this and other ways the datasets can be used.
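As a rough illustration of what importing a plain text file into Python can look like, here is a minimal sketch using only the standard library. The tab delimiter, the column names, and the sample rows are assumptions made up for this example; the actual layout varies from one data set to the next, so check each data set's documentation before loading it.

```python
import csv
import io


def read_delimited(source, delimiter="\t"):
    """Parse delimited plain text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(source, delimiter=delimiter)
    return [dict(row) for row in reader]


# Tiny in-memory stand-in for a downloaded file (made-up contents,
# not taken from any real Microsoft Research data set).
sample = io.StringIO("word\tcount\nazure\t3\nbing\t5\n")
rows = read_delimited(sample)
print(rows[0]["word"], len(rows))  # prints "azure 2"
```

For a file on disk, the same function can be given an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the download. All values come back as strings, so numeric columns need converting before analysis.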

As of this writing, the archive includes 51 datasets. You can explore and download the Microsoft Research datasets at the link below.

Microsoft: Microsoft Research Open Data
