Collecting and sorting data has always been a time-consuming and tedious procedure for AI developers and researchers. Here we list 20 sites where high-quality data is ready and free, in the hope of helping you locate the right dataset for your AI model.
1. CORe50
CORe50, specifically designed for (C)ontinual (O)bject (Re)cognition, is a collection of 50 domestic objects belonging to 10 categories: plug adapters, mobile phones, scissors, light bulbs, cans, glasses, balls, markers, cups and remote controls. Classification can be performed at object level (50 classes) or at category level (10 classes).
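The relationship between the two label granularities can be sketched in a few lines of Python. This is only an illustration: it assumes zero-based object IDs and that the 50 objects are grouped consecutively, five per category, in the order the categories are listed above. Check the dataset's own documentation before relying on that ordering. The name category_of is hypothetical.

```python
# Category names in the order given in the description above.
CATEGORIES = [
    "plug adapters", "mobile phones", "scissors", "light bulbs", "cans",
    "glasses", "balls", "markers", "cups", "remote controls",
]

NUM_OBJECTS = 50
# ASSUMPTION: objects are evenly split and stored consecutively by category.
OBJECTS_PER_CATEGORY = NUM_OBJECTS // len(CATEGORIES)  # 5


def category_of(object_id: int) -> int:
    """Map an object-level label (0..49) to a category-level label (0..9)."""
    if not 0 <= object_id < NUM_OBJECTS:
        raise ValueError(f"object_id must be in [0, {NUM_OBJECTS}), got {object_id}")
    return object_id // OBJECTS_PER_CATEGORY


if __name__ == "__main__":
    for obj in (0, 12, 49):
        print(obj, "->", CATEGORIES[category_of(obj)])
```

Training at category level then amounts to applying this mapping to every object-level label before computing the loss.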
2. Caltech 101
Pictures of objects are grouped into 101 categories, with about 40 to 800 images per category. Most categories have about 50 images. The size of each image is roughly 300 x 200 pixels. The image data was collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato of Caltech.
3. STL-10 dataset
The STL-10 dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications. In particular, each class has fewer labeled training examples than in CIFAR-10, but a very large set of unlabeled examples is provided to learn image models prior to supervised training.
4. THE NORB DATASET, V1.0
This database is intended for experiments in 3D object recognition from shape. It contains images of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The objects were imaged by two cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees, every 5 degrees), and 18 azimuths (0 to 340 degrees, every 20 degrees).
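To get a feel for the size of this capture grid, the numbers quoted above can be multiplied out. The short Python sketch below only enumerates the combinations described in the text; the variable names are illustrative and are not part of the dataset's file format.

```python
from itertools import product

NUM_TOYS = 50
NUM_CAMERAS = 2
LIGHTINGS = range(6)            # 6 lighting conditions
ELEVATIONS = range(30, 71, 5)   # 30, 35, ..., 70 degrees
AZIMUTHS = range(0, 341, 20)    # 0, 20, ..., 340 degrees

# Every (lighting, elevation, azimuth) combination is one view of a toy.
views = list(product(LIGHTINGS, ELEVATIONS, AZIMUTHS))
views_per_toy = len(views)
total_views = NUM_TOYS * views_per_toy
total_images = NUM_CAMERAS * total_views

if __name__ == "__main__":
    print("elevations:", len(ELEVATIONS), "azimuths:", len(AZIMUTHS))
    print("views per toy:", views_per_toy)
    print("views over all toys:", total_views)
    print("images from both cameras:", total_images)
```

With the figures given, each toy is seen from 6 x 9 x 18 = 972 viewpoint and lighting combinations.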
5. ImageNet
ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research. The data is available for free to researchers for non-commercial use.
6. CBT (Children’s Book Test)
Children’s Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg.
7. The UCI Machine Learning Repository
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an FTP archive in 1987 by David Aha and fellow graduate students at UC Irvine.
8. Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources.
9. Kaggle
Kaggle is an online community of data scientists and machine learning practitioners, where open datasets for computer vision, NLP, speech recognition, and other fields are available.
10. Chinese Classification Corpus from Fudan University
This is a dataset for NLP, containing 9,833 documents for testing and 9,804 documents for training. The dataset is provided by the International Corpus Center of the Department of Computation and Information Technology of Fudan University.
11. CMU Question-Answer Dataset
This is a corpus of Wikipedia articles, manually generated factoid questions based on them, and manually generated answers to these questions, for use in academic research.
12. The bAbI project
The bAbI project of Facebook AI Research is organized towards the goal of automatic text understanding and reasoning. The datasets released include The (20) QA bAbI tasks, The Movie Dialog dataset, The WikiMovies dataset, and others.
13. Data Portals
With 590 data portals listed, DataPortals.org is the most comprehensive list of open data portals in the world. It is curated by a group of leading open data experts from around the world - including representatives from local, regional and national governments, international organizations such as the World Bank, and numerous NGOs.
14. Multi-Domain Sentiment Dataset
The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and DVDs) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
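One common way to turn star ratings into binary sentiment labels is sketched below in Python. The threshold is an assumption made for illustration: ratings above 3 stars count as positive, ratings below 3 as negative, and 3-star reviews are treated as ambiguous and dropped. The function name star_to_binary and the sample ratings are hypothetical and not taken from the dataset.

```python
from typing import Optional


def star_to_binary(stars: float) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None for an ambiguous rating.

    ASSUMPTION: > 3 stars is positive, < 3 stars is negative,
    exactly 3 stars carries no clear sentiment and is skipped.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {stars}")
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None


# Made-up ratings, purely to show the conversion.
ratings = [5, 1, 3, 4, 2]
labels = [star_to_binary(r) for r in ratings]

# Keep only the reviews that received a binary label.
labeled = [(r, y) for r, y in zip(ratings, labels) if y is not None]

if __name__ == "__main__":
    print(labels)
    print(labeled)
```

Dropping the middle rating keeps the two classes cleanly separated, at the cost of discarding some reviews; a different cut-off may suit other tasks better.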
There are thousands of datasets, ranging from financial market data and population growth to cryptocurrency prices.
16. Academic Torrents
Academic Torrents is a community-maintained distributed repository for datasets and scientific knowledge, where over 83TB of research data is available.
17. OpenML
OpenML is a collaborative system for machine learning, where datasets of various types in the fields of health, mechanics, internet, and finance are available.
18. GitHub
Besides a massive amount of open-source code, a large number of datasets can also be found on GitHub and downloaded for free.
19. OpenSLR
OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition. More than 1,000 hours of English speech data can be found and downloaded for free.
20. MagicHub
MagicHub is an open-source dataset community developed and maintained by Magic Data Tech. Datasets covering multiple languages for machine learning model training are available.