Collecting and sorting data has always been a time-consuming and tedious task for AI developers and researchers. Here we list 20 sites where high-quality data is ready and free, in the hope of helping you find the right dataset for your AI model.
1. CORe50
CORe50, specifically designed for (C)ontinual (O)bject (Re)cognition, is a collection of 50 domestic objects belonging to 10 categories: plug adapters, mobile phones, scissors, light bulbs, cans, glasses, balls, markers, cups and remote controls. Classification can be performed at object level (50 classes) or at category level (10 classes).
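The relation between the two labeling levels can be sketched in a few lines of Python. This is only an illustration: it assumes objects are numbered 0 to 49 and grouped five per category in the order listed above, which may not match the numbering used in the actual download, and `category_of` is a hypothetical helper name.

```python
# Category names in the order given in the description above.
CATEGORIES = [
    "plug adapters", "mobile phones", "scissors", "light bulbs", "cans",
    "glasses", "balls", "markers", "cups", "remote controls",
]

OBJECTS_PER_CATEGORY = 5  # 50 objects / 10 categories


def category_of(object_id: int) -> str:
    """Map an object-level label (0..49, assumed numbering) to its category."""
    if not 0 <= object_id < len(CATEGORIES) * OBJECTS_PER_CATEGORY:
        raise ValueError(f"object_id out of range: {object_id}")
    return CATEGORIES[object_id // OBJECTS_PER_CATEGORY]


# Object-level labels collapse to category-level labels.
object_labels = [0, 7, 49]
category_labels = [category_of(i) for i in object_labels]
```

Training on `object_labels` gives the 50-class task, while `category_labels` gives the 10-class task on the same images.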
2. Caltech 101
Pictures of objects are classified into 101 categories, with about 40 to 800 images per category. Most categories have about 50 images. Each image is roughly 300 x 200 pixels. The image data was collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato of Caltech.
3. STL-10 dataset
The STL-10 dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications. In particular, each class has fewer labeled training examples than in CIFAR-10, but a very large set of unlabeled examples is provided to learn image models prior to supervised training.
4. THE NORB DATASET, V1.0
This database is intended for experiments in 3D object recognition from shape. It contains images of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The objects were imaged by two cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees every 5 degrees), and 18 azimuths (0 to 340 every 20 degrees).
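The viewing conditions above multiply out to a fixed number of views per object. The short Python sketch below enumerates them from the figures in the description; the variable names are illustrative, and the totals are derived only from the numbers quoted here, not from the dataset files.

```python
from itertools import product

lightings = range(6)             # 6 lighting conditions
elevations = range(30, 71, 5)    # 30 to 70 degrees, every 5 degrees
azimuths = range(0, 341, 20)     # 0 to 340 degrees, every 20 degrees

# Every (lighting, elevation, azimuth) combination is one view of an object.
views = list(product(lightings, elevations, azimuths))

views_per_object = len(views)            # views seen by each camera
stereo_pairs_total = 50 * views_per_object  # 50 toys, one image pair per view
images_total = 2 * stereo_pairs_total    # two cameras per view

print(len(elevations), len(azimuths), views_per_object)
```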
5. ImageNet
ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research. The data is available for free to researchers for non-commercial use.
6. CBT (Children’s Book Test)
Children’s Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg.
7. The UCI Machine Learning Repository
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an FTP archive in 1987 by David Aha and fellow graduate students at UC Irvine.
8. Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources.
9. Kaggle
Kaggle is an online community of data scientists and machine learning practitioners, where open datasets for computer vision, NLP, speech recognition, and others are available.
10. Chinese Classification Corpus from Fudan University
This is a dataset for NLP, containing 9,833 documents for testing and 9,804 documents for training. The dataset is provided by the International Corpus Center of the Department of Computation and Information Technology of Fudan University.
11. CMU Question-Answer Dataset
The site links to a corpus of Wikipedia articles, manually generated factoid questions about them, and manually generated answers to these questions, for use in academic research.
12. The bAbI project
The bAbI project of Facebook AI Research is organized towards the goal of automatic text understanding and reasoning. The datasets released include The (20) QA bAbI tasks, The Movie Dialog dataset, The WikiMovies dataset, and others.
13. Data Portals
With 590 data portals listed, DataPortals.org is the most comprehensive list of open data portals in the world. It is curated by a group of leading open data experts from around the world - including representatives from local, regional and national governments, international organizations such as the World Bank, and numerous NGOs.
14. Multi-Domain Sentiment Dataset
The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and DVDs) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
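Converting star ratings into binary labels might look like the Python sketch below. The threshold is an assumption made for illustration (above 3 stars is positive, below 3 is negative, and 3-star reviews are treated as ambiguous and dropped), and `star_to_binary` is a hypothetical helper, not part of the dataset's tooling.

```python
from typing import Optional


def star_to_binary(stars: float) -> Optional[int]:
    """Return 1 for positive, 0 for negative, None for ambiguous reviews.

    Assumed convention: > 3 stars is positive, < 3 stars is negative,
    and exactly 3 stars carries no clear sentiment.
    """
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None


ratings = [5, 1, 3, 4, 2]
labels = [star_to_binary(r) for r in ratings]

# Keep only the reviews that received a clear label.
labeled = [(r, y) for r, y in zip(ratings, labels) if y is not None]
```

Dropping the middle rating keeps the two classes cleanly separated; a different threshold may suit other tasks better.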
Thousands of datasets are available here, ranging from financial market data and population growth to cryptocurrency prices.
16. Academic Torrents
Academic Torrents is a community-maintained distributed repository for datasets and scientific knowledge, where over 83TB of research data is available.
17. OpenML
OpenML is a collaborative system for machine learning that hosts datasets of various types in fields such as health, mechanics, the internet, and finance.
18. GitHub
Besides a massive amount of open-source code, a large number of datasets can also be found on GitHub and downloaded for free.
19. OpenSLR
OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition. More than 1,000 hours of English speech datasets can be found and downloaded for free.
20. MagicHub
MagicHub is an open-source dataset community developed and maintained by Magic Data Tech. Datasets covering multiple languages for machine learning model training are available.