21 Jun 29
Worth Bookmarking! | 20 Websites to Get Free Datasets for Your AI Model Training

Collecting and organizing data has always been a time-consuming and tedious process for AI developers and researchers. Below we list 20 sites that offer high-quality data for free, in the hope of helping you find the right dataset for your AI model.

Image Data

1. CORe50

CORe50, specifically designed for (C)ontinual (O)bject (Re)cognition, is a collection of 50 domestic objects belonging to 10 categories: plug adapters, mobile phones, scissors, light bulbs, cans, glasses, balls, markers, cups and remote controls. Classification can be performed at object level (50 classes) or at category level (10 classes).
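The two label granularities can be related with a small lookup. The following is a minimal sketch, not the dataset's official loader: it assumes the 50 objects are split evenly (5 per category) and numbered contiguously in the category order listed above, and the name `category_label` is hypothetical.

```python
# Category names in the order the article lists them.
CATEGORIES = [
    "plug adapters", "mobile phones", "scissors", "light bulbs", "cans",
    "glasses", "balls", "markers", "cups", "remote controls",
]

# Assumption: 50 objects divided evenly over 10 categories, numbered contiguously.
OBJECTS_PER_CATEGORY = 5


def category_label(object_label: int) -> int:
    """Map a 0-based object label (0..49) to a 0-based category label (0..9)."""
    if not 0 <= object_label < len(CATEGORIES) * OBJECTS_PER_CATEGORY:
        raise ValueError(f"object label out of range: {object_label}")
    return object_label // OBJECTS_PER_CATEGORY


# Object-level labels (50 classes) collapsed to category-level labels (10 classes).
object_labels = [0, 4, 5, 12, 49]
category_labels = [category_label(o) for o in object_labels]
print(category_labels)
print(CATEGORIES[category_label(12)])
```

Training on `object_labels` gives the 50-class task, while training on `category_labels` gives the 10-class task on the same images.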

URL: https://vlomonaco.github.io/core50/

2. Caltech 101

Pictures of objects are classified into 101 categories, with about 40 to 800 images per category; most categories have about 50 images. Each image is roughly 300 x 200 pixels. The image data was collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato of Caltech.

URL: http://www.vision.caltech.edu/Image_Datasets/Caltech101/

3. STL-10 dataset

The STL-10 dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications. In particular, each class has fewer labeled training examples than in CIFAR-10, but a very large set of unlabeled examples is provided to learn image models prior to supervised training.

URL: https://cs.stanford.edu/~acoates/stl10/

4. THE NORB DATASET, V1.0

This database is intended for experiments in 3D object recognition from shape. It contains images of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The objects were imaged by two cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees every 5 degrees), and 18 azimuths (0 to 340 degrees every 20 degrees).
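The counts quoted above can be checked with a few lines of Python. This is a sketch of the arithmetic only: it assumes every toy was captured under every combination of lighting, elevation, and azimuth, which the description implies but does not state outright. The variable names are illustrative.

```python
from itertools import product

NUM_TOYS = 50
NUM_CAMERAS = 2

lightings = range(6)            # 6 lighting conditions
elevations = range(30, 71, 5)   # 30, 35, ..., 70 degrees
azimuths = range(0, 341, 20)    # 0, 20, ..., 340 degrees

# Every (lighting, elevation, azimuth) combination is one viewing condition.
views = list(product(lightings, elevations, azimuths))
views_per_toy = len(views)

# Assumption: each toy is imaged once per viewing condition by each camera.
stereo_pairs = NUM_TOYS * views_per_toy
total_images = stereo_pairs * NUM_CAMERAS

print(len(elevations), len(azimuths))   # 9 18
print(views_per_toy)                    # 972
print(stereo_pairs, total_images)       # 48600 97200
```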

URL: https://cs.nyu.edu/~ylclab/data/norb-v1.0/

5. ImageNet

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research. The data is available for free to researchers for non-commercial use.

URL: http://image-net.org/index

6. CBT (Children’s Book Test)

Children’s Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg.

URL: http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz

7. The UCI Machine Learning Repository

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine.

URL: http://archive.ics.uci.edu/ml/index.php

Miscellaneous

8. Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources.

URL: https://registry.opendata.aws/

9. Kaggle

Kaggle is an online community of data scientists and machine learning practitioners, where open datasets for computer vision, NLP, speech recognition, and others are available.

URL: https://www.kaggle.com/

10. Chinese Classification Corpus from Fudan University

This NLP dataset contains 9,833 documents for testing and 9,804 documents for training. It is provided by the International Corpus Center of the Department of Computation and Information Technology at Fudan University.

URL: https://www.kesci.com/mw/dataset/5d3a9c86cf76a600360edd04

11. CMU Question-Answer Dataset

This page links to a corpus of Wikipedia articles, manually generated factoid questions about them, and manually generated answers to those questions, for use in academic research.

URL: http://www.cs.cmu.edu/~ark/QA-data/

12. The bAbI project

The bAbI project of Facebook AI Research is organized towards the goal of automatic text understanding and reasoning. The datasets released include the (20) QA bAbI tasks, the Movie Dialog dataset, the WikiMovies dataset, and others.

URL: https://research.fb.com/projects/babi/

13. Data Portals

With 590 data portals listed, DataPortals.org is the most comprehensive list of open data portals in the world. It is curated by a group of leading open data experts from around the world - including representatives from local, regional and national governments, international organizations such as the World Bank, and numerous NGOs.

URL: http://dataportals.org/

14. Multi-Domain Sentiment Dataset

The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com across many product types (domains). Some domains (books and DVDs) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
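Converting star ratings into binary labels might look like the sketch below. The threshold is an assumption, since the article does not specify one: ratings above 3 stars are treated as positive, ratings below 3 as negative, and 3-star reviews are dropped as ambiguous. The function name and sample reviews are hypothetical.

```python
from typing import Optional


def stars_to_binary(stars: float) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None for an ambiguous 3-star rating."""
    if not 1 <= stars <= 5:
        raise ValueError(f"rating must be between 1 and 5 stars, got {stars}")
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None  # assumption: neutral reviews are excluded rather than labeled


# Hypothetical (review text, star rating) pairs for illustration only.
reviews = [
    ("Great read, could not put it down.", 5),
    ("Arrived scratched and skips constantly.", 1),
    ("It was fine.", 3),
    ("Solid value for the price.", 4),
]

# Keep only reviews that receive a definite binary label.
labeled = []
for text, stars in reviews:
    label = stars_to_binary(stars)
    if label is not None:
        labeled.append((text, label))

print([label for _, label in labeled])  # [1, 0, 1]
```

Dropping the neutral reviews keeps the two classes cleanly separated, at the cost of discarding some data.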

URL: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/

15. DataHub

DataHub hosts thousands of datasets, ranging from financial market data and population growth to cryptocurrency prices.

URL: https://datahub.io

16. Academic Torrents

Academic Torrents is a community-maintained distributed repository for datasets and scientific knowledge, with over 83TB of research data available.

URL: https://academictorrents.com/

17. OpenML

OpenML is a collaborative platform for machine learning that hosts datasets of various types in fields such as health, mechanics, the internet, and finance.

URL: https://www.openml.org/

18. GitHub

Besides a massive amount of open-source code, a large number of datasets can also be found on GitHub and downloaded for free.

URL: https://github.com/awesomedata/awesome-public-datasets

19. OpenSLR

OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition. More than 1,000 hours of English speech datasets can be found and downloaded for free.

URL: http://www.openslr.org/

20. MagicHub

MagicHub is an open-source dataset community developed and maintained by Magic Data Tech. It offers datasets covering multiple languages for machine learning model training.

URL: http://magichub.io
