A new round of technological revolution and industrial transformation is under way, driving the rapid evolution of digital technologies represented by artificial intelligence, and humanity is moving toward an intelligent society. According to the White Paper on the Core Technology Industry of Artificial Intelligence, released by the China Academy of Information and Communications Technology in April this year, AI now covers the basic elements of social operation and has improved overall operating efficiency. In the future, AI will be as ubiquitous as water and electricity, disrupting and transforming every industry.
Data plays an important role in supporting the development of artificial intelligence: AI models need massive amounts of data to train and optimize. Among the three core elements of AI (data, algorithms, and computing power), the focus is shifting from algorithms to data. Data sets the upper bound for machine learning, and only when developers value data can more accurate models be trained. Andrew Ng, a leading machine-learning expert, argues that machine learning will evolve rapidly if more emphasis is placed on data rather than on models.
755 Hours of Mandarin Chinese Speech Corpora Newly Arrive on MagicHub
To drive the development of artificial intelligence technology, Magic Data Tech launched the open-source community MagicHub (https://magichub.io/), which releases large amounts of data for developers around the world. Magic Data Tech has now released a further batch of 755 hours of Mandarin Chinese speech corpora in the community. The corpus was previously open-sourced on OpenSLR and will be linked to the community for free download. Click here to download.
The 755-hour Mandarin Chinese speech corpus, which has a total duration of 10566.9 hours, has been instrumental in Exploring Methods for the Automatic Detection of Errors in Manual Transcription, a research project by the Center for Language and Speech Processing at Johns Hopkins University.
Indonesian and Malay Conversational Speech Corpora
Magic Data Tech recently released Indonesian and Malay conversational speech corpora in the MagicHub community, providing developers with high-quality conversational AI training data.
The Indonesian conversational speech corpus contains free conversations among more than 800 native Indonesian speakers, recorded in indoor environments. Five hours of the Indonesian conversational speech corpus are open-source on MagicHub. Click here to download.
The Malay conversational speech corpus captures free conversations among nearly 700 Malaysians, recorded in indoor environments. Five hours of the Malay conversational speech corpus are open-source on MagicHub.io. Click here to download.
Adhering to the spirit of "share, innovate and grow," MagicHub provides open-source conversational AI training data for the industry. Magic Data Tech has so far released more than 30 open-source datasets (nearly 1,000 hours) in the community, including corpora in English, Spanish, Italian, Korean, Japanese, and Chinese (Mandarin, Cantonese, Sichuan and Shanghai dialects, etc.), as well as an in-vehicle noise dataset, lexicons, and more. We also welcome data owners to release datasets on MagicHub and help build a better open-source ecosystem.
As AI research and development advances in both depth and breadth, the need for structured data is growing explosively. Meanwhile, the data labeling industry is decentralizing: production of structured data is shifting from large-scale third-party data processing centers to scattered data end users.
The annual INTERSPEECH conference, held from August 30 to September 3, 2021, is a global conference organized by the International Speech Communication Association (ISCA). INTERSPEECH 2021 is held in hybrid form: participants can join the conference online or in person in Brno, Czech Republic.
Annotator® 5.0, an independently developed data annotation system, was officially launched by Magic Data Tech on July 8, 2021, at the World Artificial Intelligence Conference (WAIC) 2021, a global gathering for exchanging AI innovation ideas, technologies, and applications, held in Shanghai between July 7 and July 10.
Collecting and sorting data has always been a time-consuming and tedious task for AI developers and researchers. Here we list 20 sites where high-quality data is free and ready to use, in the hope of helping you find the right dataset for your AI model more easily.
A batch of datasets for conversational AI was newly released on MagicHub.io, our open-source community. Let’s have a quick look.