A new round of technological revolution and industrial transformation is under way, driving the rapid evolution of digital technologies such as artificial intelligence and moving humanity toward an intelligent society. According to the White Paper on the Core Technology Industry of Artificial Intelligence, released by the China Academy of Information and Communications Technology in April this year, AI now reaches the basic elements of social operation and has improved overall operating efficiency. In the future, AI will be as ubiquitous as water and electricity, disrupting and transforming every industry.
Data plays an important role in supporting the development of artificial intelligence. AI models need massive amounts of data for training and optimization. Among the three core elements of AI (data, algorithms, and computing power), the focus is shifting from algorithms to data. Data sets the upper bound for machine learning, and more accurate models can be trained only when developers value data. Andrew Ng, a leading machine-learning expert, argues that machine learning will evolve rapidly if more emphasis is placed on data rather than on models.
755 Hours of Mandarin Chinese Speech Corpora Newly Arrive on MagicHub
To drive the development of artificial intelligence technology, Magic Data Tech launched the open-source community MagicHub (https://magichub.io/), which releases large amounts of data to developers around the world. Magic Data Tech has now released a further 755 hours of Mandarin Chinese speech corpora in the community. The corpus was previously open-sourced on OpenSLR and is now linked in the community for free download.
The 755-hour Mandarin Chinese speech corpus, which has a total duration of 10566.9 hours, has been instrumental in "Exploring Methods for the Automatic Detection of Errors in Manual Transcription," a research project by the Center for Language and Speech Processing at Johns Hopkins University.
Indonesian and Malay Conversational Speech Corpora
Magic Data Tech recently released Indonesian and Malay conversational speech corpora in the MagicHub community, providing developers with high-quality conversational AI training data.
The Indonesian conversational speech corpus contains free conversations among more than 800 native Indonesian speakers, recorded in indoor environments. Five hours of the Indonesian corpus are open source on MagicHub and available for download.
The Malay conversational speech corpus captures free conversations among nearly 700 Malaysians, recorded in indoor environments. Five hours of the Malay corpus are open source on MagicHub and available for download.
Adhering to the spirit of "share, innovate and grow," MagicHub provides open-source conversational AI training data for the industry. Magic Data Tech has so far released more than 30 open-source datasets (nearly 1,000 hours) in the community, including corpora in English, Spanish, Italian, Korean, Japanese, and Chinese (Mandarin, Cantonese, and the Sichuan and Shanghai dialects, among others), as well as an in-vehicle noise dataset, lexicons, and more. We also welcome data owners to release their datasets on MagicHub and help build a better open-source ecosystem.