Training artificial intelligence models requires huge volumes of data. For some AI companies and researchers, this data can be difficult and time-consuming to collect. Open-source data can help mitigate these challenges and accelerate the development of AI.
The MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA TECHNOLOGY Co., Ltd. and freely published for non-commercial use. The corpus consists of 755 hours of scripted read speech from 1,000 native speakers of Mandarin Chinese as spoken in mainland China.
The Japanese Read Speech Recognition Corpus was developed by MAGIC DATA TECHNOLOGY Co., Ltd. and totals 1,500 hours. A 30-hour subset of scripted read speech was prepared and freely published for non-commercial use. The 37 native speakers come from different areas, including Tokyo, Osaka, and Hokkaido. The corpus is a test set, recorded indoors, and the output is in PCM format. The recording texts are drawn from daily conversation.
The MAGICDATA Kid Voice TTS Corpus in Mandarin Chinese was recorded by a four-year-old Chinese girl born in Beijing, China. For this release, we have published 15 minutes of speech data from the corpus for non-commercial use.