We are honored to announce that our Chinese Mandarin Conversational Speech corpus has been selected for the LDC Catalog! The catalog number is LDC2019S23 (you can find it at https://catalog.ldc.upenn.edu/LDC2019S23).
New trends for conversational datasets
AI continues to expand into new use cases and new verticals. As leading companies such as Google and Amazon pay more attention to continuous conversation, the importance of conversational datasets increases. In addition, recognition accuracy on read speech reaches 97-98%, but on conversational speech it is only about 50% (according to results of the CHiME-5 Challenge). This large gap means that the challenge in automatic speech recognition (ASR) has entered a new phase.
This is an excellent testing dataset for conversational speech recognition models
This corpus has three main characteristics, covering data collection, annotation and application. For data collection, the key word is diversity. The data cover conversations recorded in different accents and over different transmission channels, with speakers of different ages and genders and with background noise that matches each scenario.
1) Speakers: 60 speakers from different areas of China, ranging in age from 4 to 67.
2) Recording environment: 3 rooms with different reverberation
3) Recording equipment: Android device (9 varieties); iOS device (8 varieties); recorder (2 varieties)
4) Recording channels: single-channel and multi-channel
5) The corpus contains both far-field and near-field speech.
For data annotation, the key word is accuracy. The annotation work complies with strict specifications and documentation and is completed by trained annotators. Our team has formulated a series of tagging rules to meet actual needs. What does that mean? Spontaneous conversation produces overlapping speech, pauses, coughs, and clapping. These sounds are meaningful in some conditions, as they may indicate the speaker's state and mood, and even hint at the speaker's mental activity. Under the company’s advanced annotation specifications, these sounds are tagged so that AI systems can recognize them.
For data application, the key word is variety. This corpus is valuable for at least 3 applications: conversational speech recognition, speaker separation and robustness testing.
1) Accuracy testing of various speech recognition models. For example, in a typical family application scenario, the family members using voice interaction include the elderly, the wife (adult female), the husband (adult male), and the children. These family members have different pronunciation patterns and habits. The age diversity of the corpus can be used to test how well a speech recognition model performs for different age groups.
2) Accuracy testing of speaker separation. Scene recognition based on a specific speaker has become a research hotspot. The collection includes both single-speaker and multi-speaker recording channels. Therefore, this dataset can be used to test the accuracy of speaker separation tasks.
3) Robustness testing of the model. Since far-field and near-field speech are recorded at the same time, different audio files contain different reverberation and background noise. The corpus is valuable for researchers who want to test the robustness of their systems.
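As an illustration of the age-group accuracy testing described in 1), here is a minimal Python sketch that computes character error rate (CER), a metric commonly used for Mandarin ASR, separately for each age band. It is a sketch under assumptions, not the evaluation procedure used for this corpus: the `(age, reference, hypothesis)` tuple layout, the function names, and the age band boundaries are all hypothetical and are not taken from the corpus documentation.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def age_band(age):
    """Map an age to a band. The boundaries are hypothetical."""
    if age < 13:
        return "child"
    if age < 60:
        return "adult"
    return "senior"


def cer_by_age_band(utterances):
    """Return {band: CER} for (age, reference, hypothesis) tuples."""
    errors = defaultdict(int)
    chars = defaultdict(int)
    for age, ref, hyp in utterances:
        band = age_band(age)
        errors[band] += edit_distance(ref, hyp)
        chars[band] += len(ref)
    # Skip bands with no reference characters to avoid division by zero.
    return {b: errors[b] / chars[b] for b in chars if chars[b] > 0}


if __name__ == "__main__":
    # Made-up example utterances, not drawn from the corpus.
    sample = [
        (6, "你好吗", "你好"),
        (35, "今天天气很好", "今天天气很好"),
        (65, "我想听音乐", "我想听音月"),
    ]
    for band, cer in cer_by_age_band(sample).items():
        print(f"{band}: CER = {cer:.3f}")
```

Comparing the per-band CER values shows whether a model degrades for children or elderly speakers relative to adults. The same grouping idea could be applied to other metadata, such as recording device or far-field versus near-field channel.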
1) Spontaneous conversational data that contains varied responses and matches real-life scenarios;
2) Annotation norms that can be customized to meet actual needs;
3) Strict quality management system, ensuring a continuous output of high-quality data products.
This corpus is one part of our full database collection. Magic Data Technology owns more than 100,000 hours of datasets with self-owned copyright, which can be used to improve model performance rapidly. If you are interested in our datasets or our data services, don’t hesitate to contact us via firstname.lastname@example.org.