20
Sep
18
Magic Data Chinese Mandarin Conversational Speech Selected for the LDC Catalog (No. LDC2019S23)

We are honored to announce that our Chinese Mandarin Conversational Speech corpus has been selected for the LDC Catalog! The catalog number is LDC2019S23 (you can find it at https://catalog.ldc.upenn.edu/LDC2019S23).

New trends for conversational datasets

AI continues to expand into new use cases and new verticals. As leading companies such as Google and Amazon pay more attention to continuous conversation, conversational datasets become more important. Moreover, recognition accuracy on read speech reaches 97-98%, while accuracy on conversational speech is only about 50% (according to the results of the CHiME-5 Challenge). This large gap shows that the challenge in automatic speech recognition (ASR) has entered a new phase.

This is an excellent testing dataset for conversational speech recognition models

This corpus has three main characteristics, one each for data collection, annotation, and application. For data collection, the key word is diversity. The data covers conversations recorded with different accents and transmission channels, with speakers of different ages and genders, and with background noise that matches each scenario.

1) Speakers: 60 speakers from different areas of China, aged 4 to 67.

2) Recording environment: 3 rooms with different reverberation

3) Recording equipment: Android device (9 varieties); iOS device (8 varieties); recorder (2 varieties)

4) Recording channels: single-channel and multi-channel

5) The corpus contains both far-field and near-field speech.

For data annotation, the key word is accuracy. The annotation work complies with strict specifications and documentation and is completed by trained annotators. Our team has formulated a series of tagging rules to meet actual needs. What does this mean in practice? Spontaneous conversation produces overlapping speech, pauses, coughs, and clapping. These sounds are meaningful in some conditions, as they may indicate the speaker's state or mood, and may even hint at what the speaker is thinking. Labeled according to the company's annotation specifications, these sounds can be recognized by AI systems.

For data application, the key word is variety. This corpus is valuable for at least three applications: conversational speech recognition, speaker separation, and robustness testing.

1) Accuracy testing of various speech recognition models. For example, in a typical family scenario, the family members using voice interaction include the elderly, the wife (adult female), the husband (adult male), and the children. These family members have different pronunciation patterns and habits. The age diversity of the corpus can be used to test how well a speech recognition model performs for different age groups.

2) Accuracy testing of speaker separation. Scene recognition based on specific speakers has become an active research topic. The collection includes both single-speaker and multi-speaker recording channels, so this dataset can be used to test the accuracy of speaker separation.

3) Robustness testing of the model. Since far-field and near-field speech are recorded at the same time, the recordings contain different levels of reverberation and background noise. The corpus is valuable for researchers who want to test the robustness of their systems.
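As a rough illustration of the testing ideas above, the sketch below computes character error rate (CER) separately for each metadata group, such as age group or recording distance. It is only a minimal example under stated assumptions: the record fields `ref`, `hyp`, `age_group`, and `field` and the sample sentences are hypothetical and do not reflect the actual metadata format of the corpus.

```python
from collections import defaultdict


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, compared character by character."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_by_group(samples, key):
    """Return {group: CER}, pooling errors and reference characters per group."""
    errors = defaultdict(int)
    chars = defaultdict(int)
    for s in samples:
        group = s[key]
        errors[group] += edit_distance(s["ref"], s["hyp"])
        chars[group] += len(s["ref"])
    return {g: errors[g] / chars[g] for g in errors if chars[g] > 0}


# Hypothetical records: reference transcript, model output, and metadata tags.
samples = [
    {"ref": "今天天气很好", "hyp": "今天天气很好", "age_group": "adult", "field": "near"},
    {"ref": "我想听音乐", "hyp": "我想音乐", "age_group": "child", "field": "far"},
    {"ref": "打开电视", "hyp": "打开电视", "age_group": "elderly", "field": "near"},
    {"ref": "明天见", "hyp": "明天贱", "age_group": "child", "field": "near"},
]

print(cer_by_group(samples, "age_group"))  # accuracy comparison across age groups
print(cer_by_group(samples, "field"))      # robustness comparison: near vs. far field
```

Grouping by a different metadata key (for example device type or room) reuses the same function, which is why the group key is passed as a parameter rather than hard-coded.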

Benefits

1) Spontaneous conversational data that contains varied responses and matches real-life scenarios;

2) Annotation norms that can be customized to meet actual needs;

3) Strict quality management system, ensuring a continuous output of high-quality data products.

This corpus is one part of our full data catalog. Magic Data Technology owns the copyright to more than 100,000 hours of datasets, which can be used to improve model performance rapidly. If you are interested in our datasets or our data services, don’t hesitate to contact us at business@magicdatatech.com.
