22
Apr
18
Open-source MagicData-RAMC: 180-hour Conversational Speech Dataset in Mandarin Released

Spoken language processing technologies, including speech and speaker recognition, now rely widely on deep neural networks, and demand for speech training data keeps growing. Over the past few years, developers in the speech community have freely released many high-quality speech corpora that vary in size, accent, speaking style, and recording environment in an attempt to cover a wide range of speech scenarios. Among these scenarios, demand for spontaneous conversational speech recognition is also increasing. Disfluencies in natural conversation, such as colloquial expressions, hesitations, and repetitions, together with non-speech events, pose a great challenge to speech recognition. At the same time, few of the open-source speech corpora contain training data that matches spontaneous conversational tasks, which has slowed progress in speech recognition research for dialog scenarios.

To further enrich the open-source speech corpora available and promote the development of spoken language processing technology, Magic Data Technology Co., Ltd. is releasing a 180-hour spontaneous conversational speech dataset in Mandarin to the speech community for free. MagicData-RAMC is a collection of high-quality, richly annotated training data that supports research on speech recognition and speaker diarization. MagicData-RAMC is currently available for download at

https://magichub.com/datasets/magicdata-ramc/

At the same time, Magic Data Technology Co., Ltd., together with the Institute of Acoustics, Chinese Academy of Sciences, Shanghai Jiao Tong University and Northwestern Polytechnical University, has completed research on speech recognition, speaker diarization and keyword search based on MagicData-RAMC. The work has been submitted to Interspeech 2022, a top conference in the field of speech. A preprint is available on arXiv:

https://arxiv.org/abs/2203.16844

Data Profile

MagicData-RAMC includes 351 sets of multi-turn Mandarin conversations with a total duration of 180 hours. The annotation information of each conversation includes transcribed text, voice activity timestamp, speaker information, recording information, and topic information. The speaker information includes gender, age, and geography, and the recording information includes environment and device.

Data Collection

MagicData-RAMC is collected indoors. The acoustic environments are small rooms under 20 square meters in area, and the reverberation time (RT60) is less than 0.4 seconds. The environments are relatively quiet during recording, with an ambient noise level lower than 40 dB(A). All participants are native and fluent Mandarin Chinese speakers with slight variations of accent.

The audio is recorded with an application developed by Magic Data Technology Co., Ltd. on mainstream smartphones, including Android phones and iPhones. The ratio of Android phones to iPhones is around 1:1. All devices record at a 16 kHz sampling rate with 16-bit samples to guarantee high recording quality.
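The recording format stated above can be checked on a downloaded file. The following is a minimal sketch, assuming the recordings are stored as PCM WAV files; the function names are illustrative and not part of any MagicData tooling. The helper that builds a silent in-memory file is only there so the check can be tried without the dataset.

```python
import io
import wave

EXPECTED_RATE_HZ = 16_000      # 16 kHz sampling rate, as stated in the text
EXPECTED_SAMPLE_WIDTH = 2      # 16-bit samples are 2 bytes wide


def check_wav_format(source):
    """Return (sample_rate, bits_per_sample, matches_spec) for a WAV path or file-like object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
    matches = rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH
    return rate, width * 8, matches


def make_silent_wav(rate, width, seconds=0.1):
    """Build a tiny mono WAV in memory (hypothetical test input, not dataset audio)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (int(rate * seconds) * width))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_wav_format(make_silent_wav(16_000, 2)))
    print(check_wav_format(make_silent_wav(8_000, 2)))
```

Passing a file path instead of the in-memory buffer works the same way, since `wave.open` accepts both.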

All speech data are manually labeled using the platform built by Magic Data Technology Co., Ltd. Sound segments without semantic information during the conversations, including laughter, music, and other noise events, are annotated with specific symbols. Phenomena common in spontaneous communication, such as colloquial expressions, partial words, repetitions, and other speech disfluencies, are recorded and fully transcribed. Punctuation is carefully checked to ensure accuracy. MagicData-RAMC provides precise voice activity timestamps for each speaker, which makes it suitable for research on speaker diarization. The transcriptions are proofread by professional inspectors to ensure the labeling and segmentation quality.
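Per-speaker voice activity timestamps make simple conversation statistics easy to derive. The sketch below sums speaking time per speaker. It assumes the annotations have already been parsed into `(start_sec, end_sec, speaker_id)` tuples; that tuple layout and the speaker labels are hypothetical and do not describe the dataset's actual file format.

```python
from collections import defaultdict


def speaking_time_per_speaker(segments):
    """Sum segment durations in seconds for each speaker.

    `segments` is an iterable of (start_sec, end_sec, speaker_id) tuples.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(float)
    for start, end, speaker in segments:
        if end < start:
            raise ValueError(f"segment ends before it starts: {start}..{end}")
        totals[speaker] += end - start
    return dict(totals)


if __name__ == "__main__":
    # Made-up segments; overlapping speech counts toward both speakers.
    example = [
        (0.0, 2.5, "A"),
        (2.0, 4.0, "B"),
        (4.5, 6.0, "A"),
    ]
    print(speaking_time_per_speaker(example))
```

Overlapping speech is counted for each speaker separately here, which is usually what a diarization reference needs.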

Data Distributions

To reflect real-world conversation scenarios as closely as possible, the collection process for MagicData-RAMC ensured a balanced gender and geographic distribution as well as a diversity of topics. There are 663 speakers in total in MagicData-RAMC, including 368 males and 295 females, with 334 from the north and 329 from the south. The pie charts of gender, geography and province distribution are shown in Figure 1, Figure 2 and Figure 3.
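The speaker counts above can be turned into the shares that the pie charts display. This is a small worked calculation using only the numbers quoted in the text; the helper name is illustrative.

```python
TOTAL_SPEAKERS = 663
GENDER_COUNTS = {"male": 368, "female": 295}
REGION_COUNTS = {"north": 334, "south": 329}


def share_percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


if __name__ == "__main__":
    # Both breakdowns should add up to the stated total of 663 speakers.
    assert sum(GENDER_COUNTS.values()) == TOTAL_SPEAKERS
    assert sum(REGION_COUNTS.values()) == TOTAL_SPEAKERS
    for group in (GENDER_COUNTS, REGION_COUNTS):
        for label, count in group.items():
            print(f"{label}: {share_percent(count, TOTAL_SPEAKERS):.1f}%")
```

The region split comes out close to even, while the gender split leans somewhat toward male speakers.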

Figure 1 Gender Distribution Chart in MagicData-RAMC

Figure 2 Region Distribution Chart in MagicData-RAMC

Figure 3 Province Distribution Chart in MagicData-RAMC

During the multi-turn conversations in the dataset, people respond flexibly to each other and continue the dialog with relevant questions and topics instead of rigidly replying and waiting for the next question. As a result, every sample is a coherent and compact conversation centered around one theme, with the history utterances and the current utterance closely related. Higher-level information is maintained by the contextual dialog flow across multiple sentences. This makes MagicData-RAMC suitable for exploring speech processing techniques in dialog scenarios[6].

Baseline

A research team led by the Institute of Acoustics, Chinese Academy of Sciences has completed research on speech recognition, keyword search and speaker diarization based on the MagicData-RAMC dataset. The official split divides MagicData-RAMC into a 150-hour training set, a 10-hour development set, and a 20-hour test set. The baseline system is briefly described below.

For the automatic speech recognition task, they used the ESPnet[1] toolkit to train a Conformer[2] model. The training data includes 755 hours of MagicData-READ and 150 hours of MagicData-RAMC. MagicData-READ is available for download at

http://www.openslr.org/68

They achieved character error rates (CER) of 16.5% and 19.1% on the development set and test set, respectively.
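For readers unfamiliar with the metric, character error rate is the character-level edit distance between the reference and the recognized text, divided by the reference length. The following is a minimal sketch of that definition. It does not reproduce the scoring setup behind the numbers above, which this post does not describe (for example, how punctuation or noise symbols were handled).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance over characters / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Made-up example: one character is missing from the hypothesis.
    print(character_error_rate("今天天气很好", "今天天气好"))
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.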

For the keyword search task, they searched for 200 keywords, which are provided with MagicData-RAMC, using the Conformer model and a dynamic time alignment algorithm[3]. The precision and recall rates are 86.98% and 89.57% on the development set, and 85.87% and 88.79% on the test set.
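Precision and recall for keyword search follow the usual detection definitions: precision is the fraction of reported hits that are correct, and recall is the fraction of true keyword occurrences that were found. The sketch below computes both from hit counts. The counts in the example are invented for illustration and are not taken from the baseline experiments.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from detection counts.

    true_positives:  reported hits that match a real keyword occurrence
    false_positives: reported hits with no matching occurrence
    false_negatives: real occurrences that were not reported
    """
    reported = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented counts: 8 correct hits, 2 false alarms, 2 misses.
    precision, recall = precision_recall(8, 2, 2)
    print(f"precision={precision:.2%} recall={recall:.2%}")
```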

For the speaker diarization task, they used the Kaldi[4] toolkit to build a speaker diarization system consisting of speaker activity detection, a speaker embedding extractor and Bayesian HMM clustering[5]. They achieved diarization error rates (DER) of 5.57% and 7.96% (collar 0.25) on the development set and test set, respectively.
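Diarization error rate is conventionally the sum of missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. A collar of 0.25 seconds means that a short window around each reference segment boundary is left out of scoring. The sketch below shows only the final ratio, assuming the three error durations have already been measured by a scoring tool; the post does not say which tool was used, and the durations in the example are invented.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in seconds, measured after any collar
    regions have been excluded (that step is assumed to be done already).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


if __name__ == "__main__":
    # Invented durations for illustration only.
    der = diarization_error_rate(missed=3.0, false_alarm=2.0,
                                 confusion=5.0, total_speech=100.0)
    print(f"DER = {der:.2%}")
```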

Leaderboard

Magic Data Technology Co., Ltd., together with the Institute of Acoustics, Chinese Academy of Sciences and Jiangsu Normal University, held the Magic Data ASR-SD Challenge in 7-10, 2022. To help participants develop and train high-quality models quickly, the organizers provided basic scripts and baseline models. There were two tracks: automatic speech recognition (ASR) accuracy in dialog scenarios and speaker diarization (SD) accuracy in dialog scenarios. The test set used in the competition has now been released in MagicData-RAMC. Here are the competition results for the top six teams.

Table 1 Leaderboard of ASR Track in Magic Data ASR-SD Challenge

The leaderboard of the SD track is temporarily empty because the test set published in the Magic Data ASR-SD Challenge is different from that of MagicData-RAMC. We look forward to more developers joining the leaderboard. Please check the following website for details and feel free to contact us.

A follow-up challenge evaluating on MagicData-RAMC will be launched soon. Stay updated through the MagicHub open-source community.

References

[1] Watanabe S, Hori T, Karita S, et al. Espnet: End-to-end speech processing toolkit[J]. arXiv preprint arXiv:1804.00015, 2018.

[2] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint arXiv:2005.08100, 2020.

[3] Yang R, Cheng G, Miao H, et al. Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3202-3215.

[4] Povey D, Ghoshal A, Boulianne G, et al. The Kaldi speech recognition toolkit[C]//IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.

[5] Landini F, Profant J, Diez M, et al. Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks[J]. Computer Speech & Language, 2022, 71: 101254.

[6] Deng K, Cheng G, Miao H, et al. History Utterance Embedding Transformer LM for Speech Recognition[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 5914-5918.
