
Open-source MagicData-RAMC: 180-hour Conversational Speech Dataset in Mandarin Released

Date: 2022-04-18

Spoken language processing technologies, including speech and speaker recognition, now rely heavily on deep neural networks, and the demand for speech training data keeps growing. Over the past few years, developers in the speech community have freely released many high-quality speech corpora covering a range of data sizes, accents, speaking styles, and recording environments in an attempt to span a wide variety of speech scenarios. Among these scenarios, demand for spontaneous conversational speech recognition is also rising, and disfluencies such as colloquial expressions, hesitations, repetitions, and non-speech events in natural conversation pose a great challenge to speech recognition. At the same time, few of the many open-source speech corpora provide training data that matches spontaneous conversational tasks, which has slowed progress on speech recognition in dialog scenarios.

To further enrich the open-source speech corpus landscape and promote the development of spoken language processing technology, Magic Data Technology Co., Ltd. is releasing a 180-hour spontaneous conversational Mandarin speech dataset to the speech community for free. MagicData-RAMC is a collection of high-quality, richly annotated training data that can support research on speech recognition and speaker diarization. MagicData-RAMC is available for download at

https://magichub.com/datasets/magicdata-ramc/

In parallel, Magic Data Technology Co., Ltd., together with the Institute of Acoustics of the Chinese Academy of Sciences, Shanghai Jiao Tong University, and Northwestern Polytechnical University, has completed research on speech recognition, speaker diarization, and keyword search based on MagicData-RAMC, which has been submitted to Interspeech 2022, a top conference in the speech field. A preprint is available on arXiv:

https://arxiv.org/abs/2203.16844

Data Profile

MagicData-RAMC comprises 351 multi-turn Mandarin conversations with a total duration of 180 hours. The annotation of each conversation includes the transcribed text, voice activity timestamps, speaker information, recording information, and topic information. Speaker information covers gender, age, and region; recording information covers environment and device.
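For a sense of how such per-utterance annotations might be consumed in code, here is a minimal sketch. The field names, file layout, and speaker-ID format below are assumptions for illustration, not the official MagicData-RAMC schema:

```python
# Hypothetical loader for one per-utterance annotation record.
# The tab-separated layout (start, end, speaker, text) is an assumed
# format for illustration, not the official MagicData-RAMC schema.
from dataclasses import dataclass

@dataclass
class Utterance:
    start: float   # voice-activity start time, in seconds
    end: float     # voice-activity end time, in seconds
    speaker: str   # speaker ID (hypothetical format)
    text: str      # verbatim transcription

def parse_line(line: str) -> Utterance:
    """Parse one tab-separated annotation line into an Utterance."""
    start, end, speaker, text = line.rstrip("\n").split("\t", 3)
    return Utterance(float(start), float(end), speaker, text)

utt = parse_line("12.34\t15.02\tG0001\t今天天气怎么样")
print(utt.speaker, round(utt.end - utt.start, 2))  # speaker ID and segment duration
```

Keeping start/end timestamps alongside the speaker ID is what makes the same record usable for both ASR (text targets) and diarization (who spoke when).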

Data Collection

MagicData-RAMC is collected indoors. The acoustic environments are small rooms under 20 square meters in area, and the reverberation time (RT60) is less than 0.4 seconds. The environments are relatively quiet during recording, with an ambient noise level lower than 40 dB(A). All participants are native and fluent Mandarin Chinese speakers with slight variations of accent.

The audio is recorded via an application developed by Magic Data Technology Co., Ltd. on mainstream smartphones, including Android phones and iPhones at a ratio of roughly 1:1. All recording devices work at 16 kHz, 16-bit to guarantee high recording quality.

All speech data are manually labeled on a platform built by Magic Data Technology Co., Ltd. Sound segments without semantic content, including laughter, music, and other noise events, are annotated with specific symbols. Phenomena common in spontaneous communication, such as colloquial expressions, partial words, repetitions, and other speech disfluencies, are fully transcribed. Punctuation is carefully checked for accuracy. MagicData-RAMC provides precise voice activity timestamps for each speaker, which makes it suitable for research on speaker diarization. The transcriptions are proofread by professional inspectors to ensure labeling and segmentation quality.
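Because non-speech events are marked with dedicated symbols, a common preprocessing step before ASR training is to strip those markers from the text. The bracketed tag names below are hypothetical; the actual symbol set is defined by the MagicData-RAMC annotation guidelines:

```python
import re

# Hypothetical non-speech markers; the real symbol set used in the
# MagicData-RAMC transcriptions may differ.
NOISE_TAGS = re.compile(r"\[(LAUGHTER|MUSIC|NOISE)\]")

def strip_noise(text: str) -> str:
    """Remove non-speech event tags and collapse leftover whitespace."""
    return re.sub(r"\s+", " ", NOISE_TAGS.sub(" ", text)).strip()

print(strip_noise("好的 [LAUGHTER] 我们继续"))  # -> 好的 我们继续
```

Whether to drop, keep, or map such tags to a special token depends on the downstream model; diarization pipelines, by contrast, typically keep them as non-speech regions.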

Data Distributions

To reflect real-world conversation scenarios as closely as possible, the collection process for MagicData-RAMC ensured balanced gender and geographic distributions as well as a diversity of topics. MagicData-RAMC contains 663 speakers in total: 368 male and 295 female, 334 from northern China and 329 from southern China. The gender, region, and province distributions are shown in Figure 1, Figure 2, and Figure 3.

Figure 1 Gender Distribution Chart in MagicData-RAMC

Figure 2 Region Distribution Chart in MagicData-RAMC

Figure 3 Province Distribution Chart in MagicData-RAMC

In the multi-turn conversations in the dataset, participants respond flexibly to each other and carry the dialog forward with relevant questions and remarks rather than rigidly replying and waiting for the next question. Every sample is therefore a coherent, compact conversation centered on one theme, in which the history utterances and the current utterance are closely related. Higher-level information is maintained by the contextual dialog flow across multiple sentences, making MagicData-RAMC well suited for exploring speech processing techniques in dialog scenarios[6].

Baseline

A research team led by the Institute of Acoustics of the Chinese Academy of Sciences has completed research on speech recognition, keyword search, and speaker diarization based on the MagicData-RAMC dataset. Officially, MagicData-RAMC is divided into a 150-hour training set, a 10-hour development set, and a 20-hour test set. The baseline systems are briefly described below.

For the automatic speech recognition task, they used the ESPnet[1] toolkit to train a Conformer[2] model. The training data comprises 755 hours of MagicData-READ and the 150-hour MagicData-RAMC training set. MagicData-READ is available for download at

http://www.openslr.org/68

They achieved character error rates (CER) of 16.5% and 19.1% on the development set and test set, respectively.
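For reference, CER is the character-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A minimal sketch of the metric (not the ESPnet scoring script itself):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length.

    Counts substitutions, deletions, and insertions at the character
    level, which suits Mandarin where each character is a token.
    """
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # edit distances for the empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            cur[j] = min(substitution, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[n] / m if m else 0.0

print(cer("今天天气很好", "今天天很好"))  # one deletion out of six characters
```

Corpus-level CER is computed over the pooled edit distance and reference length of all utterances, not by averaging per-utterance rates.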

For the keyword search task, they retrieved the 200 keywords provided with MagicData-RAMC using the Conformer model and a dynamic time alignment algorithm[3]. Precision and recall are 86.98% and 89.57% on the development set, and 85.87% and 88.79% on the test set.
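Precision here is the fraction of hypothesized detections that match a true keyword occurrence, and recall the fraction of true occurrences found. A simplified matching sketch, with a time tolerance that is an assumption for illustration rather than the official challenge scoring:

```python
def score_keyword_hits(refs, hyps, tol=0.5):
    """Greedily match hypothesized detections against reference occurrences.

    refs, hyps: lists of (keyword, time_in_seconds). A hypothesis counts
    as a hit if the same keyword occurs in the reference within `tol`
    seconds and has not already been matched. This is a simplified
    sketch, not the official MagicData-RAMC challenge metric.
    """
    unmatched = list(refs)
    hits = 0
    for kw, t in hyps:
        for i, (ref_kw, ref_t) in enumerate(unmatched):
            if kw == ref_kw and abs(t - ref_t) <= tol:
                hits += 1
                del unmatched[i]  # each reference occurrence matches once
                break
    precision = hits / len(hyps) if hyps else 0.0
    recall = hits / len(refs) if refs else 0.0
    return precision, recall
```

Removing each matched reference occurrence prevents one detection from being credited against several true occurrences, which would inflate recall.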

For the speaker diarization task, they used the Kaldi[4] toolkit to build a diarization system comprising speech activity detection, a speaker embedding extractor, and Bayesian HMM clustering[5]. They achieved diarization error rates (DER) of 5.57% and 7.96% (collar 0.25) on the development set and test set, respectively.
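DER sums missed speech, false alarms, and speaker confusion over the reference speech time; the "collar 0.25" means frames within 0.25 s of a reference segment boundary are excluded from scoring. A heavily simplified frame-based sketch (assuming hypothesis speaker labels are already mapped to reference labels and there is no overlapped speech, unlike official scorers such as NIST md-eval):

```python
def frame_der(ref, hyp, total, frame=0.01, collar=0.25):
    """Simplified frame-based diarization error rate.

    ref, hyp: lists of (start, end, speaker) segments; `total` is the
    recording length in seconds. Frames within `collar` seconds of a
    reference boundary are excluded, mirroring the collar 0.25 setting
    above. Illustrative only: assumes pre-mapped speaker labels and no
    overlapped speech.
    """
    n = int(total / frame)

    def labels(segments):
        lab = [None] * n  # None = non-speech
        for s, e, spk in segments:
            for i in range(int(s / frame), min(int(e / frame), n)):
                lab[i] = spk
        return lab

    rlab, hlab = labels(ref), labels(hyp)
    scored = [True] * n
    for s, e, _ in ref:  # mask frames near reference segment boundaries
        for b in (s, e):
            for i in range(max(0, int((b - collar) / frame)),
                           min(n, int((b + collar) / frame))):
                scored[i] = False
    errors = ref_speech = 0
    for i in range(n):
        if not scored[i]:
            continue
        if rlab[i] is not None:
            ref_speech += 1
        if rlab[i] != hlab[i]:  # miss, false alarm, or speaker confusion
            errors += 1
    return errors / ref_speech if ref_speech else 0.0
```

Real scorers additionally find the optimal reference-to-hypothesis speaker mapping and handle overlapped speech, which this sketch deliberately omits.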

Leaderboard

Magic Data Technology Co., Ltd., together with the Institute of Acoustics of the Chinese Academy of Sciences and Jiangsu Normal University, held the Magic Data ASR-SD Challenge in 7-10, 2022. To help participants complete model development and training quickly and with high quality, the organizers provided basic scripts and baseline models. There are two tracks: automatic speech recognition (ASR) accuracy in dialog scenarios and speaker diarization (SD) accuracy in dialog scenarios. The test set used in the competition has now been released as part of MagicData-RAMC. The competition results for the top six teams are shown below.

Table 1 Leaderboard of ASR Track in Magic Data ASR-SD Challenge

The leaderboard of the SD track is temporarily empty, since the test set published in the Magic Data ASR-SD Challenge differs from that of MagicData-RAMC. We look forward to more developers hitting the leaderboard. Please check the following website for details and feel free to contact us.

A follow-up challenge evaluated on MagicData-RAMC will be launched soon. Stay updated via the MagicHub open-source community.

References

[1] Watanabe S, Hori T, Karita S, et al. Espnet: End-to-end speech processing toolkit[J]. arXiv preprint arXiv:1804.00015, 2018.

[2] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint arXiv:2005.08100, 2020.

[3] Yang R, Cheng G, Miao H, et al. Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3202-3215.

[4] Povey D, Ghoshal A, Boulianne G, et al. The Kaldi speech recognition toolkit[C]//IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011 (CONF).

[5] Landini F, Profant J, Diez M, et al. Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks[J]. Computer Speech & Language, 2022, 71: 101254.

[6] Deng K, Cheng G, Miao H, et al. History Utterance Embedding Transformer LM for Speech Recognition[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 5914-5918.
