In the wave of artificial intelligence advancements, speech dialogue has become a pivotal element of human-computer interaction, gradually serving as a bridge between the digital world and real life. With the emergence of advanced language models such as GPT-4o and Gemini Live, speech dialogue technology is experiencing unprecedented growth opportunities. These models not only exhibit robust capabilities in language understanding and generation, but have also achieved notable advancements in cross-modal integration, multi-domain adaptability, and real-time responsiveness, offering substantial technical support for the development of more intelligent and efficient multilingual speech dialogue systems.
The 2024 Interspeech conference, as the leading international event in the field of speech processing, once again spotlighted the latest advancements and emerging trends in dialogue system technologies. Data from the conference indicates a notable increase in the number of papers focused on speech recognition, dialogue systems, and speech interaction, with research on speech dialogue becoming a central area of interest for attendees. This reflects the growing industry attention to these emerging technological directions and suggests that future speech technologies will increasingly emphasize natural interaction, multilingual capabilities, and the handling of complex dialogue scenarios. Additionally, advancements in automatic speech recognition (ASR) and speech synthesis (TTS), as core components of speech technology, are closely linked to the overall performance and user experience of dialogue systems.
However, the development of multilingual speech dialogue technology remains challenging. It requires systems not only to accurately perform speech recognition and translation across different languages but also to comprehend contextual nuances within conversations, while accounting for cultural variations and linguistic conventions. Although many research institutions and enterprises have made preliminary progress in this area, the technology is still in a phase of ongoing exploration and refinement. Significant technical hurdles persist, particularly in handling complex multilingual environments, improving recognition accuracy and depth of comprehension, and enhancing the fluency of dialogue interactions.
To advance the development of speech dialogue technology and enhance user interaction through greater convenience, naturalness, and intelligence, we have initiated the "Large-scale Multilingual Speech Recognition Challenge." This competition focuses on three languages: Chinese, Japanese, and Korean. Our goal is to collaborate with industry, academia, and research institutions to explore the latest trends and directions in multilingual speech dialogue technology, fostering innovation and driving technological advancements in this field.
Participants are free to choose their basic training data from various open or proprietary datasets; the organizers recommend the following open-source datasets. There are no restrictions on the large language models participants may use (such as Llama or Qwen), but the acoustic model must be fine-tuned from Whisper v3 using part or all of the data provided below (a fine-tuning sketch follows the dataset list).
· Common Voice:
Common Voice is a global open-source project initiated by the Mozilla Foundation to collect speech data across
multiple languages. It has created a substantial public-domain dataset, including 236 hours of Chinese, 126
hours of Japanese, and 1 hour of Korean.
Data link: https://commonvoice.mozilla.org/en/datasets
· WenetSpeech:
This dataset is a corpus comprising over 10,000 hours of richly annotated, cross-domain Chinese speech data sourced from YouTube and podcasts. Optical character recognition (OCR) and automatic speech recognition (ASR) were used to generate the text annotations for the YouTube videos and podcast recordings, respectively. To further enhance the dataset's quality, a novel end-to-end label error detection method was employed to validate and filter the data.
Data link: https://wenet.org.cn/WenetSpeech/
· reazonspeech-all-v2:
This dataset contains a wide range of natural Japanese speech collected from terrestrial TV streams. It includes
over 35,000 hours of audio. The audio files are provided in FLAC format with a sampling rate of 16kHz, and each
file is accompanied by a transcription.
Data link: https://huggingface.co/datasets/reazon-research/reazonspeech
· MSR-86K:
MSR-86K is a large, continually expanding multilingual corpus designed for speech recognition research. It is
sourced from public YouTube videos and contains 86,300 hours of transcribed ASR data across 15 languages. For
this competition, we have selected the Korean-language portion of the corpus as the training dataset, totaling
10,338.66 hours. The MSR-86K creators do not hold the copyright to the audio; the copyright remains with the original content owners, and public URLs to the original videos or audio are provided.
Data link: https://huggingface.co/datasets/Alex-Song/MSR-86K
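As a reference point for the fine-tuning requirement above, the sketch below shows one way the Whisper v3 acoustic model could be adapted with LoRA on a slice of one of the recommended corpora. It assumes the Hugging Face transformers, datasets, and peft libraries; the checkpoint name, dataset configuration, and hyperparameters are illustrative assumptions rather than competition requirements.

# Minimal LoRA fine-tuning sketch for the Whisper v3 acoustic model.
# Checkpoint name, dataset configuration, and hyperparameters are assumptions.
from datasets import Audio, load_dataset
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Attach LoRA adapters so that only a small fraction of the parameters
# is updated during fine-tuning.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Load a small slice of Common Voice Chinese as example training data
# (the repository and config names are assumptions; check the dataset card).
cv = load_dataset("mozilla-foundation/common_voice_17_0", "zh-CN", split="train[:100]")
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

def preprocess(batch):
    """Convert raw audio to log-Mel input features and text to label ids."""
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=16_000, return_tensors="pt"
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv = cv.map(preprocess, remove_columns=cv.column_names)
# From here, a standard Seq2SeqTrainer (or a manual training loop) can be used
# to update only the LoRA parameters on the full training mixture.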
In this competition, the development and test sets required by participants will be created and provided by Magic Data.
· Natural Multi-turn Conversational Speech Dataset: This dataset covers three languages—Chinese, Japanese, and Korean—each with 20 hours of data, equally divided into 10-hour development and test sets. The dataset features natural dialogue style with diverse content across multiple domains, including multiple speakers, various accents, and a range of noisy environments. The natural dialogue expressions accurately reflect real-world language usage, enabling models to adapt to different speaking habits and accents, thus improving recognition performance in complex speech environments and enhancing robustness and generalizability.
Magic Data is committed to designing and producing high-quality conversational datasets. For additional high-quality conversational datasets in other languages, please visit the official website at https://www.magicdatatech.com.
September 27, 2024 : Registration opens
October 18, 2024 : Development set release https://magicdata.blob.core.chinacloudapi.cn/opensource/LaMuCo%202024%20Development%20set.zip?sv=2020-04-08&st=2024-10-21T01%3A54%3A46Z&se=2024-11-30T01%3A54%3A00Z&sr=b&sp=r&sig=kw2LjCMJ5lL7qetyDFZIx53B3QGmCwrG6F1kDi01trU%3D
November 1, 2024 : Registration deadline
November 8, 2024 : Test set release
November 20, 2024 : Submission deadline for system description, model, and code
November 30, 2024 : Results announcement
This challenge aims to optimize Automatic Speech Recognition (ASR) results using large language models (LLMs). Participants can employ two methods:
· Extract embeddings from the ASR model and input them directly into LLMs to achieve high-accuracy recognition results.
· Use LLMs to post-process the original N-best hypotheses output from the ASR model to improve recognition results.
In both methods, participants can use techniques like Low-Rank Adaptation (LoRA) to fine-tune the LLMs, effectively utilizing data without updating the entire model.
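To make the first approach concrete, the sketch below projects Whisper encoder states into the embedding space of a causal LLM so that the LLM can decode the transcript directly from speech features. The bridge module, the Qwen checkpoint, and the way the speech features are prepended to the prompt are assumptions made for illustration; practical systems typically also downsample the speech features and attach LoRA adapters to the LLM.

# Illustrative sketch of feeding ASR embeddings directly into an LLM.
# Model names, the linear bridge, and the splicing scheme are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, WhisperModel

whisper = WhisperModel.from_pretrained("openai/whisper-large-v3")
llm_name = "Qwen/Qwen2.5-7B-Instruct"  # example choice, not a requirement
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name, torch_dtype=torch.float16)

# Linear bridge from the Whisper encoder width to the LLM hidden size.
bridge = nn.Linear(whisper.config.d_model, llm.config.hidden_size)

def speech_to_llm_inputs(input_features, prompt_ids):
    """Encode audio with Whisper, project it, and prepend it to the prompt embeddings."""
    with torch.no_grad():
        speech = whisper.encoder(input_features).last_hidden_state  # (B, T, d_model)
    speech_embeds = bridge(speech).to(llm.dtype)                    # (B, T, H)
    text_embeds = llm.get_input_embeddings()(prompt_ids)            # (B, L, H)
    return torch.cat([speech_embeds, text_embeds], dim=1)

# The concatenated embeddings are passed to the LLM via llm(inputs_embeds=...),
# and the bridge (plus LoRA adapters on the LLM) is trained to predict the
# reference transcript.

For the second approach, the sketch below prompts an instruction-tuned LLM to repair the N-best hypotheses emitted by the ASR decoder; the model name, the prompt wording, and the example hypotheses are illustrative assumptions.

# Illustrative N-best post-processing with an instruction-tuned LLM.
# The checkpoint, prompt, and sample hypotheses are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

llm_name = "Qwen/Qwen2.5-7B-Instruct"  # example choice, not a requirement
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name, torch_dtype=torch.float16,
                                           device_map="auto")

def correct_nbest(nbest):
    """Ask the LLM to merge/repair the N-best list into a single transcript."""
    hypotheses = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    messages = [
        {"role": "system", "content": "You are an ASR error-correction assistant."},
        {"role": "user", "content": "Here are N-best ASR hypotheses for one utterance:\n"
                                    f"{hypotheses}\nReturn only the corrected transcript."},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(llm.device)
    output = llm.generate(input_ids, max_new_tokens=128, do_sample=False)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Example call with hypothetical hypotheses from the Whisper decoder:
print(correct_nbest(["今天天气怎么样", "今天天汽怎么样", "今天天气怎么样呢"]))

In either setting, LoRA adapters can be attached to the LLM with the peft library in the same way as shown in the Whisper fine-tuning sketch above, so that only a small fraction of the parameters needs to be updated.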
The competition is open to the public, welcoming individuals from universities, research institutions, and other sectors. Participants must complete the following Google form by November 1:
Notice: As this form is created using Google Forms, please ensure you can access www.google.com.
We encourage participation from individuals and teams in both academia and industry, with each team limited to a maximum of 5 members.
If you encounter any issues during the registration process, please contact lisheng.cs@gmail.com for assistance.
Three winning teams or individuals will be selected in this competition and will be granted award certificates.
Participants must submit a system description, model, and code for verification.
Scoring: Overall performance is evaluated using the Character Error Rate (CER).
CER%: The Character Error Rate measures transcription errors at the character level, computed as the number of substituted, deleted, and inserted characters divided by the number of characters in the reference. It is particularly important for languages such as Chinese, where word boundaries are not marked by spaces.
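As a concrete illustration of the metric, the snippet below computes CER with the open-source jiwer library (an assumption for illustration; it is not necessarily the organizers' official scoring script).

# Illustrative CER computation with jiwer (assumed library, not the official
# scoring tool). CER = (substitutions + deletions + insertions) / reference length.
from jiwer import cer

reference = "今天天气怎么样"   # ground-truth transcript (7 characters)
hypothesis = "今天天汽怎么样"  # ASR output with one substituted character

print(f"CER: {cer(reference, hypothesis):.2%}")  # 1 error / 7 characters ≈ 14.29%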
All participants must adhere to the following rules:
· The datasets provided during the competition are for research purposes only, both during and after the event, and may not be used for commercial purposes.
· The organizers retain final authority over the interpretation of these rules and reserve the right to modify them; in special circumstances, the organizers will coordinate and provide the final interpretation.
M3Oriental: Workshop on Multimodal, Multilingual and Multitask Modeling Technologies for Oriental Languages
M3Oriental Members:
Qingqing Zhang, Beijing Magic Data Technology Co., Ltd
Sheng Li, National Institute of Information and Communications Technology (NICT)
Raj Dabre, National Institute of Information and Communications Technology (NICT)
Jiyi Li, University of Yamanashi
Chenhui Chu, Kyoto University
Bei Liu, Microsoft Research Asia
Zhao Ren, University of Bremen
Ruili Wang, Massey University
Zuchao Li, Wuhan University
Email: For inquiries related to the competition, please send an email to lisheng.cs@gmail.com
Scan the QR code to join the official WeChat group: