
Conversational Voice Clone Challenge (CoVoC)

ISCSLP2024 Grand Challenge

Call for Participation

Text-to-speech (TTS) aims to produce speech that sounds as natural and human-like as possible. Recent advancements in neural speech synthesis have significantly enhanced the quality and naturalness of generated speech, leading to widespread applications of TTS systems in real-world scenarios. A notable breakthrough in the field is zero-shot TTS, driven by expanded datasets and new approaches (e.g., decoder-only paradigms), which has attracted extensive attention from academia and industry. However, these advancements have not been sufficiently investigated in spontaneous and conversational contexts. The primary challenge lies in effectively managing prosodic detail in the generated speech, owing to the diverse and intricate spontaneous behaviors that differentiate spontaneous speech from read speech.

Large-scale TTS systems, with their in-context learning ability, have the potential to yield promising outcomes in the mentioned scenarios. However, a prevalent challenge in the field of large-scale zero-shot TTS is the lack of consistency in training and testing datasets, along with a standardized evaluation benchmark. This issue hinders direct comparisons and makes it challenging to assess the performance of various systems accurately.

To promote the development of expressive spontaneous-style speech synthesis in the zero-shot scenario, we are launching the Conversational Voice Clone Challenge (CoVoC). The challenge provides a range of diverse training datasets, including the 12,800-hour WenetSpeech4TTS corpus, 180 hours of Mandarin spontaneous conversational speech, and 100 hours of high-quality spoken conversations. Furthermore, a standardized testing dataset, accompanied by carefully designed texts, will be made available. The goal of providing these sizable and standardized datasets is to establish a comprehensive benchmark.

Data

Organizers will provide participants with four audio/text datasets at different stages. All audio data will be in mono-channel WAV format, with a 16 kHz sampling rate and 16-bit encoding. The datasets include rich transcripts and are in Mandarin.
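As a quick sanity check before submitting synthesized audio, the required format (mono, 16 kHz, 16-bit WAV) can be verified with Python's standard-library `wave` module. This is a minimal illustrative sketch, not part of the official challenge tooling; the helper name is our own:

```python
import wave

def matches_challenge_format(path: str) -> bool:
    """Return True if a WAV file is mono, 16 kHz, 16-bit,
    i.e. the audio format specified for all challenge data."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1          # mono
                and w.getframerate() == 16000  # 16 kHz sampling rate
                and w.getsampwidth() == 2)     # 16-bit = 2 bytes per sample
```

Files that fail this check (e.g. stereo or 44.1 kHz recordings) would need to be downmixed and resampled before use.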

During the training phase, participants will have access to the large-scale WenetSpeech4TTS dataset, as well as two smaller-scale datasets, namely MAGICDATA Mandarin Chinese Conversational Speech Corpus and HQ-Conversations. Participants can freely utilize these datasets for training or fine-tuning purposes.

· WenetSpeech4TTS: a multi-domain Mandarin corpus derived from the open-source WenetSpeech dataset. Tailored for text-to-speech tasks, WenetSpeech4TTS refines WenetSpeech by adjusting segment boundaries, enhancing audio quality, and eliminating speaker mixing within each segment. After a more accurate transcription process and quality-based data filtering, the resulting corpus contains 12,800 hours of paired audio-text data.
Download: https://huggingface.co/datasets/Wenetspeech4TTS/WenetSpeech4TTS

· MAGICDATA Mandarin Chinese Conversational Speech Corpus: 180 hours of mobile-recorded speech data. 663 speakers from different accent areas of China were invited to participate in the recording, which was conducted in quiet indoor environments. All speech data are manually labeled, and professional inspectors proofread the transcriptions to ensure labeling quality.
Download: https://www.openslr.org/123/

· HQ-Conversations (provided by Magic Data): This dataset comprises 100 hours of data from 200 speakers (75 male, 125 female), covering major regions of China such as Henan, Jiangxi, Guangdong, Shandong, and Hebei. The language is Mandarin Chinese, and the content consists of segmented conversations, stored as TXT text and WAV audio. The conversations closely reflect everyday-life scenarios and are characterized by natural, expressive language rather than a scripted, read-aloud style. The dataset has undergone rigorous screening and verification to ensure high accuracy and consistency.

During the evaluation phase, participants will be required to test the zero-shot performance of their models on the Clone-Speaker dataset, using the target texts from Test-Text.

· Clone-Speaker: For challenge evaluation, we will provide 20 speakers with several seconds of prompt audio each. The test set will be released at the evaluation stage for participants to perform zero-shot speech synthesis. Instructions for obtaining the dataset will be communicated via email.

· Test-Text: We will provide comprehensive test texts, including normal test texts, dialogue texts with spontaneous behaviors, etc. Instructions for obtaining the dataset will be communicated via email.

Track Setting

The CoVoC challenge features two tracks.

Constrained Track: Only the corpora mentioned above are allowed to be used in the training phase. If a pretrained model is utilized in the system, it must be an open-source model, and this information must be clearly indicated in the final submission. Top-ranked teams in this track will be invited to submit papers to the ISCSLP 2024 conference.

Unconstrained Track: Besides the data we provide, other open-source or internal corpora are allowed to be used in the training phase. Participants should clearly describe the data used in the technical report associated with the submission. Teams participating in this track will be required to fill out a form to provide details about their approaches.

Rules

All participants should adhere to the following rules to be eligible for the challenge.

· Datasets released in this challenge may be used by participants, during and after the challenge, for research purposes only. Commercial use is not allowed. The organizers will open-source the data after the challenge, and use of the data must obey the corresponding open-source license.

· Publicly available pretrained models can be used in the Constrained Track, while extra training data can only be used in the Unconstrained Track.

· The organizers reserve the right of final interpretation of these rules. In case of special circumstances, the organizers will coordinate the interpretation.

Evaluation

1 Subjective Evaluation

We will conduct mean opinion score (MOS) tests to assess speech quality, speech naturalness, speaker similarity, and spontaneous style. Professional listening teams will participate in the evaluation. The challenge organizers will conduct subjective evaluations according to the following criteria:

1) Speech naturalness: In each session, listeners will listen to each sample and select a score from 1 [Completely unnatural] to 5 [Completely natural].

2) Speech quality: In each session, listeners will listen to one sample and choose a score representing the overall quality of the speech on a scale from 1 [Bad] to 5 [Excellent].

3) Speaker similarity: In each session, listeners will listen to two reference samples from the original speaker and one synthetic sample. They will then choose a response representing how similar the speaker identity of the synthetic speech sounds to the voice in the reference samples, on a scale from 1 [Sounds like a different voice] to 5 [Sounds like the same voice].

4) Speech spontaneous style: In this test, texts with spontaneous behaviors may be used. In each session, listeners will listen to one sample and choose a score on a scale from 1 [Spontaneous behavior is rendered poorly] to 5 [Spontaneous behavior is rendered well].

2 Objective Evaluation

1) Character Error Rate (CER): The CER is measured between the transcripts of the real and synthesized utterances.

2) Speaker Embedding Cosine Similarity (SECS): The SECS metric is computed by extracting speaker embeddings from the reference and synthesized speech and calculating their cosine similarity.

Note that the objective evaluation will be conducted, and its results released, for all teams with submissions. The subjective evaluation will only be conducted on the top N submissions with the highest objective scores.
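For reference, the two objective metrics can be sketched as follows. The organizers' exact ASR model and speaker encoder are not specified here, so this is only an illustrative implementation under generic assumptions: CER is the character-level edit distance normalized by the reference length, and SECS is the cosine similarity between two speaker-embedding vectors, however those embeddings are obtained:

```python
import numpy as np

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein (edit) distance between the
    reference and hypothesis character sequences, divided by the
    reference length."""
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit-distance table.
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[len(r), len(h)] / max(len(r), 1)

def secs(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
```

In practice the hypothesis string would come from an ASR transcription of the synthesized audio, and the embeddings from a pretrained speaker-verification model applied to the prompt and synthesized waveforms.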

Timeline

June 3, 2024 : HQ-Conversations data release

June 10, 2024 : Baseline system release

June 30, 2024 : Evaluation stage starts; Clone-Speaker and Test-Text data release; deadline for challenge registration

July 2, 2024 : Evaluation ends; Test audio and system description submission deadline

July 12, 2024 : Evaluation results release to participants

July 20, 2024 : Deadline for ISCSLP2024 paper submission (only for invited teams)

Guidelines

To register for the ISCSLP CoVoC Challenge, potential participants are required to fill out the following Google Form before or by June 30, 2024:

Sign up

Notice: As this form is created using Google Forms, please ensure you have access to www.google.com.

Teams from both academia and industry are welcome.

If you encounter any issues during the registration process, please contact xkx@mail.nwpu.edu.cn for assistance.

Organizers

Lei Xie, Northwestern Polytechnical University

Qingqing Zhang, Magic Data

Shuai Wang, Shenzhen Research Institute of Big Data (SRIBD)

Lei Luo, Magic Data

Peng Sun, China Computer Federation

Minghui Dong, Institute for Infocomm Research (I2R), Singapore

Liumeng Xue, The Chinese University of Hong Kong (Shenzhen)

Jixun Yao, Northwestern Polytechnical University

Dake Guo, Northwestern Polytechnical University

Hanzhao Li, Northwestern Polytechnical University

Kangxiang Xia, Northwestern Polytechnical University


Contact Us

If you have any questions, please contact xkx@mail.nwpu.edu.cn for assistance.

WeChat Group for Challenge Participants :

QR-code
