MagicData R&D Center New Findings: Conversational Datasets Show Better Performance

Date: 2022-08-01

- Compared with open-source speech data, MagicData data with manual annotation and expert proofreading raises word accuracy by 33% under the same model.
- The MagicData speech corpus covers a variety of speaking styles, and its recording environments are closer to real-world scenes, so it can train a more robust speech recognition model.

With artificial intelligence now widely applied across industries, applications that perform well in dialogue scenarios are increasingly in demand. Machine learning, the cornerstone of today's conversational AI, depends on reliable data. With this in mind, MagicData R&D Center ran a set of comparative experiments to explore which types of data yield robust performance when building conversational AI models.

In the experiment, MagicData R&D Center used equal amounts of open-source reading speech data, MagicData reading speech data, and MagicData conversational speech data to train three ASR models on the same baseline. The three models were then each tested in three scenarios: chit chat, human-machine conversation, and real-time captioning.

As shown in Table 1, the models trained on MagicData data outperform the model trained on the open-source speech dataset in all three scenarios: chit chat, human-machine conversation, and real-time captioning. The gain is largest in the chit chat scenario, where word accuracy improves by 33%.

Table 1. Word accuracy (%) of ASR models trained with different data. Source: MagicData R&D Center
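
The article does not say how word accuracy was computed. A common convention, assumed here, is word accuracy = 100 × (1 − WER), where WER is the word-level edit distance between the reference and the recognized transcript, divided by the number of reference words. The sketch below illustrates that convention only; the function name `word_accuracy` is hypothetical and this is not MagicData's evaluation code.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy in percent, assuming the convention 100 * (1 - WER).

    Tokens are whitespace-separated words. For languages written without
    spaces, the transcripts would need to be tokenized first (assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only two rows of the table.
    # prev[j] is the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    wer = prev[len(hyp)] / len(ref)
    return 100.0 * (1.0 - wer)
```

For instance, if the reference is "the cat sat on the mat" and the recognizer drops one word, one error over six reference words gives a word accuracy of about 83.3%. Under this convention the value can fall below zero when the hypothesis contains many inserted words.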

Founded in 2016, MagicData offers more than 400 off-the-shelf licensed datasets, ready to use for training ASR, TTS, and NLP models. The company is dedicated to promoting the development of conversational AI by building the largest collection of conversational training datasets in over 60 languages and dialects, covering customer service, chit chat, HMI, command and query, and many other conversation scenarios. GDPR compliant and ISO/IEC 27701:2019 certified, MagicData off-the-shelf datasets are a cost-effective data solution for your state-of-the-art conversational AI products.

