Data — the Silver Bullet to Code-Switch Speech Recognition

Date: 2022-08-03

With the development of the Internet and globalization, everyday speech is often mixed with words from other languages, for example: "我的IPAD不能下载APP了，可以陪我去APPLE store修理一下吗？" (My iPad can't download apps anymore; can you come with me to the Apple Store to get it repaired?), "明天就是deadline了，我的paper还没有ready。" (Tomorrow is the deadline, and my paper is not ready yet.), and "老板的schedule需要调整，麻烦你check一下你的email。" (The boss's schedule needs to be adjusted; please check your email.).
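As a rough illustration of the mixing in sentences like these, the following Python sketch splits a sentence into alternating Chinese and English runs. The function names `char_lang` and `language_runs` are hypothetical, and the character test is a deliberate simplification: it treats only the basic CJK Unified Ideographs block as Chinese and only ASCII letters as English.

```python
from itertools import groupby
from typing import List, Optional, Tuple


def char_lang(ch: str) -> Optional[str]:
    """Rough per-character label (assumption: two languages only)."""
    if "\u4e00" <= ch <= "\u9fff":  # basic CJK Unified Ideographs block
        return "zh"
    if ch.isascii() and ch.isalpha():  # plain Latin letters
        return "en"
    return None  # punctuation, digits, whitespace


def language_runs(sentence: str) -> List[Tuple[str, str]]:
    """Group consecutive characters with the same label into runs."""
    runs = []
    for lang, group in groupby(sentence, key=char_lang):
        if lang is not None:
            runs.append((lang, "".join(group)))
    return runs


print(language_runs("我的paper还没有ready。"))
# [('zh', '我的'), ('en', 'paper'), ('zh', '还没有'), ('en', 'ready')]
```

Each boundary between a `zh` run and an `en` run is a switch point that a recognizer has to handle within a single utterance.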

This kind of Chinese mixed with English appears frequently in daily communication. Besides English, words from other, less widely spoken languages also appear in Chinese sentences. This phenomenon is known academically as code-switching, and it is one of the major challenges facing speech recognition technology. For speech recognition systems such as smart speakers, smart voice assistants, and in-vehicle voice assistants, the challenges posed by code-switching fall mainly into the following three areas.

Challenges

Accented Non-Native Language

When using a non-native language, many people tend to speak with an accent carried over from the pronunciation habits of their mother tongue. Sometimes it is easy to tell which country a speaker comes from by the accent with which he or she speaks a foreign language. Besides the different pronunciation characteristics of different languages, the accents of speakers of a single language from different regions also differ greatly. In Mandarin, for example, research shows that about 80% of Mandarin speakers have dialect accents of varying degrees. The performance of a speech recognition application built for the standard language tends to drop significantly when the speaker has an accent.

Different Languages Have Different Phonemes

Hay and Bauer, in Linguistics Student's Handbook (2007), study the number of speakers and typological information for a range of languages, including the number of phonemes. In their plots of the results, the horizontal axis represents the population (log population) and the vertical axis represents the number of vowels. Each small circle represents a language. The left panel shows the case of basic vowels, and the right panel shows the case of extra vowels.

These results suggest that the number of phonemes is related to population size, and more generally that phoneme inventories differ from language to language. Acoustic models in speech recognition typically take the raw audio waveform of speech and predict the corresponding phoneme for each segment, or in some systems a character or subword unit. The language model guides the acoustic model, discarding predictions that are implausible given the grammar and the topic under discussion. Because code-switched speech contains multiple languages, its phoneme composition differs from that of any single language, which makes the hybrid acoustic model harder to build.
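To make the modeling difficulty concrete, here is a minimal Python sketch of two possible ways to build the output unit set of a hybrid acoustic model. The two inventories are toy, incomplete sets with simplified symbols, chosen only for illustration (they are assumptions, not real phoneme inventories), and the function names are hypothetical.

```python
from typing import Dict, List, Set

# Toy, incomplete inventories with simplified symbols (illustrative assumptions).
MANDARIN: Set[str] = {"i", "u", "y", "p", "t", "k", "s", "m", "ʂ", "ɕ"}
ENGLISH: Set[str] = {"i", "u", "æ", "p", "t", "k", "s", "m", "θ", "ð"}


def tagged_inventory(inventories: Dict[str, Set[str]]) -> List[str]:
    """Keep each language's units separate by prefixing a language tag."""
    return sorted(f"{lang}_{p}" for lang, phones in inventories.items() for p in phones)


def shared_inventory(inventories: Dict[str, Set[str]]) -> List[str]:
    """Merge units that use the same symbol into one unit shared across languages."""
    return sorted(set().union(*inventories.values()))


inventories = {"zh": MANDARIN, "en": ENGLISH}
tagged = tagged_inventory(inventories)
shared = shared_inventory(inventories)
print(len(tagged), len(shared))  # 20 13
```

The tagged set keeps language-specific pronunciations apart but leaves each unit with less training data, while the shared set pools data across languages at the cost of blurring real differences between similar sounds. Either way, the model has to cover units that no monolingual system needs.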

Scarcity of Annotated Code-mixing Datasets

The two problems above are both technical. The more fundamental problem for code-switch speech recognition is the scarcity of annotated mixed-language corpora. Because recording such data requires bilingual or even multilingual speakers, it costs more and takes much longer, so mixed-language speech corpora are very scarce. Some papers, such as Qinyanmin's "Data Augmentation for End-to-End Code-Switching Speech Recognition", use a TTS (text-to-speech) data augmentation scheme to improve the performance of code-switch speech recognition systems.
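The general shape of TTS-based augmentation can be sketched as follows. This is not the method of the paper cited above: the lexicon-substitution step, the `LEXICON` entries, the function names, and the `synthesize` callable (a stand-in for whatever TTS engine is available) are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon: Mandarin word -> English word a speaker might switch to.
LEXICON: Dict[str, str] = {"截止日期": "deadline", "论文": "paper", "邮件": "email"}


def make_code_switch_text(tokens: List[str], lexicon: Dict[str, str],
                          rate: float, rng: random.Random) -> List[str]:
    """Replace each token found in the lexicon with probability `rate`."""
    return [lexicon[t] if t in lexicon and rng.random() < rate else t for t in tokens]


def augment(corpus: List[List[str]], synthesize: Callable[[str], bytes],
            lexicon: Dict[str, str], rate: float = 0.5,
            seed: int = 0) -> List[Tuple[bytes, str]]:
    """Return (audio, transcript) pairs for synthetic code-switched sentences."""
    rng = random.Random(seed)
    pairs = []
    for tokens in corpus:
        mixed = make_code_switch_text(tokens, lexicon, rate, rng)
        if mixed != tokens:  # keep only sentences that actually switch language
            text = "".join(mixed)
            pairs.append((synthesize(text), text))
    return pairs


# Stub synthesizer so the sketch runs without a TTS engine (assumption).
def fake_tts(text: str) -> bytes:
    return b"\x00" * len(text)


print(augment([["明天", "是", "截止日期"]], fake_tts, LEXICON, rate=1.0))
```

The synthetic pairs would be added to the real recordings during training. Naive word substitution can produce unnatural sentences, and synthetic voices cover far fewer speakers and accents than real recordings, so this supplements recorded code-switch data rather than replacing it.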

Solution

Although the above are three separate challenges, the essence of the solution lies in the data. Given enough code-switch speech data, a neural network can learn the relevant accents, the diverse phoneme information, and the other effects of code-switching directly from that data, and the recognition system improves accordingly. For recording code-switch speech data, the professional data company Magic Data can help researchers save considerable time and cost in data collection and annotation, so that they can focus more on modeling. At present, the company has dozens of ready-to-use, licensable code-switch datasets covering multiple scenarios and languages, for example:

Chinese-English Code-Mixing Speech Corpus—Command&Query

MDT-ASR-E030 Turkish-English Code-Mixing Speech Corpus

Check out www.magicdatatech.com/datasets for more datasets.

