Do Speech Recognition Applications Perform Better than the Human Ear?

Date: 2022-07-27

In recent years, advances in artificial intelligence have significantly improved the performance of speech recognition applications. Many companies claim that the accuracy of their speech recognition technology exceeds 98%. Has speech recognition surpassed the human ear? There is more to discuss before drawing a final conclusion.

A saying on the Internet goes like this: quoting an accuracy figure without naming the test set is acting like a hooligan. In a quiet environment the accuracy is about 98%, but in a noisy environment it drops rapidly. At a party, it is difficult for a speech recognition application to pick out the target speaker's speech from overlapping voices, and even more difficult to recognize it accurately. This is a classic problem in the field of speech recognition, the Cocktail Party Problem. Humans instinctively pick out the voice they want to attend to from a mixture of voices. For a machine, however, the mixture is a jumble: the target speech must first be isolated with speech separation technology and only then recognized.
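The article does not say how the quoted accuracy figures are measured. A common convention, assumed here, is to report accuracy as one minus the word error rate (WER) on a given test set. The sketch below is a minimal illustration of that metric; the function name and the example sentences are hypothetical and do not come from any particular product.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical transcripts of the same utterance in two conditions.
reference = "turn on the living room lights"
quiet_wer = word_error_rate(reference, "turn on the living room lights")
noisy_wer = word_error_rate(reference, "turn on living room flights")
```

Here the same utterance scores differently depending on the recording condition, which is why an accuracy figure means little without the test set it was measured on.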

Speech Separation Algorithm Based on Neural Network

Speech separation is the first step in solving the "cocktail party" problem in speech recognition. Adding speech separation to the front end of a speech recognition system, so that the target speaker's voice is separated from interference, improves the robustness of the system. The cocktail party problem refers to the presence, in the captured audio signal, of other people's voices and background noise alongside the main speaker. The goal of speech separation is to separate the main speaker's speech from these disturbances.

The current mainstream speech separation algorithms are based on neural networks. The network's main task is to learn an ideal binary mask (IBM) that indicates in which time-frequency units of the spectrum the target signal dominates. If an auditory signal is represented along two dimensions, time and frequency, it can be written as a two-dimensional matrix, and each element of this matrix is called a time-frequency unit. If a fine-grained decision is not needed and each unit only has to be assigned to one of two classes, target sound source or background noise, then each time-frequency unit can be labeled with one of two values, such as 1 and 0, which makes the mask binary. Thus, from the perspective of ideal binary masking, speech separation becomes a supervised classification problem.
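The ideal binary mask described above can be sketched in a few lines. This is a minimal illustration, not the method of any specific system: it assumes the magnitude spectrograms of the target and the interference are known separately (as they are when building training labels), uses a hypothetical 0 dB local SNR threshold, and approximates the mixture magnitude as the sum of the two magnitudes.

```python
import math


def ideal_binary_mask(target_mag, noise_mag, threshold_db=0.0):
    """Label each time-frequency unit 1 if the target dominates, else 0."""
    eps = 1e-12  # keeps the ratio finite when a unit is silent
    mask = []
    for target_row, noise_row in zip(target_mag, noise_mag):
        row = []
        for t, n in zip(target_row, noise_row):
            local_snr_db = 20.0 * math.log10((t + eps) / (n + eps))
            row.append(1 if local_snr_db > threshold_db else 0)
        mask.append(row)
    return mask


# Toy magnitude spectrograms: rows are frequency bins, columns are time frames.
target = [[4.0, 0.5, 3.0],
          [0.2, 2.0, 0.1]]
noise = [[1.0, 1.0, 1.0],
         [1.0, 1.0, 1.0]]

mask = ideal_binary_mask(target, noise)

# Simplifying assumption: mixture magnitude = target + noise magnitude.
mixture = [[t + n for t, n in zip(t_row, n_row)]
           for t_row, n_row in zip(target, noise)]

# Applying the mask keeps only the units where the target dominates.
separated = [[m * x for m, x in zip(m_row, x_row)]
             for m_row, x_row in zip(mask, mixture)]
```

During training, a network would see only the mixture and be asked to predict `mask` for each time-frequency unit, which is the supervised classification setup the paragraph describes.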

Multimodal Combination Speech Separation Algorithm

In addition to the audio-only speech separation algorithms described above, many recent articles tackle the cocktail party problem with multimodal methods. Google collected 100,000 high-quality lecture and talk videos from YouTube to generate training samples and, from about 2,000 hours of video clips, trained a multi-stream convolutional neural network (CNN)-based model to split synthetic cocktail party clips into a separate audio stream for each speaker in the video. In the experiments, the input was a video of one or more people speaking while disturbed by other speakers or by a noisy background. The output decomposes the audio track of the input video into clean audio tracks, each matched to the corresponding speaker.

Whether multimodal or single-modal, a speech separation algorithm depends on conversational speech data. However, collecting and annotating conversational speech data is costly and challenging. As a globally leading AI data solution provider, Magic Data focuses on improving the data collection and annotation process for the AI industry. The company is dedicated to promoting the development of conversational AI by building the largest conversational training dataset, in over 60 languages and dialects, covering customer service, virtual assistants, machine translation, and many other AI scenarios. With more than 400 licensable off-the-shelf datasets that can be used for ASR, TTS, and NLP model training, the company supports AI R&D and commercialization in a cost-effective way.

Recommended Datasets

MDT-ASR-F069 American English Conversational Speech Corpus

MDT-ASR-B007 Residential Noise Dataset

Check out www.magicdatatech.com/datasets for more information.
