Machine Translation—Will Machines Ever Be Able to Speak and Understand?

Date: 2022-08-26

Since 2013, with the development of deep neural networks, the quality of machine translation has improved significantly, but it has not yet reached the point where a system can "understand" the language it translates. More than 7,000 languages have been identified in the world. Among them, Chinese, English, Spanish, Russian, Arabic and French are major world languages and the six official languages of the United Nations. The top ten most spoken languages in the world are, in order: Chinese, English, Russian, Spanish, Hindi, Arabic, Portuguese, Bengali, German and Japanese.

Barriers to Machine Translation

At present, the best-known translation service, Google Translate, supports only a little over 100 languages, and its translation accuracy for some low-resource languages is only about 60%. Here are two examples of common translation errors:

The two examples above show that the main cause of error is that machines struggle with free (sense-for-sense) translation. They can only imitate existing patterns mechanically, so the meaning is not fully preserved after translation.

Another barrier to machine translation is the scarcity of corpora for low-resource languages. Leaving aside the goal of covering every language in the world, consider China's Belt and Road Initiative alone: its cooperation documents already involve more than 110 languages. There are 65 countries along the Belt and Road, and the 64 countries other than China use about 80 languages. Because many countries share the same official language, 56 official and common languages are actually in use, spanning the Sino-Tibetan, Indo-European, Uralic, Altaic, Semito-Hamitic (Afroasiatic), Caucasian and Dravidian language families. In addition, there are countless ethnic languages and dialects. For a variety of reasons, some of these countries have not formally documented their own languages, so a parallel corpus for the corresponding language is very difficult to obtain.

The Future of Machine Translation

To address the difficulty of conveying meaning accurately in free translation, researchers have, on the one hand, begun to combine multi-task learning with machine translation, using knowledge graphs to infer the surrounding context and then predict and correct the current sentence. On the other hand, enlarging the share of folk sayings and idioms in the training corpus lets a deep model learn the scenarios in which they occur and how they are used, that is, it introduces "knowledge" to the machine. For example, the phrase "China Pakistan" is ambiguous in the Chinese original: "Ba" is an abbreviation shared by several countries, for instance Pakistan and Brazil, and without context the machine cannot determine which country is meant.
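The "Ba" ambiguity can be illustrated with a small sketch. It is a toy under stated assumptions, not how any production translation system works: the cue-word table and the `resolve_ba` function are hypothetical names introduced here, and a real system would draw context from a knowledge graph or a learned model rather than a hand-written list.

```python
# Hypothetical cue words for two countries that the abbreviation "Ba" may
# stand for. The table is illustrative, not taken from any real system.
CANDIDATES = {
    "Pakistan": {"islamabad", "karachi", "gwadar"},
    "Brazil": {"brasilia", "amazon", "soybean"},
}


def resolve_ba(context_words):
    """Pick the candidate whose cue words overlap the context the most.

    Returns None when no cue word appears or when the top two candidates tie,
    which mirrors the case where the machine has no usable context.
    """
    words = {w.lower() for w in context_words}
    scores = {name: len(cues & words) for name, cues in CANDIDATES.items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return None
    return max(scores, key=scores.get)


# With a context cue the abbreviation can be resolved; without one it cannot.
print(resolve_ba("the Gwadar port project".split()))
print(resolve_ba("bilateral trade talks".split()))
```

The point of the sketch is only that the decision depends entirely on the surrounding words: with no cue in the context, the function has nothing to go on and gives up, just as the article describes.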

To address the scarcity of corpora for low-resource languages, professional data providers can help researchers collect more of this data faster. Building a corpus also requires professional data collection, labeling and cleaning, as well as guidance from linguistic experts.

But even once a low-resource corpus has been collected, it cannot match the volume of data available for the ten most spoken languages. It is therefore also necessary to use low-resource transfer learning, model adaptation and other deep learning methods to transfer translation models that already work well for English or Chinese to low-resource languages, making translation for those languages feasible.
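The idea behind transfer learning can be sketched with a deliberately tiny stand-in. The code below is not a translation model: it fits a one-variable linear model, and the "parent" and "child" datasets, the `train` and `mse` functions, and all numbers are assumptions made up for illustration. It shows only the warm-start principle: a model first trained on plentiful data from a related task needs far fewer updates on the scarce data than a model trained from scratch.

```python
def train(data, w=0.0, b=0.0, steps=100, lr=0.05):
    """Fit y = w * x + b by plain gradient descent on mean squared error."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(data, w, b):
    """Mean squared error of the model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# "High-resource" parent task: many samples of y = 2.0 * x + 1.0.
parent = [(k / 10, 2.0 * (k / 10) + 1.0) for k in range(-20, 21)]
# "Low-resource" child task: three samples of a related function.
child = [(x, 2.2 * x + 0.8) for x in (-1.0, 0.0, 1.0)]

w_parent, b_parent = train(parent, steps=500)

# Same small budget of five updates on the child data in both cases.
scratch = train(child, steps=5)
warm = train(child, w=w_parent, b=b_parent, steps=5)

print("from scratch:", mse(child, *scratch))
print("warm start:  ", mse(child, *warm))
```

In this toy setting the warm-started model ends with a much lower error on the child data, because it starts close to a good solution. Real low-resource translation work applies the same principle to neural translation models, where the parent is typically a model trained on a language pair with abundant parallel text.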

Recommendations:

MDT-NLP-F002 Chinese Bahasa Indonesia Parallel Corpus

MDT-NLP-F005 Chinese English Filipino Parallel Corpus
