Conversations with AI have become increasingly common in our daily lives, and state-of-the-art AI technology has brought us much convenience and happiness. The explosive development of conversational AI has led to many applications, one of which is LaMDA, recently released by Google, which can carry on a conversation on any given topic. According to IDC, a market research company, China's conversational AI market will reach 1.86 billion USD by 2023, with a compound annual growth rate (CAGR) of 34.0% for 2019-2023.
A batch of datasets for conversational AI was recently released on MagicHub.io, our open-source community. Let’s take a quick look.
Mandarin Chinese Conversational Speech Corpus – Web Meeting: This open-source dataset consists of 5.2 hours of transcribed Mandarin Chinese conversational speech from web meetings between laptops and mobile phones. Click here to download.
Zhengzhou Dialect Conversational Speech Corpus: This open-source dataset consists of 4 hours of transcribed Zhengzhou dialect conversational speech on selected topics. Click here to download.
In addition, the English & Czech Telephone Conversation Data from Vystadial, developed for training acoustic models for automatic speech recognition in spoken dialogue systems, can also be downloaded via our community. Click here to download.
Beyond the datasets for conversational AI, there are also some scripted speech corpora.
German Scripted Speech Corpus – Command and Query: This open-source dataset consists of 0.71 hours of transcribed German scripted speech focusing on commands and queries, comprising 597 utterances contributed by ten speakers. Click here to download.
Zhengzhou Dialect Scripted Speech Corpus – Daily Use Sentence: This open-source dataset consists of 5 hours of transcribed Zhengzhou dialect scripted speech focusing on daily use sentences, comprising 5,132 utterances contributed by ten speakers. Click here to download.
Finally, the Chinese-English Parallel Corpus, which consists of one hundred finance-related daily use sentences translated from Chinese to English, may also deserve your attention. Click here to download.
To date, more than 40 datasets have been released on MagicHub.io. To browse and download more datasets, please visit https://magichub.io/category/datasets/.
We will continue to release more datasets on the community. Stay tuned!