Conversations with AI have become increasingly common in daily life, and state-of-the-art AI technology has brought us much convenience and enjoyment. The rapid development of conversational AI has led to many applications, one of which is LaMDA, recently released by Google, which can hold a conversation on any given topic. According to IDC, a market research company, China's conversational AI market will reach USD 1.86 billion by 2023, with a compound annual growth rate (CAGR) of 34.0% from 2019 to 2023.
A batch of datasets for conversational AI was recently released on MagicHub.io, our open-source community. Let’s take a quick look.
Mandarin Chinese Conversational Speech Corpus – Web Meeting: This open-source dataset consists of 5.2 hours of transcribed Mandarin Chinese conversational speech from web meetings held between laptops and mobile phones. Click here to download.
Zhengzhou Dialect Conversational Speech Corpus: This open-source dataset consists of 4 hours of transcribed Zhengzhou dialect conversational speech on certain topics. Click here to download.
In addition, the English & Czech Telephone Conversation Data from Vystadial, developed for training acoustic models for automatic speech recognition in spoken dialogue systems, can also be downloaded from our community. Click here to download.
In addition to datasets for conversational AI, there are also some scripted speech corpora.
German Scripted Speech Corpus – Command and Query: This open-source dataset consists of 0.71 hours of transcribed German scripted speech focusing on commands and queries, comprising 597 utterances contributed by ten speakers. Click here to download.
Zhengzhou Dialect Scripted Speech Corpus – Daily Use Sentence: This open-source dataset consists of 5 hours of transcribed Zhengzhou dialect scripted speech focusing on daily-use sentences, comprising 5,132 utterances contributed by ten speakers. Click here to download.
Lastly, the Chinese-English Parallel Corpus, which consists of one hundred finance-related daily-use sentences translated from Chinese to English, may also deserve your attention. Click here to download.
To date, more than 40 datasets have been released on MagicHub.io. To browse and download more datasets, please visit https://magichub.io/category/datasets/.
We will continue to release more datasets to the community. Stay tuned!
Annotator® 5.0, an independently developed data annotation system, was officially launched by Magic Data Tech on July 8, 2021 at the World Artificial Intelligence Conference (WAIC) 2021, a global gathering for exchanging AI innovation ideas, technologies, and applications, held in Shanghai from July 7 to July 10.
Collecting and sorting data has always been a time-consuming and tedious task for AI developers and researchers. Here we list 20 sites where high-quality data is freely available, in the hope of helping you find the right dataset for your AI model more easily.
Recently, the Beijing Municipal Bureau of Economy and Information Technology officially released the list of the “First Group of Top Specialized, Fine, Special and Innovative Companies”. Magic Data Tech was named on the list for its professional and innovative services in the field of AI data services.
We are proud to announce that Magic Data Tech has been named the “Best Supplier of Alibaba Cloud 2021”.
On May 20, 2021, Intel published the 5th issue of its AI 100 Acceleration Program list at the 2021 Shenzhen (International) Artificial Intelligence Exhibition, and Magic Data Tech was selected for the program on the strength of its innovation capabilities.