Nearly one thousand car companies recently attended the Shanghai Auto Show, where electrification and intelligent features were standard for many exhibitors. Functions such as intelligent cabins, self-driving, and cloud services depict the future of intelligent cars.
It is expected that by 2025, the global number of intelligent connected vehicles will approach 74 million, including 28 million in China. The intelligent vehicle industry is entering a golden period of development.
To enable human-computer interaction in vehicles, sufficient computing power and algorithms for speech recognition, speech synthesis, and natural language processing are essential. These algorithms, in turn, benefit from large volumes of accurate, well-matched data. Just as electricity powers electric cars, data powers AI.
To support the implementation and optimization of intelligent cars, Magic Data Tech recently updated two open-source in-vehicle datasets on MagicHub: the In-Vehicle Noise Datasets and the Mandarin Chinese Scripted Speech Corpus—In-Vehicle Scene.
In-Vehicle Noise Datasets
This open-source dataset consists of in-vehicle noise from multiple sources, which may include tire noise, engine noise, radio, and human voices. (Click here to download)
Mandarin Chinese Scripted Speech Corpus—In-Vehicle Scene
This open-source dataset contains transcribed Mandarin Chinese scripted speech focused on commands and queries in vehicle-related scenes: 5,948 utterances contributed by ten speakers. Notably, two microphones were used during recording, one at the sun visor and another near the speaker’s mouth, at the front passenger seat. Two synchronous audio channels were therefore recorded. (Click here to download)
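One common way developers combine a noise collection with a speech corpus is to mix noise into clean speech at a chosen signal-to-noise ratio (SNR) to simulate in-car conditions. The sketch below is only an illustration of that general technique, not a workflow described by MagicHub. It uses the Python standard library and synthetic samples; the function names `rms` and `mix_at_snr` are hypothetical, and nothing here assumes the actual file layout or audio format of either dataset.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the speech-to-noise level ratio equals snr_db.

    Both inputs are lists of float samples at the same (assumed) sample rate.
    The noise is repeated or trimmed to match the length of the speech.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Tile the noise so it covers the whole utterance, then trim it.
    repeats = len(speech) // len(noise) + 1
    noise = (noise * repeats)[:len(speech)]

    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")

    # SNR(dB) = 20 * log10(rms_speech / rms_noise), solved for the noise level.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms

    return [s + gain * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    # Synthetic stand-ins: a 440 Hz tone as "speech", uniform noise as "cabin noise".
    sample_rate = 16000  # assumed value for the demo only
    speech = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
    rng = random.Random(0)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]

    noisy = mix_at_snr(speech, noise, snr_db=10.0)
    print(len(noisy), "samples mixed at 10 dB SNR")
```

With real recordings, the sample lists would come from decoded audio files, and the same function could be applied to each of the two microphone channels separately.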
MagicHub will continue to provide more standardized datasets covering multiple dimensions and diverse scenes for AI developers.