According to Reuters, Amazon will upgrade Alexa with a voice-cloning capability that can simulate a human voice from less than a minute of recorded audio. Rohit Prasad, Amazon's senior vice president, said that many people lost loved ones during the pandemic, and the new feature can help their memory live on. This is a benefit for people with speech impairments or those grieving a relative, but for those with ulterior motives it is a tool of deception enabled by AI, namely DeepFake. DeepFake technology has risen rapidly in recent years and is widely used in political smearing, military deception, economic crime, and even terrorism between countries, posing new threats to political, economic, social, and national security.
Discussion about responsible AI -- the practice of designing, developing, and deploying AI with good intentions, to empower employees and businesses and to impact customers and society fairly -- is widespread.
In general, DeepFake refers to replacing the face in an image or video, or the voice in a recording, through AI synthesis technology. For example, in 2017 a Reddit user named "deepfakes" posted fake videos that mapped the faces of actresses such as Scarlett Johansson onto porn performers. Similarly, a fraud gang recently exposed by CCTV used robots to place 17 million automated harassing calls for overseas fraud organizations, screened out more than 800,000 "valid" victims, and collected nearly 180 million yuan in recruitment commissions. Compared with synthetic forgery of images and videos, audio synthesis is easier, more widely used in daily life, and more likely to be exploited by fraud gangs. Within speech synthesis, voice conversion -- converting one person's timbre into another person's timbre -- is the mainstream AI voice-cloning technique.
A typical speech conversion scheme consists of speech analysis, mapping, and reconstruction modules, as shown in the figure below; this is known as the analysis-mapping-reconstruction pipeline. The model extracts speaker representations for both the source and target speakers, then replaces the source speaker's representation in the speech with that of the target speaker. The figure is from Berrak Sisman's paper "An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning". Acoustic models for this task have evolved from traditional statistical models to deep learning models.
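The analysis-mapping-reconstruction pipeline can be sketched in code. This is a deliberately simplified illustration, not a real system: actual models extract mel-spectrograms and F0, learn the speaker mapping with a neural acoustic model, and reconstruct audio with a neural vocoder. Here the three stages are stand-in functions, and speaker identity is modeled as a simple additive bias over magnitude spectra.

```python
import numpy as np

# Illustrative sketch of analysis-mapping-reconstruction.
# All three functions are simplified stand-ins for real modules.

def analyze(waveform: np.ndarray, frame_size: int = 256) -> np.ndarray:
    """Analysis: split the waveform into frames and extract spectral
    features (a real system would use mel-spectrograms, F0, etc.)."""
    n_frames = len(waveform) // frame_size
    frames = waveform[: n_frames * frame_size].reshape(n_frames, frame_size)
    return np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectra

def map_speaker(features: np.ndarray,
                source_emb: np.ndarray,
                target_emb: np.ndarray) -> np.ndarray:
    """Mapping: swap speaker identity. Here identity is a simple
    additive bias; real systems learn this mapping with a neural
    acoustic model conditioned on speaker embeddings."""
    return features - source_emb + target_emb

def reconstruct(features: np.ndarray) -> np.ndarray:
    """Reconstruction: invert features back to a waveform (a real
    system would use a neural vocoder such as HiFi-GAN)."""
    frames = np.fft.irfft(features, axis=1)
    return frames.reshape(-1)

# Toy usage: "convert" a random source utterance.
rng = np.random.default_rng(0)
source_wave = rng.standard_normal(1024)
feats = analyze(source_wave)
src_emb = feats.mean(axis=0)   # stand-in source speaker embedding
tgt_emb = src_emb * 0.8        # stand-in target speaker embedding
converted = reconstruct(map_speaker(feats, src_emb, tgt_emb))
print(converted.shape)
```

The structure mirrors the figure: extract features, substitute the speaker representation, then synthesize a waveform from the modified features.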
The voice conversion technology above requires professional TTS/VC data for training. Depending on the design of the algorithm, the data can be parallel or non-parallel. Magic Data provides professional TTS data services covering dozens of languages, including English, Mandarin, Portuguese, and Korean.
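The difference between parallel and non-parallel corpora can be made concrete with a small sketch. The file names and layout below are hypothetical, invented for illustration only:

```python
# Parallel corpus: source and target speakers read the SAME sentences,
# so utterances can be paired one-to-one and time-aligned for training.
parallel_corpus = [
    {"text": "The weather is nice today.",
     "source": "spk_A/utt_001.wav", "target": "spk_B/utt_001.wav"},
    {"text": "Please close the door.",
     "source": "spk_A/utt_002.wav", "target": "spk_B/utt_002.wav"},
]

# Non-parallel corpus: each speaker reads DIFFERENT sentences; the model
# (e.g. CycleGAN-VC, or an auto-encoder with speaker embeddings) must
# learn the conversion without any paired utterances.
non_parallel_corpus = {
    "spk_A": ["spk_A/utt_101.wav", "spk_A/utt_102.wav"],
    "spk_B": ["spk_B/utt_201.wav", "spk_B/utt_202.wav"],
}

print(len(parallel_corpus), len(non_parallel_corpus["spk_A"]))
```

Parallel data is easier to train on but more expensive to record; non-parallel methods trade modeling difficulty for much cheaper data collection.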
Detection of Fake Voice
While voice conversion brings benefits to mass media, it may also give criminals opportunities for fraud. Fake-speech detection has therefore become a new direction in the field of deep learning. Most mainstream detection techniques are themselves based on deep learning models: a discriminator is added on top of the downstream task to judge whether the data fed to the model is genuine or synthetic. Training both the discriminative model and the synthesis model depends on the support of TTS data.
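The discriminator idea can be shown with a minimal sketch: a binary classifier over utterance-level features that labels input as genuine or fake. Real detection systems use deep networks over spectrograms (as in the ASVspoof challenges); the toy logistic-regression classifier and synthetic features below only illustrate the training principle.

```python
import numpy as np

# Minimal fake-voice discriminator sketch: logistic regression over
# toy utterance-level features. The data is synthetic for illustration:
# "genuine" features cluster around +1, "fake" features around -1.
rng = np.random.default_rng(42)
real = rng.normal(loc=1.0, scale=0.5, size=(100, 8))
fake = rng.normal(loc=-1.0, scale=0.5, size=(100, 8))
X = np.vstack([real, fake])
y = np.concatenate([np.ones(100), np.zeros(100)])  # 1 = genuine, 0 = fake

w = np.zeros(8)
b = 0.0
lr = 0.1
for _ in range(200):  # gradient descent on the logistic (log-loss) objective
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = np.mean(pred == y)
print(f"training accuracy: {accuracy:.2f}")
```

In a production detector, the features would come from a learned front-end and the classifier would be a deep network, but the structure is the same: a discriminator trained on labeled genuine and synthetic speech.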
Deep learning models bring infinite possibilities to the development of artificial intelligence. At the same time, the principles of making AI systems transparent, fair, secure, and inclusive should be borne in mind when building advanced AI. Only in this way can we trust AI and scale it up with confidence.