
AI Generated Video – The Next Step in AI Creativity

Date: 2022-11-03

While we are immersed in TikTok and Kuaishou, the world is quietly reshaping what content creation means. Until recently, AI tools such as DALL-E, Midjourney, and Craiyon let ordinary users type a short text prompt and receive an artistic illustration generated by artificial intelligence.

Recently, Meta and Google have taken this a step further, each unveiling systems that generate video from text. These models do not just produce pictures; they turn a user's written description of a scene into a matching short video. The results are striking, and Meta has published samples at make-a-video.github.io.

Not to be outdone, Google has announced two video generation models of its own: Imagen Video and Phenaki. According to Google CEO Sundar Pichai, Imagen Video offers higher resolution than Meta's Make-A-Video and can generate 1280×768 video clips at 24 frames per second. Samples: imagen.research.google/video

Phenaki, meanwhile, can generate more than two minutes of video from a text description of roughly 200 words, telling a complete short story like a miniature director. Samples: phenaki.video

What technology lies behind these text-to-video systems?

The model architecture of Make-A-Video is shown below. It builds on existing text-to-image technology: the idea is that paired text-image data teaches the model what the world looks like and how it is described, while unsupervised video footage teaches it how the world moves, including realistic camera motion.

First, the authors decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, they design a spatio-temporal pipeline to generate high-resolution, high-frame-rate videos, consisting of a video decoder, a frame-interpolation model, and two super-resolution models, which together support a range of text-driven generation applications, including text-to-video.

From the paper: https://arxiv.org/pdf/2209.14792.pdf

The high-level architecture of Make-A-Video works as follows: the input text x is translated into an image embedding by a prior P; given a desired frame rate fps, the decoder D^t generates 16 frames at 64×64 resolution; these are interpolated to a higher frame rate by ↑F; the resolution is then raised to 256×256 by SR_l^t and to 768×768 by SR_h, finally producing a video ŷ with high spatial and temporal resolution.
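To make that cascade easier to follow, here is a shape-level sketch in Python. Every stage is a stand-in (NumPy placeholders and nearest-neighbour upsampling), since Meta has not released the model; only the tensor shapes follow the description above.

# Shape-level sketch of the Make-A-Video inference cascade described above.
# All stages are hypothetical stand-ins, not Meta's real models.
import numpy as np

def prior(text: str) -> np.ndarray:
    """Stand-in for the prior P: maps text to an image embedding."""
    return np.zeros(768)  # embedding width is an assumption, for illustration only

def decoder_t(image_emb: np.ndarray, fps: int) -> np.ndarray:
    """Stand-in for D^t: generates 16 RGB frames at 64x64."""
    return np.zeros((16, 64, 64, 3))

def interpolate_frames(video: np.ndarray, factor: int = 4) -> np.ndarray:
    """Stand-in for the frame interpolation ↑F: raise the frame rate."""
    return np.repeat(video, factor, axis=0)

def upsample(video: np.ndarray, size: int) -> np.ndarray:
    """Stand-in for SR_l^t / SR_h: nearest-neighbour spatial upsampling."""
    scale = size // video.shape[1]
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

emb = prior("a dog wearing a superhero cape flying through the sky")
video = decoder_t(emb, fps=24)     # (16, 64, 64, 3)
video = interpolate_frames(video)  # (64, 64, 64, 3), higher frame rate
video = upsample(video, 256)       # (64, 256, 256, 3), SR_l^t
video = upsample(video, 768)       # (64, 768, 768, 3), SR_h -> final video ŷ
print(video.shape)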

Imagen Video builds on the recently popular diffusion-model approach and directly inherits from Imagen, Google's state-of-the-art image generation model.

Beyond its high resolution, Imagen Video demonstrates three notable abilities.

First, it can understand and generate works in a range of artistic styles, and an object's 3D structure stays consistent as it is shown rotating. Imagen Video is in fact a cascade of models. The language model is Google's own T5-XXL, and this text encoder is frozen after training; it is responsible only for encoding the text prompt, while the actual text-to-video generation is handed to the downstream video diffusion models. From that encoding, the base model first generates a low-resolution clip of 48×24 pixels at 3 frames per second, which subsequent super-resolution models extend in time and upsample in space. The flow from text prompt to generated video is shown in the following figure:

From the paper: https://imagen.research.google/video/paper.pdf
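As a rough illustration of that cascade, the following Python sketch mirrors the stages described above. The functions are stand-ins for the frozen T5-XXL encoder, the base video diffusion model, and the temporal/spatial super-resolution models; the embedding width and upsampling factors are assumptions, not values from the paper.

# Shape-level sketch of the Imagen Video cascade described above.
# Every function is a stand-in; only the 48x24, 3 fps base clip matches the article.
import numpy as np

def t5_xxl_encode(prompt: str) -> np.ndarray:
    """Stand-in for the frozen T5-XXL text encoder."""
    return np.zeros((len(prompt.split()), 4096))  # embedding width is an assumption

def base_video_diffusion(text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the base video model: 16 frames of 48x24 at 3 fps."""
    return np.zeros((16, 24, 48, 3))

def temporal_super_resolution(video: np.ndarray, factor: int) -> np.ndarray:
    """Stand-in TSR: raise the frame rate."""
    return np.repeat(video, factor, axis=0)

def spatial_super_resolution(video: np.ndarray, factor: int) -> np.ndarray:
    """Stand-in SSR: raise the spatial resolution."""
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

emb = t5_xxl_encode("a teddy bear washing dishes")
clip = base_video_diffusion(emb)                  # (16, 24, 48, 3) at 3 fps
clip = temporal_super_resolution(clip, factor=8)  # more frames, toward 24 fps
clip = spatial_super_resolution(clip, factor=4)   # 96 x 192
clip = spatial_super_resolution(clip, factor=4)   # 384 x 768
# The real cascade reaches 1280x768 at 24 fps; the factors here are simplified.
print(clip.shape)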

Before Phenaki, AI models could generate ultra-short clips from a single prompt but could not produce a coherent two-minute video. Phenaki can follow an imaginative storyline and generate more than two minutes of video.

The researchers introduce a new causal model for learning video representations, treating a video as a temporal sequence of images. The model is Transformer-based: it compresses the video into small discrete token representations and processes them in causal temporal order. Concretely, each frame is first encoded by a spatial Transformer, and the per-frame encodings are then linked across time by a causal Transformer, so that each step depends only on what came before. Its flow chart is as follows:

From the paper: https://openreview.net/pdf?id=vOEXS39nOF
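To make the spatial-then-causal ordering concrete, here is a small PyTorch sketch. The layer sizes, patch counts, and pooling are invented for illustration; this is not Phenaki's published C-ViViT encoder, only a toy showing per-frame spatial attention followed by causally masked attention across time.

# Toy sketch of the "spatial Transformer per frame, then causal Transformer
# across time" ordering described above. All dimensions are invented.
import torch
import torch.nn as nn

d_model, n_frames, patches_per_frame = 64, 8, 16

spatial = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

# One embedding per patch per frame: (batch, frames, patches, d_model)
video_tokens = torch.randn(1, n_frames, patches_per_frame, d_model)

# 1) Spatial Transformer: attend within each frame independently.
per_frame = spatial(video_tokens.flatten(0, 1))   # (frames, patches, d_model)
frame_repr = per_frame.mean(dim=1).unsqueeze(0)   # (1, frames, d_model) summary per frame

# 2) Causal Transformer: attend across time with a causal mask, so each
#    frame representation only sees earlier frames.
causal_mask = nn.Transformer.generate_square_subsequent_mask(n_frames)
video_repr = temporal(frame_repr, mask=causal_mask)  # (1, frames, d_model)
print(video_repr.shape)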

The impact of text-to-video generation

As text-to-video technology develops rapidly, the content on major short-video platforms may one day no longer be live-action footage but synthetically generated video.

AI is reshaping industries, bringing challenges alongside progress. Daniel Jeffries, Stability AI's new chief information officer, argues that AI will ultimately create more jobs. Challenges and opportunities always coexist, and those who keep pace with the times are best placed to build a better future.
