Overview #
- Emu Video is a text-to-video generation model from Meta
- It takes a text prompt as input, and can optionally take an image as an additional input
- The approach is pretty straightforward
- Generate an image from the text
- Generate a video from the image and text
- The idea is that this splits the problem into two simpler sub-problems, and the video generation step gets a stronger conditioning signal from the two inputs (see the sketch at the end of this section)
- This model outperforms commercial solutions like Runway Gen2
- Goals
- The goal of a text-to-video model is to take a text prompt as input and generate a video consisting of T RGB frames
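To make the two-step factorization concrete, here is a minimal sketch of the pipeline. The `text_to_image` and `image_and_text_to_video` functions are hypothetical placeholders standing in for the two diffusion models, not Emu Video's actual API:

```python
import numpy as np

def text_to_image(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for the first-stage text-to-image diffusion model."""
    # Returns a single H x W x 3 RGB image conditioned on the prompt.
    return np.zeros((512, 512, 3), dtype=np.uint8)

def image_and_text_to_video(image: np.ndarray, prompt: str, num_frames: int = 16) -> np.ndarray:
    """Hypothetical stand-in for the second-stage video diffusion model.

    It conditions on BOTH the prompt and the generated first frame,
    which is the stronger conditioning signal mentioned above.
    """
    # Returns T x H x W x 3 RGB frames; here it just repeats the image as a placeholder.
    return np.repeat(image[None, ...], num_frames, axis=0)

def generate_video(prompt: str, num_frames: int = 16) -> np.ndarray:
    # Step 1: generate an image from the text.
    first_frame = text_to_image(prompt)
    # Step 2: generate the video from the image and the text.
    return image_and_text_to_video(first_frame, prompt, num_frames)

video = generate_video("a corgi surfing a wave at sunset")
print(video.shape)  # (16, 512, 512, 3) -> T RGB frames
```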
History #
- Large scale text-to-image models are trained on large datasets of text-image pairs
- While these models can be adapted to text-to-video generation by training on video-text pairs, the resulting videos lag behind in quality and diversity
- Video generation is more challenging and requires modeling a higher dimensional spatiotemporal output space than images
- Video-text datasets are also usually much smaller than image-text datasets
- The dominant paradigm in video generation is to use diffusion models to generate all video frames at once
- So why aren’t video generations autoregressive like in NLP?
- Generating even a single frame with a diffusion model already requires many denoising iterations, so producing frames one at a time autoregressively would be very slow (see the rough tally below)
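A rough back-of-the-envelope tally, with assumed numbers (a 16-frame clip and 50 denoising steps per diffusion sample), illustrates why frame-by-frame autoregressive generation is unattractive compared to denoising all frames at once:

```python
num_frames = 16      # assumed clip length T
denoise_steps = 50   # assumed diffusion sampling steps per generation

# Autoregressive: each frame is its own diffusion run, and frames must be
# produced one after another, so the sequential step count multiplies.
autoregressive_sequential_steps = num_frames * denoise_steps  # 800

# Joint (all-frames-at-once) diffusion: each denoising step updates every
# frame in parallel, so the sequential depth stays at the step count.
joint_sequential_steps = denoise_steps  # 50

print(autoregressive_sequential_steps, joint_sequential_steps)
```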
- Text-to-video (T2V) generation
- Most prior works leverage text-to-image (T2I) models
- Several works try to inject motion information into a T2I model
- This offers limited quality
- Many other prior works instead try to learn a direct mapping from text-to-video
- This is typically accomplished by adding temporal parameters to a T2I model (sketched below)
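As an illustration of what "adding temporal parameters" can look like, here is a minimal PyTorch sketch, a simplification of the general idea rather than the actual Emu Video or prior-work architecture: a per-frame spatial layer standing in for the pretrained T2I model, followed by a newly added temporal self-attention across frames.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Toy example: a per-frame 'spatial' layer (as in a T2I model),
    followed by newly added temporal self-attention over frames."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Pretend this conv comes from the pretrained text-to-image model.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Newly added temporal parameters: attention across the frame axis.
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape

        # Spatial layer is applied to each frame independently.
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)

        # Temporal attention: every spatial location attends across frames.
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)  # (b*h*w, T, C)
        attended, _ = self.temporal_attn(self.norm(seq), self.norm(seq), self.norm(seq))
        seq = seq + attended  # residual keeps the per-frame T2I behaviour as a baseline

        return seq.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

# Quick shape check on a tiny random video tensor.
block = SpatioTemporalBlock(channels=8)
video = torch.randn(1, 4, 8, 16, 16)  # (batch, frames, channels, H, W)
print(block(video).shape)  # torch.Size([1, 4, 8, 16, 16])
```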