Overview #

Training #

  1. Image Pretraining
    • The initial model is based on an image diffusion model, Stable Diffusion
      • In other words, the weights are initialized from a pretrained Stable Diffusion checkpoint
  2. Video Pretraining
    • They take subsets of the LVD dataset and use human preference studies to identify the best-performing subsets
    • Curating a high-quality dataset at this stage is critical for the final model's performance
  3. High-Quality Finetuning
    • Use a dataset of 250k pre-captioned, high-fidelity video clips
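
A rough, self-contained sketch of this staged schedule is below; the model, loss, and tensor shapes are toy placeholders, not the paper's actual training setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder "video model"; in practice this would be a latent video
# diffusion UNet initialized from Stable Diffusion's image weights (stage 1).
model = nn.Conv3d(3, 3, kernel_size=3, padding=1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def denoising_step(clip: torch.Tensor) -> None:
    # Simplified denoising objective: predict the noise added to the clip.
    noise = torch.randn_like(clip)
    loss = F.mse_loss(model(clip + noise), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: video pretraining on the curated LVD subset
# (random stand-ins for 14-frame clips).
for clip in [torch.randn(1, 3, 14, 32, 48)]:
    denoising_step(clip)

# Stage 3: high-quality finetuning on the ~250k pre-captioned clips.
for clip in [torch.randn(1, 3, 14, 40, 72)]:
    denoising_step(clip)
```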

Data Curation #

Architecture #

  1. Initial base: Stable Diffusion
  2. Add temporal layers and train on LVD-F
  3. Video pretraining on LVD-F with 14 frames at a resolution of 256x384
  4. Finetune the model to generate 14 frames at 320x576

These four steps produce the base model for video generation, which can then be finetuned for specific tasks.
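
A minimal sketch of what "add temporal layers" might look like, assuming alternating spatial and temporal self-attention; the `SpatioTemporalBlock` class, shapes, and layer choices below are illustrative assumptions, not SVD's actual implementation:

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Spatial attention: would be initialized from the image model.
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Temporal attention: newly inserted, trained during video pretraining.
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape

        # Spatial attention over the H*W positions within each frame.
        xs = x.reshape(b * t, c, h * w).transpose(1, 2)  # (b*t, h*w, c)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]

        # Temporal attention over the T frames at each spatial position.
        xt = xs.reshape(b, t, h * w, c).transpose(1, 2).reshape(b * h * w, t, c)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]

        # Restore the original (b, t, c, h, w) layout.
        xt = xt.reshape(b, h * w, t, c).transpose(1, 2)  # (b, t, h*w, c)
        return xt.transpose(2, 3).reshape(b, t, c, h, w)

# Toy usage: a batch of 14-frame clips at a small spatial size.
block = SpatioTemporalBlock(channels=64)
video = torch.randn(2, 14, 64, 32, 48)
print(block(video).shape)  # torch.Size([2, 14, 64, 32, 48])
```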

Finetuning the Base Model #

Limitations #