Let's Learn - AI Deep Learning Diffusion Models

Search Articles

Overview #

Goals
- Stable when training
- Efficient
- High quality
- Easy to generate new samples out of training data
Diffusion models
- Text-to-image
- Takes an embedding model that converts text to a vector and then uses a diffusion model to generate an image

In VAEs the task is to generate an image in one-shot from the latent space
- Sample from the latent space and decode to get the output
Rather than one-shot prediction, diffusion models use iterative prediction
- Start from a random image and repeatedly refine it and reduce noise
When we start from a completely random state, this allows us to generate new samples that are fresh
Forward noising
- Data-to-noise
- Start with training data (e.g. example images)
- Progressively add increasing amounts of noise to the data
  - This corrupts the data until it is pure noise
The neural network is trained to reverse this (noise-to-data)
- Reverse Denoising is the process of recovering the data from the noise

Forward Noising Steps

Given an image go to random noise
Progressively add more noise
- Starting at the first timestamp T_0, progressively add more and more noise until you get to 100% noise

Reverse Denoising Steps

Given an image, can we learn to estimate the image at T-1 (the previous timestamp)
- This is the reverse denoising step
- This is the reverse of the forward noising step
- All the two steps differ by is the noise function
  - The noise applied is different at each timestamp determined by a variance schedule

When building this models you define a set number of timesteps
- The more timesteps you have better resolution, but the more expensive it is to train

We can use diffusion models for molecular design
- This can generate models in 3D
Diffusion models can be used to generate proteins
- Can we design new proteins with new biological or therapeutic properties?

Latent diffusion models operate on the latent space representation of an image rather than the raw pixels

The process is as follows:

Encode input into a latent space representation
Sample from the latent space representation
Go through the diffusion process on this latent space representation to genearte a high-quality image

Since we are operating on the latent space rather than raw pixels, any input (e.g. text) can be encoded into the latent space for generation

This is how models like stable diffusion can generate text-to-image
Because the latent space is a Gaussian distribution because of the regularization loss term (see Generative Modeling) these still allow for random sampling from the latent space to generate images not dependent on input