Overview #
- Why?
- Text is a very natural control mechanism for generation
- Large-scale paired image-text data can be found online (LAION-5B)
- Various biases exist in these datasets
- Can leverage pre-trained large language models
- Image editing applications can be built on top of these models
- CLIP - An image-text model that can be used for zero-shot classification
- It can tell whether an image and a text prompt match, even for complex prompts outside the training data
- CLIP score - A measure of how well the text prompt and image match (a scoring sketch follows this list)
- Cross-attention - Image tokens attend to the text tokens inside the transformer, letting the text guide generation
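To make the CLIP score concrete, here is a minimal sketch using the Hugging Face `transformers` CLIP implementation (the checkpoint name, filename, and prompts are just illustrative):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("sample.png")  # any local image
prompts = ["a photo of a dog", "a photo of a cat"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image is the scaled cosine similarity between the image embedding
# and each text embedding; a higher score means a better match.
print(outputs.logits_per_image.softmax(dim=-1))  # zero-shot class probabilities
```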
Muse #
- The text prompt is encoded by a frozen pre-trained language model (T5-XXL) into a sequence of 4096-dimensional token embeddings
- Cross-attention from the image tokens to the text embeddings is used to guide the generation process
- Uses a VQGAN to tokenize images into discrete tokens (its decoder later turns tokens back into pixels)
- Variable ratio masking
- In training, a randomly sampled fraction of the image tokens is masked out, and the model learns to predict them (sketched below)
- Masked tokens and text tokens are fed into the base transformer model
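A minimal sketch of that masking step; the cosine bias toward heavy masking and the mask-token id are assumptions about how masked-token models are typically trained, not code from the paper:

```python
import torch

MASK_ID = 8192  # hypothetical: one id past the VQGAN codebook

def mask_tokens(tokens: torch.Tensor, mask_id: int = MASK_ID):
    """Variable-ratio masking: hide a randomly sized subset of image tokens.

    tokens: (batch, seq_len) discrete VQGAN token ids.
    Returns the corrupted sequence and a bool mask of hidden positions.
    """
    batch, seq_len = tokens.shape
    # Sample a masking ratio per example; the cosine biases toward heavier
    # masking, which matches the heavily masked grids seen at inference time.
    ratios = torch.cos(torch.rand(batch) * torch.pi / 2)
    num_masked = (ratios * seq_len).long().clamp(min=1)

    # Mask the num_masked[i] positions with the lowest random scores.
    scores = torch.rand(batch, seq_len)
    cutoff = scores.sort(dim=-1).values.gather(1, (num_masked - 1).unsqueeze(1))
    mask = scores <= cutoff
    corrupted = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    return corrupted, mask
```

During training, `corrupted` plus the text embeddings go into the base transformer, and the cross-entropy loss is applied only at the masked positions.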
- When you train a low-res model, you first get the overall scene correct
- Building higher-resolution samples on top of this seems to be what they did
- Muse uses token-based super-resolution (going from 256x256 to 512x512) rather than diffusion super-resolution (pipeline sketched below)
- This leads to better output
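A rough, runnable sketch of that two-stage token pipeline with stand-in components; every function here is a hypothetical placeholder, and the 16x16 / 64x64 grid sizes are assumptions about the tokenizer's downsampling:

```python
import torch

CODEBOOK = 8192  # hypothetical VQGAN codebook size

def base_generate(text_emb):                  # stage 1: text -> low-res token grid
    return torch.randint(0, CODEBOOK, (text_emb.shape[0], 16 * 16))

def superres_generate(low_tokens, text_emb):  # stage 2: low-res tokens + text -> high-res grid
    return torch.randint(0, CODEBOOK, (low_tokens.shape[0], 64 * 64))

def vqgan_decode(tokens):                     # stage 3: tokens -> pixels
    return torch.zeros(tokens.shape[0], 3, 512, 512)

text_emb = torch.zeros(1, 77, 4096)      # a sequence of 4096-dim text embeddings
low = base_generate(text_emb)            # get the scene right at low resolution
high = superres_generate(low, text_emb)  # token-based super-resolution
image = vqgan_decode(high)               # decode to a 512x512 image
```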
- Iterative decoding is crucial for good inference quality
- The base Muse model uses 24 decoding steps
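Those 24 steps look roughly like MaskGIT-style confidence-based unmasking. A minimal sketch under assumed interfaces (the model signature, mask id, cosine schedule, and greedy argmax are all simplifications; a real system samples rather than taking the argmax):

```python
import math
import torch

def iterative_decode(model, text_emb, seq_len=256, steps=24, mask_id=8192):
    """Parallel iterative decoding: start fully masked, and at each step
    commit the highest-confidence predictions while re-masking the rest.
    Assumes model(tokens, text_emb) -> logits of shape (B, seq_len, vocab)."""
    b = text_emb.shape[0]
    tokens = torch.full((b, seq_len), mask_id, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = model(tokens, text_emb)   # predicts ALL positions in parallel
        confidence, prediction = logits.softmax(-1).max(-1)

        still_masked = tokens == mask_id
        prediction = torch.where(still_masked, prediction, tokens)
        # Tokens committed in earlier steps are never re-masked.
        confidence = confidence.masked_fill(~still_masked, float("inf"))

        # Cosine schedule: how many positions remain masked after this step.
        n_masked = int(seq_len * math.cos(math.pi / 2 * step / steps))
        if n_masked == 0:
            return prediction              # final step: everything committed
        # Re-mask the n_masked lowest-confidence positions.
        remask = confidence.topk(n_masked, largest=False).indices
        tokens = prediction.scatter(1, remask, mask_id)
    return tokens
```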
- Whenever the count of something in the prompt goes beyond 6 or 7, the model starts making mistakes
- The hypothesis is that the training data lacks captions with counts greater than 6 or 7
- It can render text well as long as there isn't more than one or two words
- Evals for image models are hard because there is no reliable automated way to judge output quality
- Sometimes you just have to look at the images and see if they look good
- This is an open problem
- The model seems to generate random scenes even if the prompt is nonsense
- Whatever tokens it samples still get fed through the decoder, so you always get an image out
- Muse generates around 16 candidate images, and then you pick the best one (a reranking heuristic is sketched below)
- There is no self-correcting way to pick the best image automatically
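One common heuristic, not a real fix for the problem above, is to rerank the candidates by the CLIP score from the overview; the checkpoint and filenames here are illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "an oil painting of a lighthouse"                       # illustrative
candidates = [Image.open(f"sample_{i}.png") for i in range(16)]  # hypothetical files

inputs = processor(text=[prompt], images=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_text.squeeze(0)  # (16,) match scores

best = candidates[scores.argmax().item()]  # highest CLIP score, not guaranteed best
```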
- Text-guided inpainting
- You can highlight a region of the image and ask the model to edit just that part
- Outpainting is the opposite of inpainting: generating content outside the original image boundary
- Muse can do this zero-shot
- This is an example of AI image editing
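A sketch of how zero-shot inpainting falls out of the masked-token setup; all interfaces are assumptions, and a single refill pass is shown where the real system would iterate as in normal decoding:

```python
import torch

MASK_ID = 8192  # hypothetical mask token id

def inpaint_tokens(image_tokens, region_mask, text_emb, model):
    """image_tokens: (B, N) VQGAN ids for the existing image.
    region_mask:  (B, N) bool, True over the user-highlighted region.
    Masks that region and lets the model re-fill it, conditioned on the text."""
    tokens = image_tokens.masked_fill(region_mask, MASK_ID)
    logits = model(tokens, text_emb)        # (B, N, vocab)
    refill = logits.argmax(-1)
    # Keep the original tokens outside the region, take predictions inside it,
    # so the edit blends with the untouched surroundings.
    return torch.where(region_mask, refill, image_tokens)
```

Outpainting is the same trick with `region_mask` covering a padded border around the original canvas.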
- Because Muse is fast, you can do interactive editing
- The reason Muse is fast is that it does parallel decoding: many tokens are predicted at every step, so only ~24 steps are needed
- In diffusion models, each denoising timestep has to run sequentially, and there are typically many more of them
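A back-of-the-envelope comparison of forward passes per image (only the 24 comes from these notes; the diffusion count is an illustrative assumption):

```python
muse_steps = 24        # parallel decoding: every pass predicts many tokens at once
diffusion_steps = 250  # illustrative: one full network pass per denoising timestep

print(f"~{diffusion_steps / muse_steps:.0f}x fewer passes than the diffusion sampler")
```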