Overview

Muse

  1. A text prompt is provided and encoded by a text encoder into embeddings of dimension 4096
  2. Cross-attention from the text embeddings to the image tokens is used to guide the generation process
  3. A VQGAN model is used to produce the discrete image tokens (and to decode tokens back into an image)
  4. Masked image tokens and text embeddings are fed into the base transformer model
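The cross-attention step above can be sketched as follows. This is a toy single-head version with random (untrained) projection matrices, not Muse's actual weights; the token counts and the image-embedding dimension (256) are illustrative assumptions, while the 4096 text-embedding dimension matches the note above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_embeddings, d_k=64, seed=0):
    # Image-token queries attend to text-embedding keys/values.
    # Projections are random stand-ins for learned weights.
    rng = np.random.default_rng(seed)
    d_img = image_tokens.shape[-1]
    d_txt = text_embeddings.shape[-1]
    Wq = rng.standard_normal((d_img, d_k)) / np.sqrt(d_img)
    Wk = rng.standard_normal((d_txt, d_k)) / np.sqrt(d_txt)
    Wv = rng.standard_normal((d_txt, d_k)) / np.sqrt(d_txt)
    Q = image_tokens @ Wq                      # (n_img, d_k)
    K = text_embeddings @ Wk                   # (n_txt, d_k)
    V = text_embeddings @ Wv                   # (n_txt, d_k)
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each image token's attention over the text
    return weights @ V, weights

rng = np.random.default_rng(1)
text_embeddings = rng.standard_normal((8, 4096))  # 8 text tokens, dim 4096
image_tokens = rng.standard_normal((16, 256))     # 16 image-token embeddings (hypothetical dim)
out, weights = cross_attention(image_tokens, text_embeddings)
```

Each row of `weights` is a distribution over the text tokens, so every image-token update is a text-conditioned mixture of the value vectors; this is how the prompt steers which tokens the model predicts for the masked positions.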