Overview
- Vision language models can process both images and natural language text
- A vision language model generally consists of:
  - An image encoder
  - A text encoder
  - A way to fuse the two encodings together
- The latest models predominantly use transformer-based image and text encoders, which learn image and text features either separately or jointly
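
The three components listed above can be sketched in a few lines of plain Python. This is a toy illustration only, not how any particular model is implemented: the encoders here are simple linear projections and averaged embeddings standing in for transformers, fusion is plain concatenation (one of several possible strategies), and every name (`make_projection`, `encode_image`, `encode_text`, `fuse`, `EMBED_DIM`) is hypothetical.

```python
import math
import random

EMBED_DIM = 4  # assumed embedding size, chosen only for illustration


def make_projection(in_dim, out_dim, seed):
    """Return a random weight matrix with out_dim rows and in_dim columns."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Matrix-vector product using plain lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode_image(pixels, weights):
    """Toy image encoder: flatten the pixel grid, then project it linearly.

    A real model would use a transformer or CNN here.
    """
    flat = [p for row in pixels for p in row]
    return project(weights, flat)


def encode_text(tokens, vocab, embeddings):
    """Toy text encoder: average the embeddings of the input tokens.

    A real model would use a transformer here.
    """
    vectors = [embeddings[vocab[t]] for t in tokens]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]


def fuse(image_vec, text_vec):
    """Toy fusion: concatenate the two encodings into one joint vector."""
    return image_vec + text_vec


# A 2x2 grayscale "image" and a three-word vocabulary, both made up.
image = [[0.0, 0.5], [1.0, 0.25]]
vocab = {"a": 0, "cat": 1, "photo": 2}

image_weights = make_projection(in_dim=4, out_dim=EMBED_DIM, seed=0)
text_embeddings = make_projection(in_dim=EMBED_DIM, out_dim=len(vocab), seed=1)

image_vec = encode_image(image, image_weights)
text_vec = encode_text(["a", "cat"], vocab, text_embeddings)
fused = fuse(image_vec, text_vec)

print(len(image_vec), len(text_vec), len(fused))
```

Concatenation keeps the two encodings separate inside the joint vector. Models that learn image and text features jointly typically mix them more deeply, for example by letting one modality attend to the other.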