
Overview

Vision language models can process both images and natural language text.
A vision language model generally consists of:
- An image encoder
- A text encoder
- A way to fuse the two encodings together
Most recent models use transformer-based image and text encoders, which learn image and text features either separately or jointly.
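
The three components listed above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and does not describe any particular model: the names (`encode_image`, `encode_text`, `fuse_concat`, `fuse_similarity`), the sizes, and the random, untrained weights are all hypothetical. A single linear projection and a mean of token embeddings stand in for the transformer encoders a real model would use. The sketch shows two common ways the encodings might be combined: concatenation into one joint vector, and a similarity score between the two vectors.

```python
import math
import random

# All sizes below are hypothetical and chosen only to keep the example small.
EMBED_DIM = 8     # width of the shared embedding space
NUM_PIXELS = 16   # toy "image": a flat list of 16 grayscale values
VOCAB_SIZE = 32   # toy vocabulary size for hashed tokens

# Fixed seed so the (untrained) weights are reproducible between runs.
_rng = random.Random(0)
IMAGE_PROJ = [[_rng.uniform(-1, 1) for _ in range(NUM_PIXELS)] for _ in range(EMBED_DIM)]
TOKEN_TABLE = [[_rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def encode_image(pixels):
    """Image encoder stand-in: one linear projection of the pixel values."""
    if len(pixels) != NUM_PIXELS:
        raise ValueError(f"expected {NUM_PIXELS} pixel values, got {len(pixels)}")
    projected = [sum(w * p for w, p in zip(row, pixels)) for row in IMAGE_PROJ]
    return l2_normalize(projected)


def encode_text(text):
    """Text encoder stand-in: mean of per-token embeddings."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    # Deterministic toy tokenizer: map each word to an id via its character codes.
    ids = [sum(ord(c) for c in tok) % VOCAB_SIZE for tok in tokens]
    pooled = [sum(TOKEN_TABLE[i][d] for i in ids) / len(ids) for d in range(EMBED_DIM)]
    return l2_normalize(pooled)


def fuse_concat(image_vec, text_vec):
    """Fusion option 1: concatenate both encodings into one joint vector."""
    return image_vec + text_vec


def fuse_similarity(image_vec, text_vec):
    """Fusion option 2: cosine similarity of the two unit-length encodings."""
    return sum(a * b for a, b in zip(image_vec, text_vec))


if __name__ == "__main__":
    image = [i / (NUM_PIXELS - 1) for i in range(NUM_PIXELS)]  # gradient "image"
    image_vec = encode_image(image)
    text_vec = encode_text("a cat sitting on a mat")
    print("joint vector length:", len(fuse_concat(image_vec, text_vec)))
    print("image-text similarity:", fuse_similarity(image_vec, text_vec))
```

Because the weights are random, the similarity score carries no meaning here. In a trained model the encoders would be optimized so that matching image and text pairs score higher than mismatched ones, or so that the joint vector is useful to a downstream decoder.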