Overview #
- Paper Link
- This paper focuses on enabling vision models to perform localization, i.e. to tie their text output to specific image locations
- The model can output coordinates alongside its text, essentially allowing it to point to what it describes
- This densely aligns each output word to a pixel location
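
To make the idea of dense word-to-pixel alignment concrete, here is a minimal sketch of what such an output could look like as data. It assumes the model emits one normalized `(x, y)` point per output word; the `LocalizedWord` type, the `align_words_to_pixels` helper, and the example caption and points are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalizedWord:
    """One output word paired with a pixel location (hypothetical type)."""
    word: str
    x: int  # pixel column
    y: int  # pixel row


def align_words_to_pixels(words, norm_points, width, height):
    """Pair each word with a pixel location.

    Assumption: norm_points holds one (x, y) pair per word, each
    normalized to [0, 1] relative to the image size.
    """
    if len(words) != len(norm_points):
        raise ValueError("need exactly one point per word")
    aligned = []
    for word, (nx, ny) in zip(words, norm_points):
        # Scale to pixel indices and clamp to the image bounds.
        px = min(width - 1, max(0, int(nx * width)))
        py = min(height - 1, max(0, int(ny * height)))
        aligned.append(LocalizedWord(word, px, py))
    return aligned


# Made-up caption and points, purely for illustration.
caption = ["a", "dog", "on", "grass"]
points = [(0.5, 0.25), (0.5, 0.5), (0.5, 0.75), (0.25, 0.875)]
trace = align_words_to_pixels(caption, points, width=640, height=480)

for item in trace:
    print(f"{item.word!r} -> ({item.x}, {item.y})")
```

Every word carries its own location, so the caption as a whole traces a path over the image rather than producing a single box for the full sentence.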
Implementation #
- The input image and an optional prompt are encoded into the text embedding space
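
The encoding step can be sketched roughly as follows. This is a toy illustration under several assumptions that the notes do not confirm: image features are mapped into the text embedding space by a single linear projection, prompt tokens use an embedding-table lookup, and image tokens are placed before prompt tokens. The dimensions, the random weights, and the names `project` and `build_prefix` are hypothetical.

```python
import random

EMBED_DIM = 8  # hypothetical width of the text embedding space
IMAGE_DIM = 4  # hypothetical width of the image encoder's features


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix standing in for learned weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(features, weight):
    """Multiply each feature vector (in_dim) by weight (in_dim x out_dim)."""
    return [
        [sum(f * w for f, w in zip(vec, col)) for col in zip(*weight)]
        for vec in features
    ]


def build_prefix(image_features, prompt_ids, image_proj, token_table):
    """Return one sequence of vectors that all live in the text embedding space."""
    image_tokens = project(image_features, image_proj)
    # The prompt is optional: an empty id list adds no prompt tokens.
    prompt_tokens = [token_table[i] for i in prompt_ids]
    return image_tokens + prompt_tokens


rng = random.Random(0)
image_proj = make_matrix(IMAGE_DIM, EMBED_DIM, rng)
token_table = make_matrix(16, EMBED_DIM, rng)     # toy 16-token vocabulary
image_features = make_matrix(6, IMAGE_DIM, rng)   # stand-in for 6 patch features

with_prompt = build_prefix(image_features, [3, 7], image_proj, token_table)
without_prompt = build_prefix(image_features, [], image_proj, token_table)
print(len(with_prompt), len(without_prompt))
```

Because both inputs end up as vectors of the same width, a language model can consume them as one sequence regardless of whether a prompt was supplied.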