
The first generation of vision-language models, exemplified by CLIP and ALIGN, expanded on large-scale classification pretraining by utilizing web-scale data without requiring extensive human labeling. The second generation, akin to T5 in language modeling, unified captioning and question-answering tasks through generative encoder-decoder modeling.

PaliGemma integrates the SigLIP and Gemma models through a linear projection layer that maps image tokens to the dimensionality of the text tokens, allowing the two sequences to be concatenated. The SigLIP So400m image encoder (a shape-optimized ViT) processes input images into a sequence of tokens, while the Gemma-2B v1.0 decoder-only language model handles text using its SentencePiece tokenizer. This combination lets PaliGemma handle a range of tasks, including image classification, captioning, and visual question answering, through a flexible image+text in, text out API.
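
To make the fusion step concrete, here is a minimal sketch of how projected image tokens and embedded text tokens can be concatenated into a single prefix for the decoder. The module name, the 1152/2048 widths, and the 256-token image sequence are illustrative assumptions standing in for the SigLIP output and Gemma embedding dimensions, not the reference implementation.

```python
# Sketch of PaliGemma-style token fusion (assumed shapes and names).
import torch
import torch.nn as nn

vision_dim, text_dim = 1152, 2048  # assumed widths for So400m / Gemma-2B


class MultimodalProjector(nn.Module):
    """Projects image tokens into the language model's embedding space."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, num_image_tokens, vision_dim)
        return self.proj(image_tokens)


# Toy tensors standing in for SigLIP outputs and embedded prompt tokens.
image_tokens = torch.randn(1, 256, vision_dim)  # encoder output
text_embeds = torch.randn(1, 16, text_dim)      # embedded text prompt

projector = MultimodalProjector(vision_dim, text_dim)
prefix = torch.cat([projector(image_tokens), text_embeds], dim=1)
print(prefix.shape)  # (1, 272, 2048): one fused sequence fed to the decoder
```

Because the projection only aligns dimensions, the language model sees image and text tokens as one uniform sequence, which is what makes the image+text in, text out interface possible.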

PaliGemma is a powerful open vision-language model that can handle tasks such as image and short-video captioning, visual question answering, text recognition in images, object detection, and segmentation. It can also be fine-tuned for downstream tasks such as caption generation and referring expression segmentation.
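
As a usage illustration of the image+text in, text out interface, the sketch below assumes the Hugging Face `transformers` integration of PaliGemma; the checkpoint name, image URL, and task-prefixed prompt are illustrative choices rather than prescribed values.

```python
# Hedged usage sketch: question answering over an image with a
# PaliGemma checkpoint via Hugging Face transformers (assumed setup).
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed mixed-task checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; substitute any local or remote image.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "answer en What is in this image?"  # task prefix + question

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output[0], skip_special_tokens=True))
```

Swapping the prompt prefix (for example to a captioning or detection instruction) changes the task without changing the model or the calling code, which is the practical payoff of the unified text-out formulation.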