
Nomic Embed Vision models integrate with Nomic Embed Text models by sharing the same latent space, creating a unified embedding space. The vision encoder is aligned to the existing Nomic Embed Text latent space, yielding a multimodal latent space that performs well on both image and text tasks. This alignment enables seamless multimodal work across matching versions of the models, and it means existing Nomic Embed Text embeddings can be used directly to query embeddings produced by the new Nomic Embed Vision models.
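
To make the shared latent space concrete, here is a minimal sketch of embedding an image and a text query and comparing them with cosine similarity. It assumes the Hugging Face checkpoints nomic-ai/nomic-embed-vision-v1.5 and nomic-ai/nomic-embed-text-v1.5 and the pooling conventions described on their model cards (CLS token for vision, mean pooling with a "search_query:" prefix for text); the image path is a placeholder.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

# Vision tower: CLS-token embedding, L2-normalized.
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True
)
image = Image.open("example.jpg")  # placeholder local image
with torch.no_grad():
    vision_out = vision_model(**processor(image, return_tensors="pt"))
image_emb = F.normalize(vision_out.last_hidden_state[:, 0], p=2, dim=1)

# Text tower: mean-pooled token embeddings, L2-normalized, using the
# "search_query:" prefix recommended for Nomic Embed Text queries.
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")
text_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True
)
encoded = tokenizer(
    ["search_query: a photo of a dog"],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    token_emb = text_model(**encoded).last_hidden_state
mask = encoded["attention_mask"].unsqueeze(-1).float()
text_emb = F.normalize((token_emb * mask).sum(1) / mask.sum(1), p=2, dim=1)

# Because both towers share one latent space, the dot product of the
# normalized vectors is a meaningful cross-modal similarity score.
print((text_emb @ image_emb.T).item())
```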

Nomic Embed Vision v1.5 shows slight improvements over Nomic Embed Vision v1 across benchmarks. For example, Nomic Embed Vision v1 scored 70.70 on ImageNet 0-shot, 56.7 on Datacomp Avg., and 62.39 on MTEB Avg., and v1.5 posts slightly better numbers on these tests.

The primary capability of Nomic Embed Vision v1 and v1.5 is to provide high-quality, fully replicable vision embeddings that integrate seamlessly with the existing Nomic Embed Text v1 and v1.5 models, forming a unified embedding space that improves performance on both multimodal and text tasks. The models are designed to address a limitation of existing multimodal models such as CLIP, which underperform on tasks outside image retrieval. By aligning a vision encoder with the existing Nomic Embed Text latent space, Nomic Embed Vision offers a unified multimodal latent space that excels at both image and text tasks, outperforming competitors such as OpenAI CLIP and OpenAI Text Embedding 3 Small on benchmarks including ImageNet 0-shot, MTEB, and Datacomp.
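
The unified space is what makes text-to-image retrieval straightforward: a text query embedded with Nomic Embed Text can rank images embedded with Nomic Embed Vision. Below is an illustrative sketch using the hosted Nomic API via the `nomic` Python client; the exact client surface may differ across versions, and the image file names are placeholders.

```python
import numpy as np
from nomic import embed

image_paths = ["cat.jpg", "street.jpg", "mountain.jpg"]  # placeholder files
img_out = embed.image(images=image_paths, model="nomic-embed-vision-v1.5")
txt_out = embed.text(
    texts=["a snowy mountain peak"],
    model="nomic-embed-text-v1.5",
    task_type="search_query",
)

image_vecs = np.array(img_out["embeddings"])    # shape: (num_images, dim)
query_vec = np.array(txt_out["embeddings"][0])  # shape: (dim,)

# Rank images by cosine similarity to the text query: normalize both sides
# so the dot product equals cosine similarity in the shared space.
image_vecs /= np.linalg.norm(image_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)
scores = image_vecs @ query_vec
for path, score in sorted(zip(image_paths, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```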