Google DeepMind on Thursday released Gemini Embedding 2, a model that does something none of its predecessors could: take text, images, video, audio, and documents and map them all into the same embedding space in a single pass.

The model is available now in public preview through both the Gemini API and Vertex AI. On the input side, it handles up to 8,192 text tokens, six images, two minutes of video, audio without requiring transcription, and PDFs up to six pages — covering the messy mix of formats that real-world knowledge bases tend to accumulate. It supports over 100 languages.

The more technically interesting aspect is the use of Matryoshka Representation Learning, which lets teams shrink the output from the default 3,072 dimensions down to 1,536 or 768 without retraining the model. Smaller vectors mean lower storage costs and faster retrieval, a real consideration when running search over millions of documents. Because MRL training concentrates the most important information in the leading dimensions, truncating the vector degrades retrieval quality gracefully rather than arbitrarily, though smaller sizes do still sacrifice some accuracy.
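In practical terms, shrinking an MRL-trained embedding is just slicing off the tail and re-normalizing. A minimal sketch (function name and placeholder data are mine, not part of any Google SDK):

```python
import numpy as np

def shrink_embedding(vec, dim):
    """Keep the first `dim` dimensions of an MRL-trained embedding and
    re-normalize, so cosine similarities stay comparable at the smaller size."""
    head = np.asarray(vec, dtype=np.float64)[:dim]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head

# Placeholder stand-in for a full 3,072-dim model output, reduced to 768 dims.
full = np.random.default_rng(0).normal(size=3072)
small = shrink_embedding(full, 768)
```

The re-normalization step matters: without it, truncated vectors have smaller norms and cosine scores computed via dot products against full-length vectors would be skewed.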

The model also processes interleaved inputs — an image and a caption submitted together in one request, for example — rather than handling each modality in isolation. That's a meaningful distinction for teams building retrieval pipelines over content where the relationship between media types carries meaning.

What this changes for agentic systems is the elimination of a routing problem. A RAG pipeline handling mixed media has typically required separate embedding models per modality, plus logic to merge or reconcile results across them. Gemini Embedding 2 collapses that into a single API call. The model integrates out of the box with LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, and ChromaDB.
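The collapse of the routing problem can be sketched with a toy in-memory store (placeholder vectors stand in for real model output): because every modality lands in the same space, one cosine-similarity search covers PDF, image, and video entries alike, with no per-modality indexes and no result merging.

```python
import numpy as np

class UnifiedStore:
    """Toy single-space vector store: embeddings from any modality go into
    the same index, so retrieval needs no routing or cross-modal merging."""
    def __init__(self):
        self.ids, self.vecs = [], []

    def add(self, doc_id, vec):
        v = np.asarray(vec, dtype=np.float64)
        self.ids.append(doc_id)
        self.vecs.append(v / np.linalg.norm(v))  # store unit vectors

    def search(self, query_vec, k=3):
        q = np.asarray(query_vec, dtype=np.float64)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vecs) @ q           # cosine similarity to all docs
        top = np.argsort(sims)[::-1][:k]         # best-scoring k entries
        return [(self.ids[i], float(sims[i])) for i in top]

# Placeholder embeddings standing in for one model's mixed-media output.
rng = np.random.default_rng(1)
store = UnifiedStore()
for doc_id in ["report.pdf", "diagram.png", "demo.mp4"]:
    store.add(doc_id, rng.normal(size=768))
hits = store.search(rng.normal(size=768), k=2)
```

The contrast with the status quo is that a per-modality pipeline would maintain three such stores, embed the query three times with three models, and then need a policy for combining three incompatible score distributions.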

Google isn't alone here; Cohere and others have been working the multimodal embedding space for some time. But attaching this to the Gemini API and Vertex AI distribution gives it a different reach. How well it retrieves in practice, particularly at scale over genuinely mixed-media corpora, is the open question worth answering before committing it to production.