Google DeepMind released Gemini Embedding 2 on March 10, 2026, the company's first architectural move from text-only to fully multimodal embeddings. The model, co-announced by Product Manager Min Choi and Distinguished Engineer Tom Duerig, maps text, images, video, audio, and documents into a single unified embedding space, enabling cross-modal retrieval and classification across diverse media types in a single API request. It supports over 100 languages and accepts interleaved multimodal inputs, meaning developers can pass combinations like image-plus-text together rather than processing each modality separately. The model is available in public preview via the Gemini API and Vertex AI.
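A unified embedding space means a text query and an image caption land in the same vector space, so cross-modal retrieval reduces to nearest-neighbor search by cosine similarity. A minimal sketch of that retrieval step, using toy NumPy vectors in place of real API responses (the `rank_by_similarity` helper and the three-dimensional vectors are illustrative, not part of any Gemini SDK):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query_vec, candidate_vecs):
    # Return candidate indices ordered best match first.
    scores = [cosine_sim(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)

# Toy stand-ins for embeddings of a text query, an image, and an audio clip.
query = np.array([0.9, 0.1, 0.0])
candidates = [
    np.array([0.8, 0.2, 0.1]),  # image embedding, close to the query
    np.array([0.0, 0.1, 0.9]),  # audio embedding, far from the query
]
print(rank_by_similarity(query, candidates))  # → [0, 1]
```

In a real pipeline the vectors would come from the embedding API and the linear scan would be replaced by an approximate nearest-neighbor index, but the comparison itself is this simple because all modalities share one space.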
On the technical side, Gemini Embedding 2 incorporates Matryoshka Representation Learning (MRL), which lets developers truncate the default 3072-dimensional output to smaller sizes, trading embedding size against retrieval quality depending on their storage and latency constraints. Google recommends 3072, 1536, or 768 dimensions for the highest-quality results. The model integrates directly with LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, and ChromaDB, covering the major LLM frameworks and vector databases used in <a href="/news/2026-03-14-captain-yc-w26-launches-automated-rag-platform-for-enterprise-ai-agents">RAG pipelines</a>, semantic search, and data clustering.
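MRL-trained models are optimized so that prefixes of the full vector are themselves usable embeddings, so the standard pattern is to keep the first k dimensions and re-normalize before computing cosine similarity. A minimal sketch of that truncation step (the 3072/1536/768 sizes come from the announcement; the `truncate_mrl` helper and the random stand-in vector are assumptions for illustration):

```python
import numpy as np

def truncate_mrl(embedding, dims):
    # Keep the first `dims` components of an MRL embedding and
    # re-normalize so cosine similarity stays meaningful.
    clipped = np.asarray(embedding, dtype=np.float64)[:dims]
    norm = np.linalg.norm(clipped)
    return clipped / norm if norm > 0 else clipped

# Random data stands in for a real 3072-dim API response.
rng = np.random.default_rng(0)
full = rng.normal(size=3072)

for dims in (3072, 1536, 768):  # the sizes Google recommends
    v = truncate_mrl(full, dims)
    print(dims, v.shape)
```

Halving the dimension halves vector-store footprint and roughly halves distance-computation cost, which is why the flexibility matters for large RAG corpora.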
Duerig's involvement signals the strategic weight Google is placing on this release. His research career traces a direct line to this product: he co-authored ALIGN in 2021, the dual-encoder vision-language model trained on over a billion noisy image-text pairs that established key principles behind modern multimodal contrastive embeddings, and was a named author on the 2025 Gemini Embedding text-only technical paper. That a Distinguished Engineer, a title held by only a small fraction of Google's engineering population, personally co-authored the public launch blog post rather than delegating to a product team is an indicator of organizational priority, not routine release cadence.
Community reaction has been substantive but not uniformly enthusiastic. A recurring point of comparison is Alibaba's Qwen multimodal embedding model, an open-weight alternative that can be run locally and offers prompt-steerable embeddings — a meaningful differentiator for developers prioritizing data privacy or cost control over API convenience. Critics also flagged the absence of pricing details in the initial announcement and limited context window support for non-text modalities as gaps that could slow enterprise adoption. Google has not announced pricing or a general availability date. Until it does, enterprise teams evaluating multimodal retrieval pipelines have reason to keep Qwen on the shortlist.