Gemma 4 12B drops the multimodal encoder entirely

Google has released Gemma 4 12B, an open model that runs multimodal, agentic workloads locally on a laptop with 16GB of memory. Its benchmark scores come close to those of Google's own 26B Mixture-of-Experts model, at less than half the memory footprint. It is also the first mid-sized Gemma to take native audio input.
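As a rough check on the memory figures, the sketch below estimates weight storage alone. The precisions shown are assumptions for illustration, since nothing above states a quantisation level, and the helper `weight_memory_gb` is hypothetical. Activations, KV cache and runtime overhead are ignored, so real usage will be higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


LAPTOP_GB = 16  # memory budget quoted in the article

# Assumed precisions; which one a given runtime uses is not stated in the article.
for bits in (16, 8, 4):
    dense_12b = weight_memory_gb(12, bits)
    moe_26b = weight_memory_gb(26, bits)  # an MoE keeps all experts resident
    fits = "fits" if dense_12b < LAPTOP_GB else "does not fit"
    print(f"{bits:>2}-bit: 12B = {dense_12b:.1f} GB ({fits} in {LAPTOP_GB} GB), "
          f"26B = {moe_26b:.1f} GB, ratio = {dense_12b / moe_26b:.2f}")
```

At equal precision the ratio of weight storage is simply 12/26, which is consistent with the "less than half" claim. Under these assumptions only the quantised variants leave headroom on a 16GB machine.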

The interesting part is how it gets there. Most multimodal models bolt separate vision and audio encoders onto the language model, which adds latency and memory. Gemma 4 12B drops them. The vision encoder is replaced by a lightweight embedding module — a single matrix multiplication with positional embeddings and normalisation — and the audio encoder is removed outright, with the raw audio signal projected straight into the same space as text tokens. The language model backbone does the rest.
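To make the idea concrete, here is a minimal, dependency-free sketch of an encoder-free front end. It is an illustration under stated assumptions, not Gemma's actual implementation. The patch size, frame length, model width, the choice of RMS normalisation, the order of operations and every function name are hypothetical. Only the overall shape comes from the description above: one matrix multiplication plus positional embeddings and normalisation for images, and a direct projection of raw audio into the token space.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W greyscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    return [
        [image[top + dy][left + dx] for dy in range(patch) for dx in range(patch)]
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def project(vectors, weight):
    """Multiply each length-d_in vector by a d_in x d_model weight matrix."""
    d_model = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_model)]
        for v in vectors
    ]


def rms_norm(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (assumed normalisation)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def embed_image(image, patch, weight, pos):
    """Patches -> one matmul -> add positional embeddings -> normalise."""
    projected = project(patchify(image, patch), weight)
    return [
        rms_norm([p + q for p, q in zip(token, pos_row)])
        for token, pos_row in zip(projected, pos)
    ]


def embed_audio(samples, frame, weight):
    """Cut the raw waveform into fixed frames and project each one directly."""
    frames = [samples[i:i + frame] for i in range(0, len(samples) - frame + 1, frame)]
    return project(frames, weight)


def random_matrix(rng, rows, cols):
    """Stand-in for learned weights; real values would come from training."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_MODEL, PATCH, FRAME = 8, 2, 4  # toy sizes, chosen only for readability

image = [[rng.random() for _ in range(4)] for _ in range(4)]  # 4x4 -> 4 patches
image_tokens = embed_image(
    image, PATCH,
    weight=random_matrix(rng, PATCH * PATCH, D_MODEL),
    pos=random_matrix(rng, 4, D_MODEL),
)

samples = [math.sin(0.3 * n) for n in range(16)]  # 16 samples -> 4 frames
audio_tokens = embed_audio(samples, FRAME, random_matrix(rng, FRAME, D_MODEL))

text_tokens = random_matrix(rng, 3, D_MODEL)  # stand-in for 3 text embeddings

# All three modalities now live in the same D_MODEL-wide space, so the
# language model backbone can consume them as one sequence.
sequence = image_tokens + audio_tokens + text_tokens
print(len(sequence), "tokens of width", len(sequence[0]))
```

The point of the sketch is what is absent: there is no stack of vision or audio transformer layers in front of the backbone, only a projection per modality. Any feature extraction a separate encoder would have done has to be learned by the language model itself.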

It ships under an Apache 2.0 licence, runs in LM Studio, Ollama and llama.cpp, and arrives with an official Gemma Skills repository aimed at agents building on the model. Gemma 4 has now crossed 150 million downloads.