Google's Gemma 4 models now run directly on iPhones: full offline inference, no cloud calls. Download the Google AI Edge Gallery app from the App Store, pick a model variant, and start generating. It's available now.

The model family spans from a lightweight 2B parameter version up to a 31B variant that benchmarks competitively with Qwen 3.5's 27B model. Google's own app defaults to the E2B variant, which makes sense: smaller models run faster on mobile hardware, where memory and thermal constraints bite.
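To see why the smaller variants are the practical choice, a back-of-the-envelope memory calculation helps. This is an illustrative sketch, not Gemma's actual memory profile: it counts weight storage only and ignores the KV cache, activations, and runtime overhead, all of which add to the real footprint.

```python
# Rough weight-memory math for running an LLM on a phone.
# Illustrative only: ignores KV cache, activations, and runtime overhead.

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB for a given quantization level."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 2B model: FP16 vs 4-bit quantization.
print(f"2B FP16:  {weight_memory_gb(2, 16):.1f} GB")   # 4.0 GB
print(f"2B 4-bit: {weight_memory_gb(2, 4):.1f} GB")    # 1.0 GB

# Even at 4 bits, a 31B model needs ~15.5 GB of weights alone,
# well beyond current iPhone RAM.
print(f"31B 4-bit: {weight_memory_gb(31, 4):.1f} GB")
```

One gigabyte of weights fits comfortably alongside iOS and other apps; four gigabytes is already tight on most current devices, which is why the app defaults to the smallest variant.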

The technical catch: inference routes through the iPhone's GPU via Metal, not Apple's Neural Engine. The ANE is more power-efficient for matrix operations. But it's locked to CoreML and optimized for specific data types like INT8. Running Gemma's FP16 and custom 4-bit quantizations through the ANE would require complex, lossy conversion. So the GPU path trades battery life for compatibility and the ability to run unmodified model weights. Some developers call this a limitation. It's pragmatic.
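The "lossy conversion" point is easy to demonstrate. Below is a minimal sketch of symmetric per-tensor INT8 quantization, the general technique behind ANE-friendly data types; it is not CoreML's actual conversion pipeline, just an illustration of where the precision goes.

```python
# Why forcing FP16 weights into INT8 is lossy: one scale factor maps
# floats onto 255 integer levels, so anything finer than that step
# size gets rounded away. Pure-Python illustration only.

def quantize_int8(values):
    """Symmetric per-tensor quantization: return (int8 values, scale)."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(ints, scale):
    return [q * scale for q in ints]

weights = [0.013, -0.872, 0.004, 0.391, -0.0007]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

# The largest weight sets the scale, so small weights lose precision:
# values below scale/2 collapse to zero entirely.
errors = [abs(a - b) for a, b in zip(weights, recovered)]
print(f"step size: {scale:.5f}, max error: {max(errors):.5f}")
```

Custom 4-bit schemes shrink this error by quantizing in small groups with per-group scales, which is exactly the kind of format the ANE's fixed data-type support can't express and the GPU path can run as-is.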

Developers are already shipping real tools on top of it. A project called Pucky uses Gemma 4 as an offline "vibe coder" that generates and edits TypeScript code on-device. The developer notes the 4B model works but the 2B variant runs better on current iPhone hardware due to memory limits. That gap closes as devices get more RAM.

The AI Edge Gallery app does more than just run models. It handles image recognition, voice interaction, and ships an extensible Skills framework for building on top. Google is treating this as a platform, not a demo.

For the agent space, local inference changes what's buildable. Field work, healthcare, any environment where data privacy rules out cloud processing or connectivity isn't guaranteed. If you can run a capable model on a phone without network access, you can build agents that actually operate in those conditions. The battery drain is real but solvable. Google shipping this at all tells you where they're headed.