Voice Mode for Gemini CLI via Gemini Live API

A developer known as kstonekuan has published an open-source voice extension for Google's Gemini CLI that brings real-time speech-to-text input to the terminal. It is distributed via npm as @kstonekuan/gemini-voice and can be installed with a single gemini extensions install command. The project ships two components: a standalone gemini-voice CLI tool featuring a live audio waveform display, and a Gemini CLI extension that adds a /voice command so users can speak their prompts instead of typing them. The author explicitly frames it as "voice mode for Claude Code, but for Gemini CLI", a direct nod to Anthropic's native voice rollout on March 3, 2026, which the developer cites as inspiration.

The technical implementation combines a native Rust addon, which uses the cpal audio library for microphone capture, with a TypeScript layer that streams 16 kHz PCM audio over WebSocket to the Gemini Live API. Notably, the Gemini Live API is a speech-to-speech conversational API, but kstonekuan repurposes it purely for its transcription and server-side voice activity detection (VAD) capabilities, discarding the model's audio responses entirely. As a result, the client needs no local VAD logic: the server detects when speech ends and returns incremental inputTranscription messages, and the tool exits automatically once transcription is complete. Pre-built native binaries are included, so end users do not need a Rust toolchain.
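To make the streaming path described above more concrete, here is a minimal TypeScript sketch of the client-side logic, with the network and microphone left out. It is not taken from the gemini-voice source. The helper names (floatToPcm16, buildAudioMessage, TranscriptCollector) are hypothetical, the JSON field names (realtimeInput, serverContent, inputTranscription, turnComplete) reflect one reading of the public Live API message format and should be checked against the current documentation, and treating turnComplete as the stop signal is an assumption about how the tool decides to exit.

```typescript
// Illustrative sketch only. Message shapes are assumptions based on the
// public Gemini Live API docs, not on the gemini-voice source code.

/** Convert float samples in [-1, 1] to 16-bit little-endian PCM bytes. */
function floatToPcm16(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Scale asymmetrically: -1 maps to -32768, +1 maps to 32767.
    const int16 = Math.round(clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff);
    view.setInt16(i * 2, int16, true); // true = little-endian
  }
  return out;
}

/** Wrap one PCM chunk in a JSON message suitable for sending over the socket. */
function buildAudioMessage(pcm: Uint8Array): string {
  // Build a binary string so btoa can base64-encode the raw bytes.
  let binary = "";
  for (let i = 0; i < pcm.length; i++) {
    binary += String.fromCharCode(pcm[i]);
  }
  return JSON.stringify({
    realtimeInput: {
      audio: { data: btoa(binary), mimeType: "audio/pcm;rate=16000" },
    },
  });
}

/** Subset of the server message shape that this sketch cares about (assumed). */
interface ServerMessage {
  serverContent?: {
    inputTranscription?: { text?: string };
    modelTurn?: unknown; // the model's spoken reply, deliberately ignored
    turnComplete?: boolean;
  };
}

/** Accumulates incremental transcription fragments until the turn ends. */
class TranscriptCollector {
  private parts: string[] = [];
  done = false;

  handle(raw: string): void {
    const msg = JSON.parse(raw) as ServerMessage;
    const content = msg.serverContent;
    if (!content) return;
    const text = content.inputTranscription?.text;
    if (text) this.parts.push(text);
    // Assumption: the end of the turn is the signal to stop listening.
    if (content.turnComplete) this.done = true;
  }

  transcript(): string {
    return this.parts.join("").trim();
  }
}
```

In a real client, each microphone buffer would pass through floatToPcm16 and buildAudioMessage before being written to the WebSocket, and every incoming frame would be handed to TranscriptCollector.handle. Once done becomes true, the caller could print transcript() and close the connection, which mirrors the automatic exit described above.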

The project's current limitations stem from constraints in Gemini CLI's extension system rather than the voice tool itself. Because Gemini CLI's subprocess syntax suppresses live output from extensions, the interactive waveform UI is unavailable during /voice usage, and there is no push-to-talk hotkey support. The author is candid that a first-class voice experience requires native integration into Gemini CLI itself, and describes this extension as a deliberate stepping stone toward that goal. An open feature request on the Gemini CLI repository (Issue #6929) already tracks a proposed bidirectional /talk voice mode with a pluggable MCP-based backend.

CLI voice tooling has generally taken one of two implementation paths: record-then-transcribe, as Aider does via OpenAI Whisper, or real-time streaming with server-side VAD, as this project and VoiceMode MCP both do. VoiceMode MCP, which supports Claude Code, Cursor, Windsurf, Zed, and VS Code, had accumulated 894 GitHub stars as of March 2026 and is the most mature cross-tool implementation available. In each case, community projects have arrived before any native equivalent, and Claude Code's March 2026 rollout now raises expectations for what Gemini CLI might eventually offer directly.