Hume AI on March 10, 2026 open-sourced TADA (Text-Acoustic Dual Alignment), a new LLM-based text-to-speech architecture that addresses a core inefficiency plaguing current TTS systems: the token-count mismatch between text and audio. Conventional LLM-based TTS relies on fixed-frame-rate acoustic tokenization, generating 12.5 to 75 audio tokens per second against a much shorter text sequence. TADA eliminates this imbalance by aligning one continuous acoustic feature vector directly to each text token, creating a synchronized one-to-one stream through the language model. Each autoregressive step corresponds to exactly one text token, with a flow-matching head conditioned on the LLM's final hidden state generating the corresponding acoustic features. The approach is described in an accompanying arXiv preprint (2602.23068), authored by Sharath Rao, Mori Liu, and colleagues at Hume AI.
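The announcement does not include pseudocode, but the per-step mechanism can be sketched in toy form. Everything below is illustrative: the dimensions, the `tanh`/linear stand-ins for the Llama backbone and the flow-matching head, and all names are assumptions for exposition, not TADA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real models use a Llama 3.2 backbone.
HIDDEN, ACOUSTIC = 16, 8

W_lm = rng.normal(size=(HIDDEN, HIDDEN))      # stand-in for the LLM transformer
W_flow = rng.normal(size=(HIDDEN, ACOUSTIC))  # stand-in for the flow-matching head

def generate(text_embeddings):
    """Emit exactly one acoustic feature vector per text token (a 1:1 stream)."""
    acoustic_stream = []
    state = np.zeros(HIDDEN)
    for emb in text_embeddings:                   # one autoregressive step per token
        state = np.tanh(W_lm @ (state + emb))     # final hidden state for this token
        acoustic_stream.append(W_flow.T @ state)  # head conditioned on that state
    return np.stack(acoustic_stream)

tokens = rng.normal(size=(5, HIDDEN))  # embeddings for 5 text tokens
features = generate(tokens)            # shape (5, ACOUSTIC): one frame per token
```

The structural point is the loop itself: the model cannot emit more or fewer frames than it has text tokens, which is the constraint Hume credits for the hallucination results.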

The numbers bear out the architectural trade-off. TADA achieves a real-time factor of 0.09, more than five times faster than comparable LLM-based TTS systems, because it operates at only 2 to 3 audio frames per second rather than the tens of frames per second typical of other approaches. Context efficiency follows the same pattern: where a conventional system exhausts a 2,048-token context window in roughly 70 seconds of audio, TADA fits approximately 700 seconds into the same budget. In hallucination testing across more than 1,000 LibriTTS-R samples, TADA produced zero instances with a character error rate above 0.15, a result Hume attributes to the strict structural constraint that prevents the model from skipping or inserting content. Human evaluation on the EARS expressive speech dataset placed TADA second overall, with scores of 4.18 out of 5.0 for speaker similarity and 3.78 out of 5.0 for naturalness.
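The context arithmetic follows directly from the frame rates. A quick back-of-the-envelope check of the article's figures (the ~29 frames/s mid-range value is an assumption picked from the stated 12.5 to 75 range):

```python
CONTEXT = 2048  # token budget from the article's example

def seconds_of_audio(frames_per_second, context=CONTEXT):
    """Seconds of audio that fit before the context window is exhausted."""
    return context / frames_per_second

# A mid-range conventional tokenizer at ~29 frames/s exhausts
# the window in roughly 70 seconds:
conventional = seconds_of_audio(29.0)  # ~70.6 s

# TADA at ~3 frames/s fits roughly ten times more audio:
tada = seconds_of_audio(3.0)           # ~682.7 s, i.e. the ~700 s figure
```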

The release includes two models — TADA-1B, an English model based on Llama 3.2 1B, and TADA-3B-ML, a multilingual variant based on Llama 3.2 3B covering seven additional languages — along with a shared audio tokenizer and decoder under an MIT license, hosted on Hugging Face. Hume also introduces Speech Free Guidance (SFG), a technique that blends logits from text-only and text-speech inference modes to reduce the quality drop that occurs when generating language alongside speech. The lightweight footprint makes on-device and edge deployment viable, which Hume highlights as a key use case for device manufacturers and privacy-sensitive applications in healthcare, finance, and education.
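SFG is described only at a high level, but "blending logits from two inference modes" is structurally similar to classifier-free guidance. The sketch below is one plausible form under that assumption; the `alpha` parameter, the function name, and the linear blend are guesses, not the formulation from the paper:

```python
import numpy as np

def sfg_blend(logits_joint, logits_text_only, alpha=0.3):
    """Hypothetical CFG-style blend for Speech Free Guidance.

    Pulls the joint text-speech logits toward the text-only distribution
    to recover language quality lost in the joint mode. The blend form
    and alpha are illustrative assumptions.
    """
    return (1.0 - alpha) * logits_joint + alpha * logits_text_only

joint = np.array([1.0, 2.0, 3.0])      # logits with speech conditioning
text_only = np.array([3.0, 2.0, 1.0])  # logits from text-only inference
blended = sfg_blend(joint, text_only, alpha=0.5)  # midpoint of the two
```

With `alpha=0` the model behaves as in plain joint generation; larger values trade speech-conditioned behavior for text-only language quality.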

Community reception on Hacker News was measured. While the speed and hallucination-free claims drew interest, listeners noted audio artifacts in the demo clips — including a lisp in the Anger Speech sample and vocal fry in the Long Speech example — and at least one commenter questioned whether the one-to-one alignment amounts to a novel tokenization scheme or a form of concatenation without compression. Hume itself acknowledges two unresolved limitations: speaker drift during long-form generation and a modality gap that degrades language quality when text and speech are generated together. For Hume — known primarily for its empathic AI research and the Octave voice product — the open release is a concrete step into speech-language modeling infrastructure, a field where ElevenLabs, Cartesia, and others are already shipping production systems.