The gap between open-source and commercial voice cloning has closed. Four models released in recent months clone voices from short audio samples and produce speech that genuinely rivals paid alternatives. Fish Audio S2 Pro passed an Audio Turing Test where human listeners identified it as AI only 48.5% of the time, essentially a coin flip. That's not marketing fluff. That's the benchmark.

Each model targets a different use case. OmniVoice supports over 600 languages and can generate voices from text descriptions alone, with no reference audio needed. LongCat-AudioDiT skips the traditional spectrogram conversion step and works directly in waveform latent space; it hit a speaker similarity score of 0.818 on the Seed benchmark, beating the previous state of the art. FireRedTTS-2 handles multi-speaker conversations, switching naturally between up to four speakers over three minutes of dialogue with latency as low as 140ms. Fish Audio S2 Pro offers granular emotional control through 15,000 unique tags, letting you embed instructions like whisper, excited, or angry directly in the text.
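For a sense of what a speaker similarity number like 0.818 measures: such scores are commonly computed as the cosine similarity between a speaker embedding of the reference audio and one of the cloned audio, where 1.0 means the two vectors point in the same direction. The sketch below assumes that definition and uses made-up four-dimensional vectors in place of real embeddings, which would normally come from a speaker verification model and have hundreds of dimensions. The function and variable names are illustrative and not taken from any of the models above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real speaker embeddings (values are made up).
reference_embedding = [0.9, 0.1, 0.3, 0.5]
cloned_embedding = [0.8, 0.2, 0.4, 0.4]

score = cosine_similarity(reference_embedding, cloned_embedding)
print(f"speaker similarity: {score:.3f}")
```

Because the score depends entirely on which embedding model produced the vectors, numbers are only comparable when they come from the same benchmark setup.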

Licenses vary. LongCat uses MIT. FireRedTTS-2 is Apache 2.0. Fish Audio S2 Pro requires a paid license for commercial use. Hardware demands differ too: FireRedTTS-2 weighs 20.9GB. A year ago, open-source TTS had robotic cadence and speaker similarity that fooled nobody. These models don't have that problem.