Introducing Fish-Speech: A Next-Generation Multilingual TTS
Oct 14, 2025

Takeaways
- We introduce Fish-Speech, a state-of-the-art transformer-based autoregressive multilingual TTS system
- We use a novel dual autoregressive (dual-AR) architecture for stable and natural prosody
- Firefly-GAN vocoder with near 100% codebook utilization for expressive speech
- Trained on 720K hours of data and built for real-time AI agents
Technical Paper: https://arxiv.org/abs/2411.01156
Fish-Speech is a new multilingual text-to-speech system that brings LLM reasoning directly into the speech pipeline. Instead of depending on fragile grapheme-to-phoneme rules, it uses a language model to understand text natively, making it far better at polyphonic words, mixed-language content, and context-heavy inputs.
Dual-AR Architecture
The system pairs a Slow Transformer, which models high-level linguistic structure, with a Fast Transformer that fills in acoustic detail. This two-stage process stabilizes generation, improves codebook usage, and avoids the latency of diffusion-based decoders. With KV caching and other inference optimizations, Fish-Speech reaches roughly 150 ms first-packet latency, making it well suited to interactive agents.
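To make the division of labor concrete, here is a minimal sketch of a dual-AR decode step in PyTorch: a Slow Transformer runs autoregressively over frame-level tokens, and a Fast Transformer consumes each frame's hidden state to predict that frame's acoustic codebook entries. The module names, sizes, and greedy decoding are illustrative assumptions, not the released Fish-Speech implementation (which additionally uses KV caching and other inference optimizations).

```python
import torch
import torch.nn as nn

class SlowTransformer(nn.Module):
    """Autoregressive model over frame-level semantic tokens."""
    def __init__(self, vocab=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                      # tokens: (B, T)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h), h                      # next-token logits, hidden states

class FastTransformer(nn.Module):
    """Predicts the acoustic codebook entries for one frame, conditioned on
    the Slow Transformer's hidden state for that frame."""
    def __init__(self, vocab=1024, d_model=256, n_layers=2, n_heads=4, n_codebooks=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)
        self.n_codebooks = n_codebooks

    def forward(self, slow_hidden):                 # slow_hidden: (B, d_model)
        seq = slow_hidden.unsqueeze(1)              # start from the frame's hidden state
        codes = []
        for _ in range(self.n_codebooks):           # decode codes within the frame
            logits = self.head(self.encoder(seq)[:, -1])
            code = logits.argmax(dim=-1)            # greedy pick, for clarity
            codes.append(code)
            seq = torch.cat([seq, self.embed(code).unsqueeze(1)], dim=1)
        return torch.stack(codes, dim=1)            # (B, n_codebooks)

slow, fast = SlowTransformer(), FastTransformer()
text_tokens = torch.randint(0, 1024, (1, 16))       # stand-in for tokenized text
_, hidden = slow(text_tokens)
frame_codes = fast(hidden[:, -1])                    # acoustic codes for the newest frame
print(frame_codes.shape)                             # torch.Size([1, 8])
```

In a streaming setup, each frame's codes would be handed to the vocoder as soon as the Fast Transformer emits them, which is what makes low first-packet latency possible.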
Firefly-GAN Vocoder
At the audio layer, the Firefly-GAN vocoder combines depthwise and dilated convolutions with grouped finite scalar vector quantization. This design reaches nearly full codebook utilization and handles emotional and multilingual synthesis efficiently while keeping audio quality high.
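The rough idea behind grouped scalar quantization can be shown in a few lines: each channel is bounded and snapped to a small set of uniform levels, and channels are split into groups so that every group acts as an implicit codebook. The level count, group count, and tensor shapes below are assumptions for illustration, not the Firefly-GAN configuration.

```python
import torch

def grouped_scalar_quantize(z, n_groups=4, levels=7):
    """Snap each channel to one of `levels` uniform values in [-1, 1];
    channels are split into groups, each acting as an implicit codebook
    with levels ** (channels_per_group) entries."""
    B, C, T = z.shape
    assert C % n_groups == 0
    z = torch.tanh(z)                                # bound channels to (-1, 1)
    half = (levels - 1) / 2
    z_int = torch.round(z * half)                    # nearest grid point, as an integer
    z_q = z_int / half                               # back onto the [-1, 1] grid
    z_q = z + (z_q - z).detach()                     # straight-through estimator for training
    # Integer indices per group, handy for measuring codebook utilization.
    idx = (z_int + half).long().view(B, n_groups, C // n_groups, T)
    return z_q, idx

features = torch.randn(2, 16, 50)                    # (batch, channels, frames)
quantized, indices = grouped_scalar_quantize(features)
print(quantized.shape, indices.shape)                # (2, 16, 50) and (2, 4, 4, 50)
```

Because every grid point is reachable by construction rather than learned, this family of quantizers tends to fill its codebook far more evenly than a classic learned VQ codebook, which is the intuition behind the near-100% utilization figure.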
Trained at Scale
Fish-Speech was trained on 720,000 hours of multilingual audio across major language families. The balanced dataset helps the model maintain consistent quality across languages, accents, and mixed-language scenarios.
Voice Cloning Quality
The system achieves leading results in word error rate (WER), speaker similarity, and MOS, outperforming strong baselines and even reaching a lower WER than the ground-truth recordings. It preserves timbre, prosody, and speaker identity with high fidelity.
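For context on how numbers like these are typically computed, the sketch below implements WER as a word-level edit distance and speaker similarity as cosine similarity between speaker embeddings. The embedding extractor is a stand-in; the paper's exact evaluation stack may differ.

```python
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)                 # cost of deleting every ref word
    d[0, :] = np.arange(len(hyp) + 1)                 # cost of inserting every hyp word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(ref), len(hyp)] / max(len(ref), 1)

def speaker_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, e.g. from a
    pretrained speaker-verification model."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```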
Try It
Fish-Speech is open source and available at:
- GitHub: https://github.com/fishaudio/fish-speech
- Demo: https://fish.audio