Introducing Fish-Speech: A Next-Generation Multilingual TTS
Oct 14, 2025

Takeaways
- We introduce Fish-Speech, a state-of-the-art transformer-based autoregressive multilingual TTS system
- We use a novel dual autoregressive (dual-AR) architecture for stable and natural prosody
- Firefly-GAN vocoder with near 100% codebook utilization for expressive speech
- Trained on 720K hours of data and built for real-time AI agents
Technical Paper: https://arxiv.org/abs/2411.01156
Fish-Speech is a new multilingual text-to-speech system that brings LLM reasoning directly into the speech pipeline. Instead of depending on fragile grapheme-to-phoneme rules, it uses a language model to understand text natively, making it far better at polyphonic words, mixed-language content, and context-heavy inputs.
Dual-AR Architecture
The system pairs a Slow Transformer, which models high-level linguistic structure, with a Fast Transformer that fills in acoustic detail. This two-stage process stabilizes generation, improves codebook usage, and avoids the latency of diffusion-based decoders. With KV caching and other inference optimizations, Fish-Speech reaches roughly 150 ms first-packet latency, making it well suited to interactive agents.
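To make the division of labor concrete, here is a minimal sketch of a dual-AR decode step in PyTorch: a Slow Transformer runs autoregressively over frame-level tokens, and a Fast Transformer consumes each frame's hidden state to predict that frame's acoustic codebook entries. The module names, sizes, and greedy decoding are illustrative assumptions, not the released Fish-Speech implementation (which additionally uses KV caching and other inference optimizations).

```python
import torch
import torch.nn as nn

class SlowTransformer(nn.Module):
    """Autoregressive model over frame-level semantic tokens."""
    def __init__(self, vocab=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                      # tokens: (B, T)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h), h                      # next-token logits, hidden states

class FastTransformer(nn.Module):
    """Predicts the acoustic codebook entries for one frame, conditioned on
    the Slow Transformer's hidden state for that frame."""
    def __init__(self, vocab=1024, d_model=256, n_layers=2, n_heads=4, n_codebooks=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)
        self.n_codebooks = n_codebooks

    def forward(self, slow_hidden):                 # slow_hidden: (B, d_model)
        seq = slow_hidden.unsqueeze(1)              # start from the frame's hidden state
        codes = []
        for _ in range(self.n_codebooks):           # decode codes within the frame
            logits = self.head(self.encoder(seq)[:, -1])
            code = logits.argmax(dim=-1)            # greedy pick, for clarity
            codes.append(code)
            seq = torch.cat([seq, self.embed(code).unsqueeze(1)], dim=1)
        return torch.stack(codes, dim=1)            # (B, n_codebooks)

slow, fast = SlowTransformer(), FastTransformer()
text_tokens = torch.randint(0, 1024, (1, 16))       # stand-in for tokenized text
_, hidden = slow(text_tokens)
frame_codes = fast(hidden[:, -1])                    # acoustic codes for the newest frame
print(frame_codes.shape)                             # torch.Size([1, 8])
```

In a streaming setup, each frame's codes would be handed to the vocoder as soon as the Fast Transformer emits them, which is what makes low first-packet latency possible.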
Firefly-GAN Vocoder
At the audio layer, the Firefly-GAN vocoder combines depthwise and dilated convolutions with grouped finite scalar vector quantization. This design reaches nearly full codebook utilization and handles emotional and multilingual synthesis efficiently while keeping audio quality high.
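The rough idea behind grouped scalar quantization can be shown in a few lines: each channel is bounded and snapped to a small set of uniform levels, and channels are split into groups so that every group acts as an implicit codebook. The level count, group count, and tensor shapes below are assumptions for illustration, not the Firefly-GAN configuration.

```python
import torch

def grouped_scalar_quantize(z, n_groups=4, levels=7):
    """Snap each channel to one of `levels` uniform values in [-1, 1];
    channels are split into groups, each acting as an implicit codebook
    with levels ** (channels_per_group) entries."""
    B, C, T = z.shape
    assert C % n_groups == 0
    z = torch.tanh(z)                                # bound channels to (-1, 1)
    half = (levels - 1) / 2
    z_int = torch.round(z * half)                    # nearest grid point, as an integer
    z_q = z_int / half                               # back onto the [-1, 1] grid
    z_q = z + (z_q - z).detach()                     # straight-through estimator for training
    # Integer indices per group, handy for measuring codebook utilization.
    idx = (z_int + half).long().view(B, n_groups, C // n_groups, T)
    return z_q, idx

features = torch.randn(2, 16, 50)                    # (batch, channels, frames)
quantized, indices = grouped_scalar_quantize(features)
print(quantized.shape, indices.shape)                # (2, 16, 50) and (2, 4, 4, 50)
```

Because every grid point is reachable by construction rather than learned, this family of quantizers tends to fill its codebook far more evenly than a classic learned VQ codebook, which is the intuition behind the near-100% utilization figure.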
Trained at Scale
Fish-Speech was trained on 720,000 hours of multilingual audio across major language families. The balanced dataset helps the model maintain consistent quality across languages, accents, and mixed-language scenarios.
Voice Cloning Quality
The system achieves leading results in word error rate (WER), speaker similarity, and MOS, outperforming strong baselines and even reaching a lower WER than the ground-truth recordings. It preserves timbre, prosody, and speaker identity with high fidelity.
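For context on how numbers like these are typically computed, the sketch below implements WER as a word-level edit distance and speaker similarity as cosine similarity between speaker embeddings. The embedding extractor is a stand-in; the paper's exact evaluation stack may differ.

```python
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)                 # cost of deleting every ref word
    d[0, :] = np.arange(len(hyp) + 1)                 # cost of inserting every hyp word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(ref), len(hyp)] / max(len(ref), 1)

def speaker_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, e.g. from a
    pretrained speaker-verification model."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```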
Try It
Fish-Speech is open source and available at:
- GitHub: https://github.com/fishaudio/fish-speech
- Demo: https://fish.audio