Introducing Fish-Speech: A Next-Generation Multilingual TTS

Oct 14, 2025

Lengyue · Research

Takeaways

  • We introduce Fish-Speech, a state-of-the-art (SoTA), transformer-based autoregressive multilingual TTS system
  • A novel dual autoregressive (dual-AR) architecture gives stable generation and natural prosody
  • The Firefly-GAN vocoder reaches near-100% codebook utilization for expressive speech
  • Trained on 720K hours of data and built for real-time AI agents

Technical Paper: https://arxiv.org/abs/2411.01156


Fish-Speech is a new multilingual text-to-speech system that brings LLM reasoning directly into the speech pipeline. Instead of depending on fragile grapheme-to-phoneme rules, it uses a language model to read the raw text natively. This helps most where rule-based front ends tend to break: polyphonic words (words whose pronunciation depends on context), mixed-language content, and context-heavy inputs.

Dual-AR Architecture

The system uses a Slow Transformer for high-level linguistic structure and a Fast Transformer for acoustic detail. This two-stage process stabilizes generation and improves codebook usage, and because no diffusion stage is involved, it avoids the latency that diffusion sampling would add. With KV-cache and other optimizations, Fish-Speech can respond with a first-packet latency of around 150 ms, which makes it well suited to interactive agents.
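The two-stage loop described above can be sketched roughly as follows. This is a toy illustration, not the Fish-Speech implementation: `toy_slow_step` and `toy_fast_step` are hypothetical stand-ins for the two transformers, and `NUM_CODEBOOKS` and `CODEBOOK_SIZE` are assumed values chosen only to keep the example small. What it shows is the control flow: one slow step per audio frame, followed by a short inner autoregressive loop that emits that frame's codebook tokens.

```python
from typing import Callable, List

NUM_CODEBOOKS = 4   # assumed for illustration, not the real configuration
CODEBOOK_SIZE = 16  # assumed for illustration, not the real configuration


def toy_slow_step(text_tokens: List[int], frames: List[List[int]]) -> int:
    """Hypothetical stand-in for the Slow Transformer.

    Condenses the text and all previously generated frames into a single
    integer that plays the role of a per-frame hidden state.
    """
    history = sum(sum(frame) for frame in frames)
    return (sum(text_tokens) + 7 * len(frames) + history) % 1009


def toy_fast_step(hidden: int, codes_so_far: List[int]) -> int:
    """Hypothetical stand-in for the Fast Transformer.

    Predicts the next codebook token of the current frame from the slow
    hidden state and the tokens already emitted for this frame.
    """
    return (hidden + 31 * len(codes_so_far) + sum(codes_so_far)) % CODEBOOK_SIZE


def generate(
    text_tokens: List[int],
    num_frames: int,
    slow_step: Callable[[List[int], List[List[int]]], int] = toy_slow_step,
    fast_step: Callable[[int, List[int]], int] = toy_fast_step,
) -> List[List[int]]:
    """Dual-AR control flow: outer loop over frames, inner loop over codebooks."""
    frames: List[List[int]] = []
    for _ in range(num_frames):
        hidden = slow_step(text_tokens, frames)      # one slow step per frame
        codes: List[int] = []
        for _ in range(NUM_CODEBOOKS):               # short inner AR loop
            codes.append(fast_step(hidden, codes))
        frames.append(codes)                         # a vocoder would decode this frame
    return frames


if __name__ == "__main__":
    for frame in generate([5, 12, 9], num_frames=3):
        print(frame)
```

In a streaming setting, each completed frame could be handed to the vocoder as soon as the inner loop finishes, which is one plausible reason a design like this keeps first-packet latency low.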

Firefly-GAN Vocoder

At the audio layer, the Firefly-GAN vocoder combines depthwise and dilated convolutions with grouped finite scalar vector quantization (GFSQ). This design reaches nearly full codebook utilization, so very few codes go unused, and it handles emotional and multilingual synthesis efficiently while keeping audio quality high.
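To make "grouped scalar quantization" and "codebook utilization" concrete, here is a minimal forward-only sketch. It is not the Firefly-GAN code: the function names, the group count, and the number of levels are assumptions for illustration, and a trainable version would also need a gradient estimator that is omitted here. Each group of scalars is squashed into a bounded range, snapped to a small number of levels, and combined into one integer code; utilization is the fraction of possible codes that actually occur.

```python
import math
import random
from typing import List, Sequence


def fsq_index(values: Sequence[float], levels: int) -> int:
    """Quantize each scalar to one of `levels` bins and pack the bins into one index."""
    index = 0
    for v in values:
        bounded = math.tanh(v)                                   # squash into [-1, 1]
        bin_id = min(levels - 1, int((bounded + 1) / 2 * levels))
        index = index * levels + bin_id                          # mixed-radix packing
    return index


def grouped_fsq(vector: Sequence[float], num_groups: int, levels: int) -> List[int]:
    """Split `vector` into equal groups and quantize each group independently."""
    if len(vector) % num_groups != 0:
        raise ValueError("vector length must be divisible by num_groups")
    size = len(vector) // num_groups
    return [fsq_index(vector[g * size:(g + 1) * size], levels) for g in range(num_groups)]


def codebook_utilization(codes: Sequence[int], codebook_size: int) -> float:
    """Fraction of the codebook that appears at least once in `codes`."""
    return len(set(codes)) / codebook_size


if __name__ == "__main__":
    rng = random.Random(0)
    levels, dims_per_group, num_groups = 4, 2, 2     # assumed toy settings
    used: List[int] = []
    for _ in range(2000):
        vec = [rng.gauss(0.0, 1.0) for _ in range(dims_per_group * num_groups)]
        used.extend(grouped_fsq(vec, num_groups, levels))
    print(codebook_utilization(used, levels ** dims_per_group))
```

Because every code is defined by a fixed grid rather than by learned centroids, there are no codebook entries that can drift away from the data and die, which is the usual intuition for why scalar-quantization schemes tend toward high utilization.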

Trained at Scale

Fish-Speech was trained on 720,000 hours of multilingual audio across major language families. The balanced dataset helps the model maintain consistent quality across languages, accents, and mixed-language scenarios.

Voice Cloning Quality

The system achieves leading performance in word error rate (WER), speaker similarity, and mean opinion score (MOS). It beats strong baselines and even scores a lower WER than the ground-truth recordings. It preserves timbre, prosody, and speaker identity with high fidelity.
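For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below implements that standard definition; the function name and the simple lowercase-and-split normalization are assumptions for illustration, and the evaluation behind the numbers above may normalize text differently. In TTS evaluation the hypothesis typically comes from running a speech recognizer on the audio, which is how synthesized speech can score better than real recordings that contain noise or disfluencies.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1          # reference word missing from hypothesis
            insertion = curr[j - 1] + 1     # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```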

Try It

Fish-Speech is open-source at:
