November 20, 2025 | Research

Introducing Fish-Speech: A Next-Generation Multilingual TTS


Takeaways

  • We introduce Fish-Speech, a state-of-the-art (SoTA) multilingual TTS system built on autoregressive transformers
  • A novel dual-AR architecture yields stable, natural prosody
  • The Firefly-GAN vocoder reaches nearly 100% codebook utilization for expressive speech
  • Trained on 720K hours of data and built for real-time AI agents

Technical Paper: https://arxiv.org/abs/2411.01156


Fish-Speech is a new multilingual text-to-speech system that brings LLM reasoning directly into the speech pipeline. Instead of depending on fragile grapheme-to-phoneme rules, it uses language models to interpret text directly, which makes it far better at handling polyphonic words (words whose pronunciation depends on context), mixed-language content, and context-heavy inputs.

Dual-AR Architecture

The system uses a Slow Transformer for high-level linguistic structure and a Fast Transformer for acoustic detail. This two-stage process stabilizes generation, improves codebook usage, and avoids the latency of a diffusion stage. With KV-cache and other optimizations, Fish-Speech achieves a first-packet latency of around 150 ms, making it well suited to interactive agents.
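To make the slow/fast split concrete, here is a minimal sketch of a dual autoregressive generation loop. It is not the Fish-Speech implementation: `slow_step` and `fast_step` are hypothetical stand-ins that use simple arithmetic in place of real transformers, and `NUM_CODEBOOKS` and `CODEBOOK_SIZE` are assumed values chosen only for illustration. The point is the control flow: one slow step per audio frame, followed by several fast steps that emit that frame's codebook tokens one at a time.

```python
NUM_CODEBOOKS = 4    # assumed number of codebook tokens per audio frame
CODEBOOK_SIZE = 16   # assumed number of entries in each codebook


def slow_step(text_tokens, frames):
    """Hypothetical stand-in for the Slow Transformer.

    Produces one "hidden state" (here just an integer) per frame from the
    text and the frames generated so far. A real model would attend over
    this history, typically with a KV-cache.
    """
    history = sum(sum(codes) for codes in frames)
    return sum(text_tokens) + 31 * len(frames) + history


def fast_step(hidden, codes_so_far):
    """Hypothetical stand-in for the Fast Transformer.

    Predicts the next codebook token for the current frame, conditioned on
    the slow hidden state and the tokens already emitted for this frame.
    """
    return (hidden + 7 * len(codes_so_far) + sum(codes_so_far)) % CODEBOOK_SIZE


def generate(text_tokens, num_frames):
    """Outer loop over frames (slow), inner loop over codebooks (fast)."""
    frames = []
    for _ in range(num_frames):
        hidden = slow_step(text_tokens, frames)
        codes = []
        for _ in range(NUM_CODEBOOKS):
            codes.append(fast_step(hidden, codes))
        frames.append(codes)
    return frames


if __name__ == "__main__":
    for frame in generate([1, 2, 3], num_frames=3):
        print(frame)
```

In a layout like this, the first frame's tokens are available after a single slow step and a handful of fast steps, which is one plausible reason such a design suits streaming output.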

Firefly-GAN Vocoder

At the audio layer, the Firefly-GAN vocoder combines depthwise and dilated convolutions with grouped scalar vector quantization. This design reaches nearly full codebook utilization and handles emotional and multilingual synthesis efficiently while maintaining high audio quality.
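The quantization idea can be sketched in a few lines. The example below assumes that grouped scalar quantization works roughly like this: a latent vector is split into groups, each dimension is bounded and snapped to a small number of levels, and the per-dimension levels of a group are combined into a single code index. The function names and the values of `LEVELS` and `GROUPS` are illustrative assumptions, not details from Fish-Speech. The `utilization` helper shows how codebook utilization could be measured as the fraction of possible codes that actually occur.

```python
import math

LEVELS = 5   # assumed number of quantization levels per dimension
GROUPS = 2   # assumed number of groups the latent vector is split into


def quantize_scalar(x, levels=LEVELS):
    """Bound x to (-1, 1) with tanh, then snap it to an integer level."""
    half = (levels - 1) / 2
    return int(round(math.tanh(x) * half + half))  # value in [0, levels - 1]


def quantize_grouped(vec, groups=GROUPS, levels=LEVELS):
    """Return one code index per group, using mixed-radix encoding."""
    if len(vec) % groups != 0:
        raise ValueError("vector length must be divisible by the group count")
    size = len(vec) // groups
    indices = []
    for g in range(groups):
        index = 0
        for x in vec[g * size:(g + 1) * size]:
            index = index * levels + quantize_scalar(x, levels)
        indices.append(index)
    return indices


def utilization(indices, codebook_size):
    """Fraction of the codebook that appears at least once in `indices`."""
    return len(set(indices)) / codebook_size


if __name__ == "__main__":
    codes = quantize_grouped([0.0, 0.0, 10.0, -10.0])
    print(codes)
    # Each group holds 2 dimensions with 5 levels each, so 25 possible codes.
    print(utilization(codes, codebook_size=LEVELS ** 2))
```

Because every code index corresponds to a fixed grid point rather than a learned embedding, a scheme like this has no "dead" codebook entries by construction, which is consistent with the high utilization described above.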

Trained at Scale

Fish-Speech was trained on 720,000 hours of multilingual audio across major language families. The balanced dataset helps the model maintain consistent quality across languages, accents, and mixed-language scenarios.

Voice Cloning Quality

The system achieves leading performance in word error rate (WER), speaker similarity, and MOS. It beats strong baselines and even achieves a lower WER than the ground-truth recordings. It preserves timbre, prosody, and identity with high fidelity.
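For readers unfamiliar with the metric, the sketch below computes WER in the standard way: the word-level edit distance (substitutions, deletions, and insertions) between a reference text and a hypothesis, divided by the number of reference words. How the hypothesis is obtained is an assumption here: TTS evaluations typically transcribe the audio with an ASR system, which would also explain why ground-truth recordings have a nonzero WER. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better, and a value of 0 means the transcription matches the reference exactly.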

Try It

Fish-Speech is open-source at:

Shijia Liao


Lengyue is the founder of Fish Audio and a cracked researcher pushing breakthroughs in Voice AI. Follow his work at @lengyuematrix.
