November 20, 2025 · Research

Introducing Fish-Speech: A Next-Generation Multilingual TTS

Shijia Liao, Chief Scientist

Takeaways

We introduce Fish-Speech, a state-of-the-art (SoTA), transformer-based, autoregressive multilingual TTS system
We use a novel Dual-AR architecture for stable and natural prosody
The Firefly-GAN vocoder reaches nearly 100% codebook utilization for expressive speech
The model is trained on 720K hours of data and built for real-time AI agents

Technical Paper: https://arxiv.org/abs/2411.01156

Fish-Speech is a new multilingual text-to-speech system that brings LLM reasoning directly into the speech pipeline. Rather than depending on brittle grapheme-to-phoneme (G2P) rules, it uses a language model to interpret the text directly, which makes it far better at handling polyphonic words, mixed-language content, and context-heavy inputs.

Dual-AR Architecture

The system pairs a Slow Transformer, which models high-level linguistic structure, with a Fast Transformer, which fills in acoustic detail. This two-stage process stabilizes generation, improves codebook usage, and avoids the latency of a diffusion stage. With KV caching and other optimizations, Fish-Speech reaches a first-packet latency of around 150 ms, which makes it well suited to interactive agents.
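To make the two-stage idea concrete, here is a toy sketch of a dual autoregressive decoding loop. It is not the Fish-Speech implementation: `slow_step`, `fast_step`, `NUM_CODEBOOKS`, and `CODEBOOK_SIZE` are hypothetical names and values, and the two "models" are simple arithmetic stand-ins for real transformers. It only illustrates the control flow, in which an outer loop advances once per audio frame and an inner loop emits that frame's codebook tokens one at a time.

```python
from typing import List

NUM_CODEBOOKS = 4    # hypothetical: codebook tokens per audio frame
CODEBOOK_SIZE = 16   # hypothetical: entries per codebook


def slow_step(text_tokens: List[int], frames: List[List[int]]) -> int:
    """Stand-in for the Slow Transformer: summarize the text and all frames
    generated so far into one 'hidden state' (here just an integer)."""
    history = sum(sum(frame) for frame in frames)
    return (sum(text_tokens) + history + len(frames)) % 997


def fast_step(hidden: int, codes_so_far: List[int]) -> int:
    """Stand-in for the Fast Transformer: pick the next codebook token of the
    current frame from the hidden state and the tokens already chosen."""
    return (31 * hidden + 7 * len(codes_so_far) + sum(codes_so_far)) % CODEBOOK_SIZE


def generate(text_tokens: List[int], num_frames: int) -> List[List[int]]:
    frames: List[List[int]] = []
    for _ in range(num_frames):          # outer (slow) loop: one step per frame
        hidden = slow_step(text_tokens, frames)
        codes: List[int] = []
        for _ in range(NUM_CODEBOOKS):   # inner (fast) loop: tokens within a frame
            codes.append(fast_step(hidden, codes))
        frames.append(codes)
    return frames


if __name__ == "__main__":
    for frame in generate(text_tokens=[5, 12, 7], num_frames=3):
        print(frame)
```

In a real system each frame could be handed to the vocoder as soon as its inner loop finishes, which is one plausible reason a design like this can start streaming audio early.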

Firefly-GAN Vocoder

At the audio layer, the Firefly-GAN vocoder combines depthwise and dilated convolutions with Grouped Finite Scalar Vector Quantization (GFSQ). This design reaches nearly full codebook utilization and handles emotional and multilingual synthesis efficiently while keeping audio quality high.
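As a rough illustration of grouped scalar quantization, the sketch below splits a latent vector into groups, squashes each channel with `tanh`, rounds it to a small number of levels, and packs each group's digits into one code index. The constants `LEVELS` and `GROUPS` and the function names are assumptions chosen for the example, not values from Fish-Speech, and the real vocoder's quantizer may differ in its details.

```python
import math
from typing import List

LEVELS = 5   # hypothetical: quantization levels per scalar channel (odd)
GROUPS = 2   # hypothetical: number of groups the latent is split into


def quantize_channel(x: float, levels: int = LEVELS) -> int:
    """Bound x with tanh, round to the nearest level, return a digit in [0, levels - 1]."""
    half = (levels - 1) // 2
    return int(round(math.tanh(x) * half)) + half


def quantize(latent: List[float], groups: int = GROUPS) -> List[int]:
    """Return one code index per group, built from that group's channel digits."""
    if len(latent) % groups != 0:
        raise ValueError("latent length must be divisible by the number of groups")
    size = len(latent) // groups
    indices: List[int] = []
    for g in range(groups):
        index = 0
        for x in latent[g * size:(g + 1) * size]:
            index = index * LEVELS + quantize_channel(x)  # mixed-radix packing
        indices.append(index)
    return indices


if __name__ == "__main__":
    print(quantize([0.3, -1.2, 2.5, 0.0]))
```

Because each group's codebook here is an implicit grid of `LEVELS ** size` points rather than a table of learned codewords, every index is reachable by some input. That property is one commonly cited reason scalar quantizers tend to show high codebook utilization.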

Trained at Scale

Fish-Speech was trained on 720,000 hours of multilingual audio across major language families. The balanced dataset helps the model maintain consistent quality across languages, accents, and mixed-language scenarios.

Voice Cloning Quality

The system achieves leading performance in word error rate (WER), speaker similarity, and mean opinion score (MOS), beating strong baselines and even reaching a lower WER than the ground-truth recordings. It preserves timbre, prosody, and speaker identity with high fidelity.
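For readers unfamiliar with the metric, WER for TTS is commonly measured by transcribing the synthesized audio with a speech recognizer and comparing that transcript to the input text, which is why even real recordings can score above zero. The sketch below shows only the final comparison step, using a word-level edit distance. The function name and the lowercase-and-split normalization are assumptions for the example, and the evaluation setup behind the reported numbers may normalize text differently.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the usage line, one substitution ("sat" to "sit") and one deletion ("the") against six reference words give a WER of 2/6.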

Try It

Fish-Speech is open-source at:

GitHub: https://github.com/fishaudio/fish-speech
Demo: https://fish.audio

Shijia Liao is the Founder and Chief Scientist of Fish Audio.
