November 20, 2025 | Research

Launching Fish Audio S1: A Frontier Text-to-Speech Audio Foundation Model

Zhizhuo Zhou, ML Researcher

Takeaways

We launch Fish Audio S1, a frontier text-to-speech audio foundation model.
Fish Audio S1 is trained on over 2 million hours of audio with online RLHF (GRPO).
Fish Audio S1 achieves 0.8% WER and 0.4% CER on Seed TTS Eval.
S1 supports open-domain emotion, tone, and special effect markers.

Try S1 Now

Try the model for free at Fish Audio: https://fish.audio/app/text-to-speech/

Hugging Face model page: https://huggingface.co/fishaudio/openaudio-s1-mini

Fish Audio S1

S1 comes in two variants:

S1 (4B) – full-featured flagship model, available on the Fish Audio Playground
S1-mini (0.5B) – a distilled version for resource-constrained environments, available on Hugging Face

Both models are trained with online RLHF (GRPO) using in-house reward models.
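The post does not describe the training loop, so the following is only a minimal sketch of the group-relative advantage step commonly associated with GRPO, not Fish Audio's implementation. It assumes several candidate utterances are sampled for the same text and scored by a reward model. The function name and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four candidate utterances of one text.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group average receive a positive advantage and are reinforced. Samples below it are discouraged. No separate value network is needed for this step.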

State-of-the-Art Voice Quality

OpenAudio S1 is trained on over 2 million hours of audio, combining large-scale text–audio pairs with rich supervision. By jointly modeling semantic and acoustic information in a single model, S1 avoids the information loss typical of “semantic-only” pipelines and reduces artifacts and word errors.

On Seed TTS Eval (with GPT-4o-based transcription and pyannote-based speaker metrics), S1 achieves:

WER: 0.008
CER: 0.004

S1-mini follows closely with:

WER: 0.011
CER: 0.005
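For readers unfamiliar with these metrics, here is a minimal sketch of how word error rate and character error rate are typically computed from a reference text and a transcription of the synthesized audio. It is an illustration only: the exact text normalization used by Seed TTS Eval is not described in this post, and the function names are our own.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the quick brown fox"
hypothesis = "the quick brown box"
example_wer = wer(reference, hypothesis)  # 1 wrong word out of 4 -> 0.25
example_cer = cer(reference, hypothesis)  # 1 wrong character out of 19
```

A WER of 0.008 therefore corresponds to roughly 8 word errors per 1,000 reference words.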

OpenAudio S1 also reaches the top Elo score on the Hugging Face TTS-Arena-V2, ranking #1 in human subjective evaluation for naturalness, intelligibility, and similarity.

Voice-Actor-Level Control

Fish Audio S1 enables fine-grained control over emotion and delivery. We trained our own speech-to-text model (to be released soon) to caption audio with emotion, tone, speaker tags, and events, then used it to annotate over 100k hours of audio for instruction-following.

You can guide S1 with emotion markers like (angry), (sad), (in a hurry), (chuckling), and more. Check out the full list of recommended emotion tags here: https://docs.fish.audio/developer-guide/core-features/emotions
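As a rough illustration, the sketch below builds a prompt with the markers mentioned above. The helper `with_marker` is hypothetical, and placing the marker at the start of each sentence is an assumption. The linked documentation is the authority on supported tags and placement.

```python
def with_marker(marker, text):
    """Prefix text with a parenthesized marker, e.g. '(angry) Leave me alone.'"""
    marker = marker.strip().strip("()")
    if not marker:
        raise ValueError("marker must not be empty")
    return f"({marker}) {text.strip()}"


# Markers taken from the examples in this post; the sentences are made up.
lines = [
    with_marker("angry", "I told you not to touch that!"),
    with_marker("in a hurry", "We need to leave right now."),
    with_marker("chuckling", "Well, that went better than expected."),
]
prompt = " ".join(lines)
```

The resulting `prompt` string is what you would paste into the playground or send as the text of a synthesis request.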

Global, Multilingual Voices

OpenAudio S1 is designed for global reach. It supports a wide range of languages, including:

English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese

You can mix languages in the same prompt, and the model will adapt naturally to the script and context.

Architecture, Speed, and Cost

Under the hood, OpenAudio S1:

Uses the Qwen3 architecture as a multimodal backbone
Employs an in-house audio codec similar in spirit to Descript Audio Codec, trained from scratch
Uses online RLHF with GRPO to optimize for human preferences

With torch.compile and optimized inference, S1 runs at a real-time factor of about 1:7 on an NVIDIA RTX 4090, making it practical for interactive applications.
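To get a feel for that figure, here is a back-of-the-envelope sketch. It assumes that 1:7 means about one second of compute per seven seconds of generated audio. The post does not spell this out, so treat that reading as an assumption. The function name is our own.

```python
def synthesis_seconds(audio_seconds, compute_per_audio_second=1 / 7):
    """Estimate compute time for a clip, assuming a fixed real-time factor.

    compute_per_audio_second = 1/7 encodes the assumed reading of "1:7":
    one second of compute yields seven seconds of audio.
    """
    return audio_seconds * compute_per_audio_second


one_minute_clip = synthesis_seconds(60)   # about 8.6 seconds of compute
one_hour_clip = synthesis_seconds(3600)   # about 514 seconds, under 9 minutes
```

Real latency also depends on batch size, prompt length, and warm-up cost from compilation, none of which this estimate models.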

On the pricing side, S1 is designed to be truly accessible:

Around $15 per million bytes, or roughly $0.80 per hour of audio

This makes high-quality TTS viable even for high-volume or budget-sensitive workloads.
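The two prices quoted above can be related with a little arithmetic. The sketch below assumes that "bytes" means the UTF-8 byte length of the input text, which the post does not state explicitly. The constant and function names are our own.

```python
PRICE_PER_MILLION_BYTES_USD = 15.0   # from the post
PRICE_PER_AUDIO_HOUR_USD = 0.8       # from the post


def text_cost_usd(text):
    """Estimated cost of synthesizing `text`, billed by UTF-8 byte length (assumed)."""
    n_bytes = len(text.encode("utf-8"))
    return n_bytes / 1_000_000 * PRICE_PER_MILLION_BYTES_USD


# Bytes of text per hour of audio implied by the two quoted prices.
implied_bytes_per_hour = (
    PRICE_PER_AUDIO_HOUR_USD / PRICE_PER_MILLION_BYTES_USD * 1_000_000
)  # about 53,333 bytes

english_cost = text_cost_usd("Hello, world!")  # 13 bytes
chinese_cost = text_cost_usd("你好")            # 6 bytes in UTF-8, 2 characters
```

Under this assumption, non-Latin scripts cost more per character than ASCII text, since each character takes several bytes.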

S1 also supports:

Zero-shot & few-shot voice cloning from short samples
Multilingual and cross-lingual TTS
No phoneme dependency, handling arbitrary scripts directly from text

Get Started with OpenAudio S1

You can try OpenAudio S1 today:

Fish Audio Playground (S1): https://fish.audio
S1-mini on Hugging Face: https://huggingface.co/fishaudio/openaudio-s1-mini

Zhizhuo Zhou

Z is a co-founder of Fish Audio and gigachad AI researcher at Stanford focusing on diffusion and 3D generative models. Find him as a barista bartender at exclusive popups, and see his work at zhiz.dev.
