Launching Fish Audio S1: A Frontier Text-to-Speech Audio Foundation Model
Oct 23, 2025

Takeaways
- We introduce Fish Audio S1, a frontier text-to-speech audio foundation model.
- Fish Audio S1 is trained on over 2 million hours of audio with online RLHF (GRPO).
- Fish Audio S1 achieves 0.8% WER and 0.4% CER on Seed TTS Eval.
- S1 supports open-domain emotion, tone, and special effect markers.
Try S1 Now
Try the model for free at Fish Audio: https://fish.audio/app/text-to-speech/
Hugging Face model page: https://huggingface.co/fishaudio/openaudio-s1-mini
Fish Audio S1
S1 comes in two variants:
- S1 (4B) – full-featured flagship model, available on the Fish Audio Playground
- S1-mini (0.5B) – a distilled version for resource-constrained environments, available on Hugging Face
Both models are trained with online RLHF (GRPO) using in-house reward models.
State-of-the-Art Voice Quality
OpenAudio S1 is trained on over 2 million hours of audio, combining large-scale text–audio pairs with rich supervision. By jointly modeling semantic and acoustic information in a single model, S1 avoids the information loss typical of “semantic-only” pipelines and reduces artifacts and word errors.
On Seed TTS Eval (with GPT-4o-based transcription and pyannote-based speaker metrics), S1 achieves:
- WER: 0.008
- CER: 0.004
S1-mini follows closely with:
- WER: 0.011
- CER: 0.005
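For context, these metrics come from transcribing the synthesized audio with an ASR system and comparing the transcript to the input text. The snippet below is a minimal sketch of that comparison using the open-source jiwer package, not the actual Seed TTS Eval pipeline; the reference and hypothesis strings are made up for illustration.

```python
import jiwer

# Illustrative strings only: in a real evaluation, `hypothesis` would be an
# ASR transcript of the synthesized audio and `reference` the input text.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # word error rate
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")  # character error rate
```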
OpenAudio S1 also holds the top Elo score on the Hugging Face TTS-Arena-V2 leaderboard, ranking #1 in human subjective evaluation for naturalness, intelligibility, and similarity.
Voice-Actor-Level Control
Fish Audio S1 enables fine-grained control over emotion and delivery. We trained our own speech-to-text model (to be released soon) to caption audio with emotion, tone, speaker tags, and events, then used it to annotate over 100k hours of audio for instruction-following.
You can guide S1 with emotion markers like (angry), (sad), (in a hurry), (chuckling), and more. Check out the full list of recommended emotion tags here: https://docs.fish.audio/developer-guide/core-features/emotions
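As a minimal illustration, emotion and tone markers are written inline, in parentheses, inside the text to be synthesized. Only the marker syntax below comes from the docs above; how you submit the text (Playground, API, or a local S1-mini checkpoint) is up to you.

```python
# Example prompts with inline emotion/tone markers; the sentence content is
# made up for illustration.
lines = [
    "(angry) I told you to leave the equipment alone!",
    "(sad) It just won't be the same without her.",
    "(in a hurry) Keys, wallet, laptop, let's go, let's go!",
    "(chuckling) Okay, that demo definitely did not go as rehearsed.",
]
for text in lines:
    print(text)  # pass each string as the input text to your chosen S1 endpoint
```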
Global, Multilingual Voices
OpenAudio S1 is designed for global reach. It supports a wide range of languages, including:
English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese
You can mix languages in the same prompt, and the model will adapt naturally to the script and context.
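For example (the sentence content below is illustrative only):

```python
# A mixed-language prompt; per the post, S1 adapts to the script and context
# without any explicit language tags.
mixed_prompt = (
    "Welcome to the launch event. "
    "今日は新しいモデルを発表します。 "
    "Merci à tous d'être venus !"
)
print(mixed_prompt)
```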
Architecture, Speed, and Cost
Under the hood, OpenAudio S1:
- Uses the Qwen3 architecture as a multimodal backbone
- Employs an in-house audio codec similar in spirit to Descript Audio Codec, trained from scratch
- Uses online RLHF with GRPO to optimize for human preferences
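For readers unfamiliar with GRPO, the core idea is to score a group of outputs sampled from the same prompt with a reward model and normalize each score against the rest of the group, which removes the need for a separate value network. The sketch below shows only that group-relative advantage step, with made-up reward values; it is not our training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [num_prompts, group_size] reward-model scores for a group of
    # outputs sampled from the same prompt.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Normalize each sample against its own group: no critic/value network needed.
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled audio outputs each, with made-up scores.
scores = torch.tensor([[0.2, 0.9, 0.5, 0.4],
                       [0.1, 0.3, 0.8, 0.6]])
print(grpo_advantages(scores))
```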
With torch.compile and an optimized inference stack, S1 reaches a real-time factor of about 1:7 on an NVIDIA RTX 4090, i.e. roughly seven seconds of audio generated per second of compute, making it practical for interactive applications.
On the pricing side, S1 is designed to be truly accessible:
- Around $15 per million bytes, roughly $0.80 per hour of audio
This makes high-quality TTS viable even for high-volume or budget-sensitive workloads.
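As a rough sanity check on the per-hour figure (the speaking rate and bytes-per-word below are assumptions for illustration, not official numbers):

```python
# Back-of-the-envelope estimate: at an assumed ~150 spoken words per minute
# and ~6 bytes of input text per word, one hour of speech consumes roughly
# 54,000 bytes of text.
words_per_hour = 150 * 60            # assumed speaking rate
bytes_per_hour = words_per_hour * 6  # assumed average bytes per word
cost_per_hour = bytes_per_hour / 1_000_000 * 15  # at $15 per million bytes
print(f"~${cost_per_hour:.2f} per hour of audio")  # ≈ $0.81
```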
- Zero-shot & few-shot voice cloning from short samples
- Multilingual and cross-lingual TTS
- No phoneme dependency, handling arbitrary scripts directly from text
Get Started with OpenAudio S1
You can try OpenAudio S1 today:
- Fish Audio Playground (S1): https://fish.audio
- S1-mini on Hugging Face: https://huggingface.co/fishaudio/openaudio-s1-mini
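If you want the S1-mini weights locally, here is a minimal sketch using the huggingface_hub client; how you run inference on the downloaded checkpoint depends on the tooling you use alongside it.

```python
from huggingface_hub import snapshot_download

# Download the S1-mini checkpoint from the Hugging Face Hub.
local_dir = snapshot_download("fishaudio/openaudio-s1-mini")
print(f"S1-mini checkpoint downloaded to: {local_dir}")
```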