Launching Fish Audio S1: A Frontier Text-to-Speech Audio Foundation Model

Oct 23, 2025

zhizdev · Research

Takeaways

  • We are launching Fish Audio S1, a frontier text-to-speech audio foundation model.
  • Fish Audio S1 is trained on over 2 million hours of audio with online RLHF (GRPO).
  • Fish Audio S1 achieves 0.8% WER and 0.4% CER on Seed TTS Eval.
  • S1 supports open-domain emotion, tone, and special effect markers.

Try S1 Now

Try the model for free at Fish Audio: https://fish.audio/app/text-to-speech/

Hugging Face model page: https://huggingface.co/fishaudio/openaudio-s1-mini

Fish Audio S1

S1 comes in two variants:

  • S1 (4B) – full-featured flagship model, available on the Fish Audio Playground
  • S1-mini (0.5B) – a distilled version for resource-constrained environments, available on Hugging Face

Both models are trained with online RLHF (GRPO) using in-house reward models.

State-of-the-Art Voice Quality

OpenAudio S1 is trained on over 2 million hours of audio, combining large-scale text–audio pairs with rich supervision. By jointly modeling semantic and acoustic information in a single model, S1 avoids the information loss typical of “semantic-only” pipelines and reduces artifacts and word errors.

On Seed TTS Eval (with GPT-4o-based transcription and pyannote-based speaker metrics), S1 achieves:

  • WER: 0.008
  • CER: 0.004

S1-mini follows closely with:

  • WER: 0.011
  • CER: 0.005
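
For readers less familiar with these metrics: WER and CER measure the word- and character-level edit distance between the reference text and a transcript of the generated audio. As a rough illustration, here is how they are typically computed with the open-source jiwer package (a stand-in for illustration, not the eval's actual GPT-4o transcription pipeline):

    # Illustration only: WER/CER as computed with the open-source jiwer package.
    # Seed TTS Eval's actual pipeline first transcribes the generated audio
    # (the post mentions GPT-4o-based transcription); the transcript below is
    # just a hard-coded stand-in.
    import jiwer

    reference = "the quick brown fox jumps over the lazy dog"
    transcript = "the quick brown fox jumps over a lazy dog"  # pretend ASR output

    print(f"WER: {jiwer.wer(reference, transcript):.3f}")  # word-level error rate
    print(f"CER: {jiwer.cer(reference, transcript):.3f}")  # character-level error rate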

OpenAudio S1 also reaches the top Elo score on the Hugging Face TTS-Arena-V2 leaderboard, ranking #1 in human subjective evaluation for naturalness, intelligibility, and similarity.

Voice-Actor-Level Control

Fish Audio S1 enables fine-grained control over emotion and delivery. We trained our own speech-to-text model (to be released soon) to caption audio with emotion, tone, speaker tags, and events, then used it to annotate over 100k hours of audio for instruction-following.

You can guide S1 with emotion markers like (angry), (sad), (in a hurry), (chuckling), and more. Check out the full list of recommended emotion tags here: https://docs.fish.audio/developer-guide/core-features/emotions
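
As a rough sketch of how this looks in practice, the snippet below simply places the markers inline in the request text. The endpoint, auth header, and payload fields are illustrative placeholders rather than the documented API, so refer to the developer docs above for the real request format:

    # Sketch only: the endpoint, header, and payload fields are placeholders,
    # not the documented Fish Audio API; see the developer docs for the real
    # request format. The point is that emotion markers ride along in the text.
    import os
    import requests

    text = (
        "(in a hurry) Sorry, I can't stay long. "
        "(sad) I really wish this call were longer. "
        "(chuckling) Fine, five more minutes."
    )

    resp = requests.post(
        "https://api.fish.audio/v1/tts",  # assumed endpoint
        headers={"Authorization": f"Bearer {os.environ['FISH_API_KEY']}"},  # assumed auth scheme
        json={"text": text},  # assumed payload shape
    )
    resp.raise_for_status()
    with open("out.mp3", "wb") as f:  # assumed output format
        f.write(resp.content)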

Global, Multilingual Voices

OpenAudio S1 is designed for global reach. It supports a wide range of languages, including:

English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese

You can mix languages in the same prompt, and the model will adapt naturally to the script and context.
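
For example, prompts like the following can be sent as-is as the request text (illustrative strings, not from our training data):

    # Illustrative mixed-script prompts; each string can be sent as-is as the
    # request text, and emotion markers can be combined with them freely.
    prompts = [
        "Welcome aboard! 今日もよろしくお願いします。",  # English + Japanese
        "(chuckling) The demo is live, 快来试试吧!",  # English + Chinese, with an emotion tag
        "Guten Morgen! The full changelog is in the release notes.",  # German + English
    ]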

Architecture, Speed, and Cost

Under the hood, OpenAudio S1:

  • Uses the Qwen3 architecture as a multimodal backbone
  • Employs an in-house audio codec similar in spirit to Descript Audio Codec, trained from scratch
  • Uses online RLHF with GRPO to optimize for human preferences (see the sketch after this list)
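
For readers curious about the RLHF step, here is a minimal, illustrative sketch of GRPO's group-relative advantage computation (not our production training code): a group of candidate clips is generated for the same text, each is scored by a reward model, and the rewards are normalized within the group.

    # Minimal sketch of GRPO's group-relative advantage step (illustrative,
    # not Fish Audio's training code): generate a group of candidate clips for
    # the same text, score each with a reward model, and normalize the rewards
    # within the group.
    from statistics import mean, pstdev

    def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
        """Advantage of each sample relative to its own group of generations."""
        mu, sigma = mean(rewards), pstdev(rewards)
        return [(r - mu) / (sigma + eps) for r in rewards]

    # e.g. reward-model scores for four generations of the same prompt
    print(group_relative_advantages([0.62, 0.71, 0.55, 0.90]))

The normalized advantages then weight the policy update, so generations the reward model prefers are reinforced relative to the rest of their group.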

With torch compile and optimized inference, S1 runs at a real-time factor of about 1:7 on an NVIDIA RTX 4090, making it practical for interactive applications.
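
Assuming the 1:7 figure means one second of compute per seven seconds of generated audio, a quick back-of-the-envelope check:

    # Back-of-the-envelope generation time from the quoted real-time factor,
    # assuming 1:7 means one second of compute per seven seconds of audio.
    rtf = 1 / 7              # seconds of compute per second of generated audio
    audio_seconds = 30.0     # length of the clip to synthesize
    print(f"~{audio_seconds * rtf:.1f} s of compute for a {audio_seconds:.0f} s clip")
    # -> ~4.3 s of compute for a 30 s clip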

On the pricing side, S1 is designed to be truly accessible:

  • Around $15 per million bytes, which works out to roughly $0.80 per hour of generated audio

This makes high-quality TTS viable even for high-volume or budget-sensitive workloads.
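
As a quick sanity check on that figure (the bytes-per-second rate below is an assumed value for English narration, not a published number):

    # Rough sanity check on the quoted price. The bytes-per-second rate is an
    # assumed value for English narration, not a published figure.
    price_per_byte = 15 / 1_000_000  # $15 per million bytes of input text
    bytes_per_second = 15            # ~15 UTF-8 bytes/s of spoken English (assumed)
    cost_per_hour = price_per_byte * bytes_per_second * 3600
    print(f"~${cost_per_hour:.2f} per hour of audio")  # -> ~$0.81 per hour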

Beyond standard TTS, S1 also supports:

  • Zero-shot and few-shot voice cloning from short samples
  • Multilingual and cross-lingual TTS
  • No phoneme dependency, handling arbitrary scripts directly from text

Get Started with OpenAudio S1

You can try OpenAudio S1 today on the Fish Audio Playground (https://fish.audio/app/text-to-speech/), or download S1-mini from Hugging Face (https://huggingface.co/fishaudio/openaudio-s1-mini) to run it yourself.
