Audio Diffusion Models
Nov 30, 2024

Takeaways
- We launch Fish Diffusion, an open-source framework for audio generation
- Fish Diffusion is useful for text-to-speech (TTS), singing voice conversion (SVC), and singing voice synthesis (SVS)
GitHub: https://github.com/fishaudio/fish-diffusion
Core Principle
At its core, the repository is built around modularity:
- Acoustic models should be swappable (diffusion, Grad-TTS-style, GAN-based).
- Conditioning signals (text, speaker, pitch, energy) should be modular (see the sketch below).
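
As a toy illustration of the second point, the sketch below shows one way modular conditioning can look in practice: each signal gets its own small encoder, and the acoustic model only consumes the fused sum of their embeddings. The class, method, and parameter names here are hypothetical and not Fish Diffusion's actual modules.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of modular conditioning: one encoder per signal,
# fused by summation into a single conditioning tensor.
class ConditionFuser(nn.Module):
    def __init__(self, hidden_dim: int = 256, n_speakers: int = 8):
        super().__init__()
        self.text_encoder = nn.Embedding(256, hidden_dim)        # content/phoneme tokens
        self.speaker_encoder = nn.Embedding(n_speakers, hidden_dim)
        self.pitch_encoder = nn.Linear(1, hidden_dim)             # per-frame f0
        self.energy_encoder = nn.Linear(1, hidden_dim)            # per-frame energy

    def forward(self, contents, speaker, pitches, energy):
        # contents: (B, T) token ids, speaker: (B,), pitches/energy: (B, T)
        cond = self.text_encoder(contents)
        cond = cond + self.speaker_encoder(speaker).unsqueeze(1)  # broadcast over time
        cond = cond + self.pitch_encoder(pitches.unsqueeze(-1))
        cond = cond + self.energy_encoder(energy.unsqueeze(-1))
        return cond  # (B, T, hidden_dim), fed to the acoustic model
```

Because each encoder is its own module, replacing any one of them (say, a different pitch representation) leaves the rest of the stack untouched.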
A unified modeling stack
The architectures in the repo all share similar patterns:
- They take in structured batches with keys like contents, speaker, pitches, energy, and lengths.
- They build masks from sequence lengths to avoid computing loss on padding.
- They produce either spectrograms (for diffusion models) or raw waveforms (for GAN models).
Diffusion-based models (such as the DiffSinger and Grad-TTS paths) generate mel-spectrograms conditioned on a fused representation of text and prosody, while HiFiSinger-style models generate waveforms directly, relying on discriminators to enforce realism. Despite these differences, all of them are tied together by the same configuration and training abstractions.
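
The snippet below is a minimal sketch of the shared batch-and-mask pattern, assuming PyTorch tensors. The key names mirror the ones listed above, but the shapes, dtypes, and the masked loss are illustrative rather than the repository's real training code.

```python
import torch

# Illustrative batch with the keys listed above (shapes are assumptions).
batch = {
    "contents": torch.randint(0, 256, (4, 100)),  # (B, T) content tokens/features
    "speaker": torch.tensor([0, 1, 0, 2]),        # (B,) speaker ids
    "pitches": torch.rand(4, 100) * 300,          # (B, T) f0 values
    "energy": torch.rand(4, 100),                 # (B, T) frame energy
    "lengths": torch.tensor([100, 80, 95, 60]),   # (B,) true sequence lengths
}

def length_to_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Boolean mask that is True on real frames and False on padding."""
    positions = torch.arange(max_len, device=lengths.device)
    return positions[None, :] < lengths[:, None]  # (B, T)

mask = length_to_mask(batch["lengths"], batch["contents"].shape[1])

# Example: a masked L1 loss over predicted vs. target mel frames,
# so padded positions never contribute to the gradient.
pred = torch.randn(4, 100, 128)    # (B, T, n_mels), stand-in prediction
target = torch.randn(4, 100, 128)
loss = ((pred - target).abs() * mask.unsqueeze(-1)).sum() / (mask.sum() * 128)
```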
Modular conditioning and registries
Fish Diffusion treats encoders and vocoders as pluggable components. Text encoders, speaker encoders, pitch encoders, and energy encoders are all built through registries, so swapping from one feature extractor or vocoder to another is mostly a config change.
This makes the repo well-suited for:
- Multi-speaker and voice cloning setups
- Prosody-heavy tasks (singing, emotional speech)
- Rapid experimentation with different front-end feature stacks
The same philosophy applies to diffusion models, schedulers, and optimizers, which are also built through registry-based builders.
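
For readers unfamiliar with the registry pattern, here is a small, self-contained sketch of the general idea; the registry, class names, and build_encoder helper are made up for illustration and are not Fish Diffusion's actual API. Components register themselves by name, and a builder instantiates whichever one the config asks for.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a string name to a component class.
ENCODERS: Dict[str, Callable] = {}

def register_encoder(name: str):
    def wrapper(cls):
        ENCODERS[name] = cls
        return cls
    return wrapper

@register_encoder("naive_pitch")
class NaivePitchEncoder:
    def __init__(self, hidden_dim: int = 256):
        self.hidden_dim = hidden_dim

@register_encoder("crepe_pitch")
class CrepePitchEncoder:
    def __init__(self, hidden_dim: int = 256, model_size: str = "full"):
        self.hidden_dim = hidden_dim
        self.model_size = model_size

def build_encoder(config: dict):
    # config looks like {"type": "crepe_pitch", "hidden_dim": 192, ...}.
    kwargs = {k: v for k, v in config.items() if k != "type"}
    return ENCODERS[config["type"]](**kwargs)

pitch_encoder = build_encoder({"type": "crepe_pitch", "hidden_dim": 192})
```

The payoff is that swapping one pitch extractor (or vocoder, scheduler, optimizer) for another touches only the config, not the training loop.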
Try Our Latest Frontier Audio Model
You can try OpenAudio S1 today:
- Fish Audio Playground (S1): https://fish.audio
- S1-mini on Hugging Face: https://huggingface.co/fishaudio/openaudio-s1-mini