Audio Diffusion Models
Nov 30, 2024

Takeaways
- We launch Fish Diffusion, an open-source framework for audio generation
- Fish Diffusion is useful for text-to-speech (TTS), singing voice conversion (SVC), and singing voice synthesis (SVS)
GitHub: https://github.com/fishaudio/fish-diffusion
Core Principle
At its core, the repository is built around modularity:
- Acoustic models should be swappable (diffusion, Grad-TTS-style, GAN-based).
- Conditioning signals (text, speaker, pitch, energy) should be modular (see the sketch below).
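
As a toy illustration of the second point, the sketch below shows one way modular conditioning can look in practice: each signal gets its own small encoder, and the acoustic model only consumes the fused sum of their embeddings. The class, method, and parameter names here are hypothetical and not Fish Diffusion's actual modules.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of modular conditioning: one encoder per signal,
# fused by summation into a single conditioning tensor.
class ConditionFuser(nn.Module):
    def __init__(self, hidden_dim: int = 256, n_speakers: int = 8):
        super().__init__()
        self.text_encoder = nn.Embedding(256, hidden_dim)        # content/phoneme tokens
        self.speaker_encoder = nn.Embedding(n_speakers, hidden_dim)
        self.pitch_encoder = nn.Linear(1, hidden_dim)             # per-frame f0
        self.energy_encoder = nn.Linear(1, hidden_dim)            # per-frame energy

    def forward(self, contents, speaker, pitches, energy):
        # contents: (B, T) token ids, speaker: (B,), pitches/energy: (B, T)
        cond = self.text_encoder(contents)
        cond = cond + self.speaker_encoder(speaker).unsqueeze(1)  # broadcast over time
        cond = cond + self.pitch_encoder(pitches.unsqueeze(-1))
        cond = cond + self.energy_encoder(energy.unsqueeze(-1))
        return cond  # (B, T, hidden_dim), fed to the acoustic model
```

Because each encoder is its own module, replacing any one of them (say, a different pitch representation) leaves the rest of the stack untouched.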
A unified modeling stack
The architectures in the repo all share similar patterns:
- They take in structured batches with keys like contents, speaker, pitches, energy, and lengths.
- They build masks from sequence lengths to avoid computing loss on padding.
- They produce either spectrograms (for diffusion models) or raw waveforms (for GAN models).
Diffusion-based models (such as the DiffSinger and Grad-TTS paths) generate mel-spectrograms conditioned on a fused representation of text and prosody, while HiFiSinger-style models generate waveforms directly, relying on discriminators to enforce realism. Despite these differences, all of them are tied together by the same configuration and training abstractions.
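
The snippet below is a minimal sketch of the shared batch-and-mask pattern, assuming PyTorch tensors. The key names mirror the ones listed above, but the shapes, dtypes, and the masked loss are illustrative rather than the repository's real training code.

```python
import torch

# Illustrative batch with the keys listed above (shapes are assumptions).
batch = {
    "contents": torch.randint(0, 256, (4, 100)),  # (B, T) content tokens/features
    "speaker": torch.tensor([0, 1, 0, 2]),        # (B,) speaker ids
    "pitches": torch.rand(4, 100) * 300,          # (B, T) f0 values
    "energy": torch.rand(4, 100),                 # (B, T) frame energy
    "lengths": torch.tensor([100, 80, 95, 60]),   # (B,) true sequence lengths
}

def length_to_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Boolean mask that is True on real frames and False on padding."""
    positions = torch.arange(max_len, device=lengths.device)
    return positions[None, :] < lengths[:, None]  # (B, T)

mask = length_to_mask(batch["lengths"], batch["contents"].shape[1])

# Example: a masked L1 loss over predicted vs. target mel frames,
# so padded positions never contribute to the gradient.
pred = torch.randn(4, 100, 128)    # (B, T, n_mels), stand-in prediction
target = torch.randn(4, 100, 128)
loss = ((pred - target).abs() * mask.unsqueeze(-1)).sum() / (mask.sum() * 128)
```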
Modular conditioning and registries
Fish Diffusion treats encoders and vocoders as pluggable components. Text encoders, speaker encoders, pitch encoders, and energy encoders are all built through registries, so swapping from one feature extractor or vocoder to another is mostly a config change.
This makes the repo well-suited for:
- Multi-speaker and voice cloning setups
- Prosody-heavy tasks (singing, emotional speech)
- Rapid experimentation with different front-end feature stacks
The same philosophy applies to diffusion models, schedulers, and optimizers, which are also built through registry-based builders.
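
For readers unfamiliar with the registry pattern, here is a small, self-contained sketch of the general idea; the registry, class names, and build_encoder helper are made up for illustration and are not Fish Diffusion's actual API. Components register themselves by name, and a builder instantiates whichever one the config asks for.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a string name to a component class.
ENCODERS: Dict[str, Callable] = {}

def register_encoder(name: str):
    def wrapper(cls):
        ENCODERS[name] = cls
        return cls
    return wrapper

@register_encoder("naive_pitch")
class NaivePitchEncoder:
    def __init__(self, hidden_dim: int = 256):
        self.hidden_dim = hidden_dim

@register_encoder("crepe_pitch")
class CrepePitchEncoder:
    def __init__(self, hidden_dim: int = 256, model_size: str = "full"):
        self.hidden_dim = hidden_dim
        self.model_size = model_size

def build_encoder(config: dict):
    # config looks like {"type": "crepe_pitch", "hidden_dim": 192, ...}.
    kwargs = {k: v for k, v in config.items() if k != "type"}
    return ENCODERS[config["type"]](**kwargs)

pitch_encoder = build_encoder({"type": "crepe_pitch", "hidden_dim": 192})
```

The payoff is that swapping one pitch extractor (or vocoder, scheduler, optimizer) for another touches only the config, not the training loop.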
Try Our Latest Frontier Audio Model
You can try OpenAudio S1 today:
- Fish Audio Playground (S1): https://fish.audio
- S1-mini on Hugging Face: https://huggingface.co/fishaudio/openaudio-s1-mini