Audio Diffusion Models

Nov 30, 2024

lengyue · Research

Takeaways

  • We are launching Fish Diffusion, an open-source framework for audio generation
  • Fish Diffusion supports text-to-speech (TTS), singing voice conversion (SVC), and singing voice synthesis (SVS)

GitHub: https://github.com/fishaudio/fish-diffusion

Core Principle

At its core, the repository is built around modularity:

  1. Acoustic models should be swappable (diffusion, Grad-TTS-style, GAN-based).
  2. Conditioning signals (text, speaker, pitch, energy) should be modular.

A unified modeling stack

The architectures in the repo all share similar patterns:

  • They take in structured batches with keys like contents, speaker, pitches, energy, and lengths.
  • They build masks from sequence lengths to avoid computing loss on padding.
  • They produce either spectrograms (for diffusion models) or raw waveforms (for GAN models).
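As a rough illustration of the first two points, a batch and its padding mask might look like the sketch below. The key names follow the bullets above, but the shapes and the masked-loss helper are illustrative assumptions, not the repo's exact code.

```python
import torch

# Hypothetical batch following the keys described above (contents, speaker,
# pitches, energy, lengths); shapes are illustrative, not the repo's layout.
batch = {
    "contents": torch.randn(2, 128, 256),   # (batch, frames, feature dim)
    "speaker": torch.tensor([0, 3]),        # speaker IDs
    "pitches": torch.randn(2, 128),         # per-frame f0
    "energy": torch.randn(2, 128),          # per-frame energy
    "lengths": torch.tensor([128, 96]),     # true (unpadded) frame counts
}

# Build a boolean mask from lengths so padded frames are excluded from the loss.
max_len = batch["contents"].shape[1]
mask = torch.arange(max_len)[None, :] < batch["lengths"][:, None]  # (batch, frames)

# Example: masked L1 loss between a predicted and a target mel-spectrogram.
pred = torch.randn(2, 128, 80)
target = torch.randn(2, 128, 80)
loss = ((pred - target).abs() * mask[..., None]).sum() / (mask.sum() * 80)
```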

Diffusion-based models (like the DiffSinger/Grad-TTS paths) focus on generating mel-spectrograms conditioned on a fused representation of text and prosody. HiFiSinger-style models generate waveforms directly, relying on discriminators to enforce realism. Despite these differences, they are tied together by the same configuration and training abstractions.

Modular conditioning and registries

Fish Diffusion treats encoders and vocoders as pluggable components. Text encoders, speaker encoders, pitch encoders, and energy encoders are all built through registries, so swapping from one feature extractor or vocoder to another is mostly a config change.
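The registry idea itself is simple in spirit. The sketch below is a minimal stand-in, not the repo's actual API (the `Registry` class, its methods, and the encoder name are assumptions): a component class is registered under a string name and later built from a config dict, so changing the `type` field swaps the component.

```python
# Minimal registry sketch; Registry, register, and build are hypothetical
# stand-ins for the framework's actual builders.
class Registry:
    def __init__(self):
        self._classes = {}

    def register(self, cls):
        self._classes[cls.__name__] = cls
        return cls

    def build(self, config):
        # Pop the type name and pass the remaining keys as constructor kwargs.
        config = dict(config)
        cls = self._classes[config.pop("type")]
        return cls(**config)


TEXT_ENCODERS = Registry()


@TEXT_ENCODERS.register
class HubertSoftEncoder:  # hypothetical content encoder
    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path


# Swapping the feature extractor is mostly a config change:
encoder = TEXT_ENCODERS.build(
    {"type": "HubertSoftEncoder", "checkpoint_path": "hubert-soft.pt"}
)
```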

This makes the repo well-suited for:

  • Multi-speaker and voice cloning setups
  • Prosody-heavy tasks (singing, emotional speech)
  • Rapid experimentation with different front-end feature stacks

The same philosophy applies to diffusion models, schedulers, and optimizers, which are also constructed from registry-based builders.
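Put together, a training config can describe the whole stack declaratively. The snippet below is a sketch of what such a config might look like; the component names, fields, and values are illustrative assumptions rather than entries copied from the repo.

```python
# Illustrative config sketch: each component is named by "type" and built
# through its registry, so swapping models, schedulers, or optimizers means
# editing this file rather than the training code.
model = dict(
    type="DiffSinger",
    text_encoder=dict(type="NaiveProjectionEncoder", input_size=256, output_size=256),
    speaker_encoder=dict(type="NaiveProjectionEncoder", input_size=10, output_size=256),
    pitch_encoder=dict(type="NaiveProjectionEncoder", input_size=1, output_size=256),
    diffusion=dict(type="GaussianDiffusion", mel_channels=128, timesteps=1000),
    vocoder=dict(type="NsfHifiGAN", checkpoint_path="checkpoints/nsf_hifigan.ckpt"),
)

optimizer = dict(type="AdamW", lr=8e-4, weight_decay=1e-2)
scheduler = dict(type="StepLR", step_size=50000, gamma=0.5)
```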

Try Our Latest Frontier Audio Model

You can try OpenAudio S1 today.
