November 20, 2025 · Research

Audio Diffusion Models

Takeaways

  • We launch Fish Diffusion, an open-source framework for audio generation
  • Fish Diffusion is useful for text-to-speech (TTS), singing voice conversion (SVC), and singing voice synthesis (SVS)

GitHub: https://github.com/fishaudio/fish-diffusion

Core Principle

At its core, the repository is built around modularity:

  1. Acoustic models should be swappable (diffusion, Grad-TTS-style, GAN-based).
  2. Conditioning signals (text, speaker, pitch, energy) should be modular.

A unified modeling stack

The architectures in the repo all share similar patterns:

  • They take in structured batches with keys like contents, speaker, pitches, energy, and lengths.
  • They build masks from sequence lengths to avoid computing loss on padding.
  • They produce either spectrograms (for diffusion models) or raw waveforms (for GAN models).
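
The masking step in the list above can be sketched as follows. This is a minimal illustration in plain Python, not code from the repository: the helper names lengths_to_mask and masked_l1, the toy batch, and the choice of an L1 loss are all assumptions made for the example. A real model would use tensor operations, but the logic is the same: build a boolean mask from the lengths, then average the loss over real frames only.

```python
def lengths_to_mask(lengths, max_len=None):
    """Return a [batch][time] grid of booleans: True for real frames, False for padding."""
    if max_len is None:
        max_len = max(lengths)
    return [[t < n for t in range(max_len)] for n in lengths]


def masked_l1(pred, target, mask):
    """Mean absolute error computed over real (unmasked) frames only."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    # Guard against an all-padding batch.
    return total / max(count, 1)


# Hypothetical batch: two sequences padded to 4 frames, one value per frame.
batch = {
    "lengths": [4, 2],
    "target": [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 0.0, 0.0]],
}
# Toy predictions; the 9.0 values sit in padded positions of the second sequence.
pred = [[1.0, 2.0, 3.0, 5.0], [1.0, 3.0, 9.0, 9.0]]

mask = lengths_to_mask(batch["lengths"])
loss = masked_l1(pred, batch["target"], mask)
print(mask)
print(loss)
```

The two padded frames of the second sequence hold large errors, but the mask excludes them, so they do not contribute to the loss.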

Diffusion-based models (such as the DiffSinger and Grad-TTS paths) generate mel-spectrograms conditioned on a fused representation of text and prosody. HiFiSinger-style models generate waveforms directly and rely on discriminators to enforce realism. Despite these differences, both families share the same configuration and training abstractions.

Modular conditioning and registries

Fish Diffusion treats encoders and vocoders as pluggable components. Text encoders, speaker encoders, pitch encoders, and energy encoders are all built through registries, so swapping from one feature extractor or vocoder to another is mostly a config change.

This makes the repo well-suited for:

  • Multi-speaker and voice cloning setups
  • Prosody-heavy tasks (singing, emotional speech)
  • Rapid experimentation with different front-end feature stacks

The same philosophy applies to diffusion models, schedulers, and optimizers, which are also constructed from registry-based builders.
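
To make the registry idea concrete, here is a minimal sketch of the general pattern in plain Python. It is not the repository's actual registry implementation: the Registry class, the PITCH_ENCODERS name, both encoder classes, and the config layout with a "type" key are hypothetical and chosen only for illustration.

```python
import math


class Registry:
    """Maps a string name to a class so configs can refer to components by name."""

    def __init__(self, name):
        self.name = name
        self._classes = {}

    def register(self, cls):
        # Used as a decorator; the class name becomes the lookup key.
        self._classes[cls.__name__] = cls
        return cls

    def build(self, cfg):
        # cfg is a dict whose "type" key selects the class; the rest are kwargs.
        kwargs = dict(cfg)
        type_name = kwargs.pop("type")
        if type_name not in self._classes:
            raise KeyError(f"{type_name!r} is not registered in {self.name}")
        return self._classes[type_name](**kwargs)


PITCH_ENCODERS = Registry("pitch_encoders")  # hypothetical registry name


@PITCH_ENCODERS.register
class LinearPitchEncoder:  # hypothetical component
    def __init__(self, scale=1.0):
        self.scale = scale

    def __call__(self, pitches):
        return [p * self.scale for p in pitches]


@PITCH_ENCODERS.register
class LogPitchEncoder:  # hypothetical component
    def __call__(self, pitches):
        return [math.log(1.0 + p) for p in pitches]


# The config names the component; the training code only calls build().
config = {"pitch_encoder": {"type": "LinearPitchEncoder", "scale": 0.5}}
encoder = PITCH_ENCODERS.build(config["pitch_encoder"])
encoded = encoder([220.0, 440.0])

# Swapping the encoder touches the config only, not the calling code.
config["pitch_encoder"] = {"type": "LogPitchEncoder"}
log_encoder = PITCH_ENCODERS.build(config["pitch_encoder"])
```

Because the calling code depends only on build() and the config, the same pattern extends to vocoders, schedulers, and optimizers: each gets its own registry and is selected by name.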

Try Our Latest Frontier Audio Model

You can try OpenAudio S1 today.

Shijia Liao

Lengyue is the founder of Fish Audio and a cracked researcher pushing breakthroughs in Voice AI. Follow his work at @lengyuematrix.
