November 20, 2025 · Research

Audio Diffusion Models

Shijia Liao, Chief Scientist

Takeaways

We are launching Fish Diffusion, an open-source framework for audio generation.
Fish Diffusion is useful for text-to-speech (TTS), singing voice conversion (SVC), and singing voice synthesis (SVS).

GitHub: https://github.com/fishaudio/fish-diffusion

Core Principle

At its core, the repository is built around modularity:

Acoustic models should be swappable (diffusion, Grad-TTS-style, GAN-based).
Conditioning signals (text, speaker, pitch, energy) should be modular.

A unified modeling stack

The architectures in the repo all share similar patterns:

They take in structured batches with keys like contents, speaker, pitches, energy, and lengths.
They build masks from sequence lengths to avoid computing loss on padding.
They produce either spectrograms (for diffusion models) or raw waveforms (for GAN models).
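
The batch-and-mask pattern in the list above can be sketched without any framework. This is a minimal illustration, not code from the repository: the helper names `lengths_to_mask` and `masked_l1` are hypothetical, the toy values are made up, and a real model would do the same work with tensor operations.

```python
from __future__ import annotations


def lengths_to_mask(lengths: list[int], max_len: int | None = None) -> list[list[bool]]:
    """Return one row per sequence: True for real frames, False for padding."""
    if max_len is None:
        max_len = max(lengths, default=0)
    return [[t < n for t in range(max_len)] for n in lengths]


def masked_l1(pred: list[list[float]], target: list[list[float]],
              mask: list[list[bool]]) -> float:
    """Mean absolute error over unmasked positions only."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    return total / max(count, 1)


# Hypothetical padded batch of two sequences (3 frames and 1 frame).
batch = {
    "lengths": [3, 1],
    "target": [[1.0, 2.5, 3.0], [2.0, 0.0, 0.0]],  # zeros are padding
}
pred = [[1.0, 2.0, 3.0], [1.0, 9.0, 9.0]]  # values on padding are ignored

mask = lengths_to_mask(batch["lengths"])
loss = masked_l1(pred, batch["target"], mask)
print(loss)  # 0.375
```

The large errors on the padded frames of the second sequence do not affect the result, which is the point of building the mask from `lengths`.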

Diffusion-based models (like the DiffSinger and Grad-TTS paths) generate mel-spectrograms conditioned on a fused representation of text and prosody. HiFiSinger-style models generate waveforms directly and rely on discriminators to enforce realism. Despite these differences, both families share the same configuration and training abstractions.
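
One common way to form a fused conditioning representation is to bring every signal to the same per-frame width and add them together. The sketch below assumes that approach purely for illustration. The article does not say how Fish Diffusion fuses its features, and `fuse_conditions` is a hypothetical helper.

```python
from __future__ import annotations


def fuse_conditions(features: dict[str, list[list[float]]]) -> list[list[float]]:
    """Sum per-frame feature vectors from whichever conditioning sources are present.

    Each value is a [frames][dim] list; all sources must share the same shape.
    """
    if not features:
        raise ValueError("at least one conditioning source is required")

    streams = list(features.values())
    n_frames = len(streams[0])
    dim = len(streams[0][0]) if n_frames else 0

    for name, stream in features.items():
        if len(stream) != n_frames or any(len(frame) != dim for frame in stream):
            raise ValueError(f"shape mismatch in conditioning source {name!r}")

    return [
        [sum(stream[t][d] for stream in streams) for d in range(dim)]
        for t in range(n_frames)
    ]


# Toy, already-projected features for two frames of width 2.
fused = fuse_conditions({
    "text": [[1.0, 2.0], [3.0, 4.0]],
    "pitch": [[0.5, 0.5], [1.0, 1.0]],
})
print(fused)  # [[1.5, 2.5], [4.0, 5.0]]
```

Under this assumption, dropping a signal such as energy just means leaving its entry out of the dictionary, so the downstream model does not need to change.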

Modular conditioning and registries

Fish Diffusion treats encoders and vocoders as pluggable components. Text encoders, speaker encoders, pitch encoders, and energy encoders are all built through registries, so swapping from one feature extractor or vocoder to another is mostly a config change.

This makes the repo well-suited for:

Multi-speaker and voice cloning setups
Prosody-heavy tasks (singing, emotional speech)
Rapid experimentation with different front-end feature stacks

The same philosophy applies to diffusion models, schedulers, and optimizers, which are also constructed from registry-based builders.
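
The registry idea can be shown with a small stand-alone sketch. Everything here is hypothetical and written for illustration: the `Registry` class, the `PITCH_ENCODERS` name, and both encoder classes are assumptions, not the repository's actual API. The point is only that a component is chosen by the `type` string in a config dictionary.

```python
from __future__ import annotations

from typing import Any, Callable


class Registry:
    """Maps a string name to a class so components can be built from config."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._modules: dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(cls: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._modules:
                raise KeyError(f"{name!r} is already registered in {self.name}")
            self._modules[name] = cls
            return cls
        return decorator

    def build(self, cfg: dict[str, Any]) -> Any:
        cfg = dict(cfg)  # copy so the caller's config is not mutated
        type_name = cfg.pop("type")
        if type_name not in self._modules:
            raise KeyError(f"{type_name!r} is not registered in {self.name}")
        return self._modules[type_name](**cfg)


PITCH_ENCODERS = Registry("pitch_encoders")


@PITCH_ENCODERS.register("IdentityPitchEncoder")
class IdentityPitchEncoder:
    def __call__(self, pitches_hz: list[float]) -> list[float]:
        return list(pitches_hz)


@PITCH_ENCODERS.register("QuantizedPitchEncoder")
class QuantizedPitchEncoder:
    def __init__(self, step_hz: float = 50.0) -> None:
        self.step_hz = step_hz

    def __call__(self, pitches_hz: list[float]) -> list[int]:
        # Bucket each pitch value into a bin of width step_hz.
        return [int(p // self.step_hz) for p in pitches_hz]


# Swapping the encoder is a change to this dictionary only.
config = {"type": "QuantizedPitchEncoder", "step_hz": 100.0}
encoder = PITCH_ENCODERS.build(config)
print(encoder([220.0, 440.0]))  # [2, 4]
```

Changing `type` to `"IdentityPitchEncoder"` (and dropping `step_hz`) would select the other encoder without touching the code that calls `build`.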

Try Our Latest Frontier Audio Model

You can try OpenAudio S1 today:

Fish Audio Playground (S1): https://fish.audio
S1-mini on Hugging Face: https://huggingface.co/fishaudio/openaudio-s1-mini

Shijia Liao is the founder and Chief Scientist of Fish Audio.
