What does this fork signal mean?

SiliconFlow forked Saganaki22/ComfyUI-FishAudioS2 as siliconflow/ComfyUI-FishAudioS2. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo siliconflow/ComfyUI-FishAudioS2 · parent Saganaki22/ComfyUI-FishAudioS2 · ComfyUI custom node for the Fish Audio S2 audio generation model. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

SiliconFlow Fork: siliconflow/ComfyUI-FishAudioS2

Captured source

GitHub/github.com/siliconflow/ComfyUI-FishAudioS2

siliconflow/ComfyUI-FishAudioS2 repository metadata

published Mar 30, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

siliconflow/ComfyUI-FishAudioS2

Description: ComfyUI custom nodes for Fish Audio S2-Pro TTS — voice clone, multi-speaker, and text-to-speech

License: NOASSERTION

Stars: 0

Forks: 0

Open issues: 1

Created: 2026-03-30T05:44:20Z

Pushed: 2026-03-30T06:49:22Z

Default branch: main

Fork: yes

Parent repository: Saganaki22/ComfyUI-FishAudioS2

Archived: no

README:

---

https://github.com/user-attachments/assets/d69377a6-1c28-40d0-a61a-ba27237e6801

---

🎵 Overview

Fish Audio S2 Pro is a state-of-the-art text-to-speech model with fine-grained inline control of prosody and emotion. Trained on 10M+ hours of audio data across 83 languages with 1500+ emotive tags, it combines reinforcement learning alignment with a Dual-Autoregressive architecture for speech that sounds natural, realistic, and emotionally rich.

Paper: Fish Audio S2 Technical Report (arXiv:2603.08823)

This ComfyUI wrapper provides native node-based integration with:

Zero-shot voice cloning from 10-30 second reference audio
Inline emotion/prosody control with [tag] syntax
Multi-speaker conversation synthesis in a single pass
Per-speaker audio isolation for multi-speaker lip sync workflows
83 language support with automatic detection

---

✨ Features

Zero-Shot Voice Cloning – Clone any voice from 10-30 seconds of reference audio
1500+ Emotive Tags – Fine-grained control with [laugh], [whisper], [excited], [sad], etc.
83 Languages – Full multilingual support without phoneme preprocessing
Multi-Speaker TTS – Generate conversations with multiple cloned voices in one pass
Per-Speaker Audio Isolation – Separate audio tracks for each speaker (lip sync workflows)
Native ComfyUI Integration – AUDIO noodle inputs, progress bars, interruption support
Optimized Performance – Support for bf16/fp16/fp32 dtypes, SDPA, FlashAttention, SageAttention
Smart Auto-Download – Model weights auto-downloaded from HuggingFace on first use
Smart Caching – Optional model caching with automatic unloading on config change
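
The inline `[tag]` syntax listed above can be illustrated with a small parser. This is a minimal sketch, not the node's own code: the helper name `split_tags` and the regular expression are assumptions made for illustration, and only the tag names (`[laugh]`, `[whisper]`, `[excited]`) come from the feature list.

```python
from __future__ import annotations

import re

# Matches a square-bracketed tag such as [laugh] or [whisper].
# Assumption: a tag starts with a letter and contains only letters,
# digits, spaces, underscores, or hyphens.
TAG_PATTERN = re.compile(r"\[([A-Za-z][A-Za-z0-9 _-]*)\]")


def split_tags(text: str) -> tuple[list[str], str]:
    """Return (tags in order of appearance, text with the tags removed)."""
    tags = TAG_PATTERN.findall(text)
    plain = TAG_PATTERN.sub("", text)
    # Collapse the whitespace left behind where tags were removed.
    plain = re.sub(r"\s+", " ", plain).strip()
    return tags, plain


if __name__ == "__main__":
    prompt = "[excited] We shipped it! [laugh] Keep it quiet. [whisper] Nobody knows yet."
    tags, plain = split_tags(prompt)
    print(tags)
    print(plain)
```

A helper like this could be useful for checking a prompt before synthesis, for example to list which tags a script uses. How the model itself interprets the tags is not covered by this excerpt.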

---

Requirements

GPU: NVIDIA GPU with 24GB+ VRAM for the full model (RTX 3090/4090, A5000, etc.)
20GB+ VRAM works with the FP8 quantized model (s2-pro-fp8, ~15 it/s, requires RTX 4090/5090 or Ada/Blackwell GPU)
18GB+ VRAM works with BNB INT8 on-the-fly quantization (~10-11 it/s)
16GB+ VRAM works with BNB NF4 4-bit on-the-fly quantization (~10-11 it/s)
CPU/MPS: ⚠️ EXPERIMENTAL, ~1.5-2 seconds per token
Python: 3.10+
CUDA: 11.8+ (for GPU inference)

> ⚠️ BNB On-the-Fly Quantization Requirements:
>
> BNB INT8 and BNB NF4 options use the s2-pro (bf16) model and quantize on-the-fly via bitsandbytes.
>
> Install bitsandbytes:
> ```bash
> pip install bitsandbytes
> ```
>
> Note: BNB options run at ~10-11 it/s vs ~15 it/s for FP8. They work on any NVIDIA GPU without special hardware requirements.

---

Models

| Model | VRAM | Speed | Description |
|-------|------|-------|-------------|
| s2-pro | ~24GB | ~15-17 it/s | Full precision (4B params): best quality, works out of the box. 15 it/s baseline, 17 it/s with SageAttention |
| s2-pro-fp8 | ~20GB | ~15 it/s | FP8 weight-only quantized: recommended for 20GB+ Ada/Blackwell GPUs (RTX 4090/5090), no extra dependencies |
| BNB INT8 | ~18GB | ~10-11 it/s | On-the-fly INT8 quantization via bitsandbytes: uses s2-pro model, requires bitsandbytes |
| BNB NF4 | ~16GB | ~10-11 it/s | On-the-fly 4-bit NF4 quantization via bitsandbytes: uses s2-pro model, requires bitsandbytes |

Models are auto-downloaded from HuggingFace on first use:

fishaudio/s2-pro — full model
drbaph/s2-pro-fp8 — FP8 quantized
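
The VRAM tiers and repositories above can be turned into a small selection helper. This is a sketch under stated assumptions: the `Variant` class, the `pick_variant` function, and the preference order (full model first, then FP8, then INT8, then NF4) are hypothetical and not part of the node. The VRAM figures and repository IDs are copied from the table and list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str                 # label used in the Models table
    min_vram_gb: int          # approximate VRAM requirement from the table
    repo_id: str              # HuggingFace repository holding the weights
    needs_fp8_gpu: bool       # Ada/Blackwell GPU (e.g. RTX 4090/5090) required
    needs_bitsandbytes: bool  # bitsandbytes must be installed


# Assumed preference order: full precision, then FP8, then the BNB options.
# Both BNB options quantize the s2-pro weights on the fly, so they share its repo.
VARIANTS = [
    Variant("s2-pro", 24, "fishaudio/s2-pro", False, False),
    Variant("s2-pro-fp8", 20, "drbaph/s2-pro-fp8", True, False),
    Variant("BNB INT8", 18, "fishaudio/s2-pro", False, True),
    Variant("BNB NF4", 16, "fishaudio/s2-pro", False, True),
]


def pick_variant(vram_gb: float, fp8_gpu: bool = False,
                 has_bitsandbytes: bool = False) -> Optional[Variant]:
    """Return the first variant whose requirements are met, or None."""
    for variant in VARIANTS:
        if vram_gb < variant.min_vram_gb:
            continue
        if variant.needs_fp8_gpu and not fp8_gpu:
            continue
        if variant.needs_bitsandbytes and not has_bitsandbytes:
            continue
        return variant
    return None


if __name__ == "__main__":
    choice = pick_variant(20, fp8_gpu=True)
    print(choice.name if choice else "no GPU variant fits")
```

Returning `None` when nothing fits leaves the caller to decide whether to fall back to the experimental CPU/MPS path.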

---

Tested Configurations

v0.4.0 is tested and working with PyTorch 2.10+cu13.

| | Standalone env | Shared ComfyUI env | FP8 (RTX 4090/5090) |
|---|---|---|---|
| Python | 3.10 – 3.13 | 3.10 – 3.13 | 3.10 – 3.13 |
| PyTorch | 2.x + CUDA 11.8+ | managed by ComfyUI | 2.x + CUDA 11.8+ |
| torchaudio | any (2.9+ supported) | any (2.9+ supported) | any (2.9+ supported) |
| protobuf | any (not touched) | any (not touched) | any (not touched) |
| descript-audio-codec | 1.0.0 (--no-deps) | 1.0.0 (--no-deps) | 1.0.0 (--no-deps) |
| descript-audiotools | 0.7.2 (--no-deps) | 0.7.2 (--no-deps) | 0.7.2 (--no-deps) |
| transformers | ≥4.45.2 | ≥4.45.2 | ≥4.45.2 |
| bitsandbytes | optional (NF4/INT8) | optional (NF4/INT8) | not needed |
| VRAM | 24GB+ / 16GB+ (BNB) | 24GB+ / 16GB+ (BNB) | 20GB+ (Ada/Blackwell) |
| GPU | any NVIDIA | any NVIDIA | RTX 4090/5090 or Ada/Blackwell |

> As of v0.3.0, descript-audio-codec, descript-audiotools, and protobuf are never installed or modified by pip install -r requirements.txt. The two audio packages are auto-installed at first startup with --no-deps, leaving your environment's protobuf version untouched.
>
> As of v0.3.6, all transitive runtime dependencies of dac/audiotools (flatten-dict, importlib-resources, julius, randomname, ffmpy, argbind) are also auto-installed, fixing fresh-install failures on clean portable environments.
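
The Python and transformers minimums from the table above can be checked before loading the node. This is a rough sketch, not something the repository ships: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helpers, and the version parsing is deliberately simple (it ignores pre-release suffixes and local tags such as `+cu118`).

```python
from __future__ import annotations

import sys
from importlib import metadata


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '4.45.2' or '2.1.0+cu118' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings by their numeric components."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment() -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        if not meets_minimum(installed, "4.45.2"):
            problems.append(f"transformers {installed} is older than 4.45.2")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

For stricter comparisons, the `packaging` library's version parser would be a more robust choice than the hand-rolled tuple comparison used here.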

---

Installation

Method 1: ComfyUI Manager (Recommended)

1. Open ComfyUI Manager
2. Search for "FishAudioS2"
3. Click Install
4. Restart ComfyUI

Method 2: Manual Installation

cd ComfyUI/custom_nodes
git clone https://github.com/saganaki22/ComfyUI-FishAudioS2.git
cd ComfyUI-FishAudioS2
pip install -r requirements.txt

> Note: descript-audio-codec and descript-audiotools are intentionally absent from requirements.txt. The node auto-installs them at ComfyUI startup with --no-deps so that their protobuf requirement does not change your environment.
>
> If auto-install fails at startup, install them manually **with --no-deps** (omitting this flag can break other ComfyUI nodes that need protobuf 5.x):
> ```bash
> pip install descript-audio-codec --no-deps
> pip install "descript-audiotools>=0.7.2" --no-deps
> ```
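
The startup auto-install described in the note above might look roughly like the following. This is an illustrative sketch only and not the node's actual implementation: the function names are hypothetical, and the import names `dac` and `audiotools` are assumed to correspond to the two pip packages. The `--no-deps` flag and the requirement strings come from the note.

```python
import importlib.util
import subprocess
import sys

# (import name, pip requirement) pairs. The import names are an assumption.
AUDIO_PACKAGES = [
    ("dac", "descript-audio-codec"),
    ("audiotools", "descript-audiotools>=0.7.2"),
]


def pip_no_deps_command(requirement):
    """Build a pip command that installs one requirement without its dependencies."""
    # Using sys.executable targets the interpreter that is running ComfyUI.
    return [sys.executable, "-m", "pip", "install", requirement, "--no-deps"]


def missing_packages(packages=AUDIO_PACKAGES):
    """Return the requirements whose import name cannot be found."""
    return [req for module, req in packages
            if importlib.util.find_spec(module) is None]


def install_missing(dry_run=True):
    """Return the pip commands for missing packages; run them unless dry_run."""
    commands = [pip_no_deps_command(req) for req in missing_packages()]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in install_missing(dry_run=True):
        print(" ".join(command))
```

The dry-run default prints the commands instead of executing them, which makes it easy to confirm that `--no-deps` is present before anything in the environment changes.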

> [!CAUTION]
> Never run `pip install git+https://github.com/fishaudio/fish-speech`
> fish-speech is already bundled inside this node. Running that command will downgrade PyTorch and other core packages, potentially breaking...

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Routine fork by same org, no novelty.