
Fish Audio Releases Fish Audio S2: A New Generation of Text-to-Speech (TTS) with Intuitive Controlled Emotion

The Text-to-Speech (TTS) landscape ranges from modular pipelines to large integrated audio models (LAMs). Fish Audio’s release of S2-Pro, the flagship model in the Fish Speech ecosystem, marks a transition toward open architectures capable of high-fidelity, multi-speaker generation with sub-150ms latency. The release provides a framework that combines implicit voice cloning with granular emotion control via a Dual-Autoregressive (Dual-AR) architecture.

Architecture: Dual-AR Framework and RVQ

The core technical distinction of the Fish Audio S2-Pro is its Dual-AR design. Traditional TTS models often struggle with the trade-off between sequence length and acoustic detail. The S2-Pro addresses this by dividing generation into two specialized stages: a ‘Slow AR’ model and a ‘Fast AR’ model.

  1. Slow AR Model (4B Parameters): This part works on the time axis. It is responsible for processing language input and generating semantic tokens. Using a large number of parameters (about 4 billion), the Slow AR model captures long-range dependencies, prosody, and structural nuances of speech.
  2. Fast AR Model (400M Parameters): This component handles the acoustic dimension. It predicts the residual codebooks for each semantic token. This compact, fast model ensures that the high-frequency details of the sound (timbre, breathiness, and texture) are rendered efficiently.
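The two-loop decomposition above can be sketched as follows. This is an illustrative toy, not Fish Audio’s code: the stand-in “models” are arithmetic stubs, and the codebook depth of 4 is an assumption made for the sketch.

```python
NUM_CODEBOOKS = 4  # assumed RVQ depth; purely illustrative

def slow_ar(text_tokens):
    """Stand-in for the 4B 'Slow AR' model: one semantic token per frame."""
    return [(t * 31 + 7) % 1024 for t in text_tokens]

def fast_ar(semantic_token):
    """Stand-in for the 400M 'Fast AR' model: one residual code per codebook."""
    return [(semantic_token * 131 + level * 17) % 1024
            for level in range(NUM_CODEBOOKS)]

def generate(text_tokens):
    frames = []
    for sem in slow_ar(text_tokens):   # slow loop advances the time axis
        frames.append(fast_ar(sem))    # fast loop fills the codebook depth
    return frames                      # shape: [num_frames][NUM_CODEBOOKS]

frames = generate([3, 7, 11])
```

The point of the split is visible in the loop structure: the large model runs once per frame, while the small model does the per-frame heavy lifting across codebook levels.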

This architecture relies on Residual Vector Quantization (RVQ). In this setup, raw audio is compressed into discrete tokens across multiple layers (codebooks). The first layer captures the dominant acoustic features, while each subsequent layer captures the ‘residual’, the error left over from the layer before it. This allows the model to reproduce high-fidelity 44.1kHz audio while keeping the token count manageable for the Transformer backbone.
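A minimal scalar version of this layered quantization makes the idea concrete (real RVQ quantizes vectors with learned codebooks; the codebook values below are invented for illustration):

```python
# Each layer's codebook only needs to cover the shrinking residual range.
CODEBOOKS = [
    [-1.0, 0.0, 1.0],   # layer 1: coarse approximation
    [-0.3, 0.0, 0.3],   # layer 2: refines layer 1's residual
    [-0.1, 0.0, 0.1],   # layer 3: refines layer 2's residual
]

def rvq_encode(x):
    codes, residual = [], x
    for book in CODEBOOKS:
        idx = min(range(len(book)), key=lambda i: abs(book[i] - residual))
        codes.append(idx)
        residual -= book[idx]   # the next layer sees only what is left over
    return codes

def rvq_decode(codes):
    return sum(book[i] for book, i in zip(CODEBOOKS, codes))

codes = rvq_encode(0.75)      # one index per layer
approx = rvq_decode(codes)    # reconstruction error shrinks with each layer
```

Stacking a few small codebooks this way approximates the signal far more precisely than any single codebook of the same total size, which is the trade-off that keeps the token count manageable.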

Controlling Emotions by Learning In-Context and Intrinsic Markers

The Fish Audio S2-Pro achieves what the developers describe as intuitive, controllable emotion through two main mechanisms: implicit in-context learning and inline natural-language control tags.

In-Context Learning (ICL):

Unlike older generations of TTS that required fine-tuning to mimic a specific voice, S2-Pro exploits the Transformer’s ability to learn in context. Given a reference audio clip, ideally between 10 and 30 seconds, the model extracts the speaker’s identity and emotional state. The model treats this reference as a prefix in its context window, allowing it to continue the “sequence” in the same voice and style.
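Conceptually, the reference clip becomes a prefix of the token sequence the model then continues. The sketch below shows only the prompt layout; the helper name and token IDs are invented for illustration and are not Fish Audio’s API.

```python
def build_icl_prompt(ref_text_tokens, ref_audio_tokens, target_text_tokens):
    """Lay out an in-context prompt: reference transcript, then its acoustic
    tokens, then the new text. The model continues the sequence from here,
    inheriting the reference speaker's timbre and emotional state."""
    return ref_text_tokens + ref_audio_tokens + target_text_tokens

# Invented token IDs, purely to show the ordering of the three segments:
prompt = build_icl_prompt([11, 12], [901, 902, 903], [21, 22, 23])
```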

Inline Control Tags:

The model supports dynamic emotional change within a single generation pass. Because the model was trained on data containing descriptive language tags, developers can insert natural-language tags directly into the input text. For example:

[whisper] I have a secret [laugh] that I cannot tell you.

The model interprets these tags as instructions to modify acoustic tokens in real time, adjusting pitch, intensity, and rhythm without requiring a separate neural input or external control vector.
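Tags in this format are also easy to pre-process on the client side, for validation or logging, before the text reaches the model. A small parser that splits the example above into (style, text) segments (the tag vocabulary is whatever the model was trained on; `whisper` and `laugh` come from the example):

```python
import re

TAG_RE = re.compile(r"\[(\w+)\]")

def parse_inline_tags(prompt, default_style="neutral"):
    """Split a tagged prompt into (style, text) segments; each tag applies
    to the text that follows it, until the next tag."""
    segments, style, pos = [], default_style, 0
    for m in TAG_RE.finditer(prompt):
        text = prompt[pos:m.start()].strip()
        if text:
            segments.append((style, text))
        style = m.group(1)
        pos = m.end()
    tail = prompt[pos:].strip()
    if tail:
        segments.append((style, tail))
    return segments

segments = parse_inline_tags("[whisper] I have a secret [laugh] that I cannot tell you.")
# → [("whisper", "I have a secret"), ("laugh", "that I cannot tell you.")]
```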

SGLang Performance and Integration Benchmarks

To integrate TTS into real-time applications, the main limitation is ‘Time to First Audio’ (TTFA). Fish Audio S2-Pro is optimized for sub-150ms latency, with benchmarks on NVIDIA H200 hardware reaching around 100ms.

Several technical settings affect this performance:

  • SGLang and RadixAttention: S2-Pro is designed to work with SGLang, a high-throughput serving framework. SGLang uses RadixAttention, which enables efficient management of the Key-Value (KV) cache. In a production environment where the same base voice prompt (reference clip) is reused across requests, RadixAttention caches the prefix’s KV states. This eliminates the need to recompute the reference audio for every request, greatly reducing prefill time.
  • Multi-Speaker Single-Pass Generation: The architecture allows multiple speakers to exist within the same context window. This enables complex dialogues or multi-character narrations to be produced in a single pass, avoiding the extra latency of swapping models or reloading different speaker weights.
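The caching behavior described in the first bullet can be illustrated with a toy prefix cache. Real RadixAttention organizes shared prefixes in a radix tree inside SGLang’s runtime; this dictionary-keyed stand-in only demonstrates the reuse principle.

```python
class PrefixKVCache:
    """Toy stand-in for prefix KV reuse (the idea behind RadixAttention)."""

    def __init__(self):
        self._cache = {}          # prefix tokens -> fake "KV states"
        self.prefill_calls = 0    # counts the expensive computations

    def prefill(self, prefix_tokens):
        key = tuple(prefix_tokens)
        if key not in self._cache:
            self.prefill_calls += 1                  # expensive path
            self._cache[key] = [t * 2 for t in key]  # pretend KV tensors
        return self._cache[key]                      # cheap path: reuse

cache = PrefixKVCache()
voice_prompt = [5, 6, 7]          # tokens of the shared reference clip
for _ in range(3):                # three requests using the same voice
    kv = cache.prefill(voice_prompt)
```

The shared reference clip costs one prefill no matter how many requests reuse it, which is where the Time-to-First-Audio savings come from.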

Implementation and Data Scale

The Fish Speech repository provides a Python-based implementation using PyTorch. The model was trained on a diverse dataset consisting of over 300,000 hours of multilingual audio. This scale is what allows the robust performance of the model across different languages and its ability to handle ‘non-speech’ expressions such as sighs or hesitations.

The training pipeline includes:

  1. VQ-GAN training: Training the quantizer to map raw audio into a discrete latent space.
  2. LLM training: Training the Dual-AR Transformer to predict those discrete tokens from text and acoustic context.

The VQ-GAN used in the S2-Pro is specifically tuned to reduce artifacts during the decoding process, ensuring that even at high compression ratios, the reconstructed audio remains ‘transparent’ (indistinguishable from the source to the human ear).

Key Takeaways

  • Dual-AR Architecture (Slow/Fast): Unlike single-stage models, the S2-Pro splits duties between a 4B-parameter ‘Slow AR’ model (semantics and prosody) and a 400M-parameter ‘Fast AR’ model (acoustic refinement), improving both detail and speed.
  • Sub-150ms Latency: Designed for real-time conversational AI, the model achieves a Time-to-First-Audio (TTFA) of ~100ms on high-end hardware, making it suitable for live agents and interactive applications.
  • Hierarchical RVQ Coding: Using Residual Vector Quantization, the system compresses 44.1kHz audio into discrete tokens across multiple layers. This allows the model to reconstruct complex vocal textures, including breaths and sighs, without the token bloat of modeling raw waveforms.
  • In-Context Voice Cloning: Developers can clone a voice and its emotional state by providing a 10–30 second reference clip. The model treats this as a prefix, adopting the speaker’s timbre and prosody without requiring fine-tuning.
  • RadixAttention & SGLang Integration: Designed for production, S2-Pro leverages RadixAttention’s KV caching for voice prompts. This allows near-instant generation when the same speaker is reused repeatedly, greatly reducing prefill overhead.

Check out the Model Card and the Repo.

