
Meet ‘Kani-TTS-2’: A 400M-Param Open-Source Text-to-Speech Model That Runs on 3GB of VRAM with Voice Cloning Support





The economics of speech synthesis are shifting toward efficiency. A new open-source contender, Kani-TTS-2, has been released by the team at nineninesix.ai. The model marks a departure from heavy, expensive TTS systems: instead, it treats audio like a language, delivering high-fidelity speech synthesis in a remarkably small footprint.

Kani-TTS-2 offers a leaner, more efficient alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions.

Architecture: LFM2 and NanoCodec

Kani-TTS-2 follows an ‘audio-as-language’ philosophy. Rather than using a traditional mel-spectrogram pipeline, the model converts raw audio into discrete tokens with a neural codec.

The pipeline relies on a two-stage process:

  1. Language Core: The model is built on LiquidAI’s LFM2 (350M) backbone. This core autoregressively predicts the next audio tokens. Because LFMs (Liquid Foundation Models) are designed for efficiency, they provide a fast alternative to conventional transformers.
  2. Neural Codec: NVIDIA’s NanoCodec then converts those tokens into 22kHz waveforms.

Using this structure, the model captures human-like prosody—the rhythm and pitch of speech—without the ‘robotic’ artifacts found in older TTS systems.
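
Conceptually, the two-stage generation loop can be sketched in a few lines of Python. This is a minimal illustration only: the names used here (synthesize, lm, codec, tokenizer) are assumed placeholders, not the actual kani-tts API.

```python
# Minimal sketch of the two-stage "audio-as-language" pipeline.
# All object names (lm, codec, tokenizer) are illustrative placeholders.
import torch

def synthesize(text: str, lm, codec, tokenizer,
               max_audio_tokens: int = 1500) -> torch.Tensor:
    # Stage 1: the LFM2-based language core autoregressively predicts
    # a sequence of discrete audio tokens conditioned on the input text.
    text_ids = tokenizer(text, return_tensors="pt").input_ids
    audio_tokens = lm.generate(text_ids, max_new_tokens=max_audio_tokens)

    # Stage 2: the neural codec (NVIDIA NanoCodec in Kani-TTS-2)
    # decodes the token sequence into a raw 22kHz waveform.
    return codec.decode(audio_tokens)  # shape: (1, num_samples)
```

The appeal of this split is that the language model only manipulates compact token sequences, while waveform reconstruction is delegated to the small codec.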

Efficiency: 10,000 hours in 6 hours

The training metrics behind Kani-TTS-2 are a masterclass in efficiency. The English model was trained on 10,000 hours of high-quality speech data.

While that scale is impressive, the training speed is the real story. The research team trained the model in just 6 hours on a cluster of 8 NVIDIA H100 GPUs. This shows that large datasets no longer require weeks of compute time when paired with efficient architectures such as LFM2.

Zero-Shot Voice Cloning and Performance

The standout feature for developers is zero-shot voice cloning. Unlike traditional models that require fine-tuning for each new voice, Kani-TTS-2 relies on speaker embeddings.

  • How it works: You provide a short reference audio clip.
  • The result: The model extracts the unique characteristics of that voice and applies them to the generated text on the fly, as sketched below.
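
As a rough illustration of that flow, assuming hypothetical speaker_encoder and tts components (the real kani-tts interface may differ), cloning reduces to computing one conditioning vector:

```python
# Hypothetical sketch of zero-shot cloning via speaker embeddings;
# speaker_encoder and tts are assumed stand-ins, not the real API.
import torchaudio

def clone_and_speak(text: str, ref_path: str, speaker_encoder, tts):
    # Extract a fixed-size speaker embedding from a short reference clip.
    ref_wav, sample_rate = torchaudio.load(ref_path)
    spk_emb = speaker_encoder(ref_wav, sample_rate)

    # Condition generation on the embedding: no fine-tuning pass is
    # needed, so a new voice can be applied on the fly.
    return tts.generate(text=text, speaker_embedding=spk_emb)
```

Because the speaker identity lives in a conditioning vector rather than in the model weights, adding a voice costs one forward pass instead of a training run.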

From a deployment standpoint, the model is very accessible:

  • Parameter count: 400M (0.4B) parameters.
  • Speed: It achieves a Real-Time Factor (RTF) of 0.2, meaning it can produce 10 seconds of speech in about 2 seconds (see the timing sketch after this list).
  • Hardware: It needs only 3GB of VRAM, making it compatible with consumer-grade GPUs like the RTX 3060 or 4050.
  • License: Released under the Apache 2.0 license, which allows commercial use.
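
Verifying that RTF figure on your own hardware is simple arithmetic: RTF = generation time ÷ audio duration. Here is a small timing sketch, reusing the hypothetical synthesize helper from the architecture section:

```python
import time

# RTF = wall-clock generation time / duration of the generated audio.
# RTF < 1.0 is faster than real time; 0.2 means 10 s of speech in ~2 s.
start = time.perf_counter()
waveform = synthesize("A benchmark sentence for timing.", lm, codec, tokenizer)
elapsed = time.perf_counter() - start

audio_seconds = waveform.shape[-1] / 22_050  # 22kHz output (assumed 22,050 Hz)
print(f"RTF = {elapsed / audio_seconds:.2f}")
```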

Key Takeaways

  • Efficient Architecture: The model uses a 400M parameter backbone based on LiquidAI’s LFM2 (350M). The ‘audio-as-language’ approach treats speech as discrete tokens, allowing faster processing and more human-like pronunciation than traditional architectures.
  • Fast Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours using 8 NVIDIA H100 GPUs.
  • Instant Zero-Shot Cloning: There is no need to fine-tune the model for a specific voice. Given a short reference audio clip, the model uses speaker embeddings to render text in the target speaker’s voice on the fly.
  • High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model can produce 10 seconds of speech in about 2 seconds. It needs only 3GB of VRAM, which makes it fully functional on consumer-grade GPUs like the RTX 3060.
  • Developer-Friendly License: Released under the Apache 2.0 license, Kani-TTS-2 is ready for commercial integration. It provides a lean, low-latency alternative to expensive closed-source TTS APIs.

Check out the model weights on Hugging Face.


Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.





