Meet ‘Kani-TTS-2’: A 400M-Parameter Open-Source Text-to-Speech Model That Runs on 3GB of VRAM with Zero-Shot Voice Cloning Support

The conversation in text-to-speech (TTS) is shifting from raw audio quality to efficiency. A new open-source contender, Kani-TTS-2, has been released by the team at nineninesix.ai. The model marks a departure from heavy, expensive TTS systems: instead of treating audio as a signal-processing problem, it treats sound like a language, delivering high-fidelity speech synthesis in a remarkably small footprint.
Kani-TTS-2 offers a leaner, more efficient alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions.
Architecture: LFM2 and NanoCodec
Kani-TTS-2 follows a ‘sound-as-language’ philosophy. The model does not use a traditional mel-spectrogram pipeline; instead, it converts raw audio into discrete tokens using a neural codec.
The system relies on a two-stage process:
- Language Core: The model is built on LiquidAI’s LFM2 (350M) backbone. This core generates speech by predicting the next audio tokens. Because LFMs (Liquid Foundation Models) are designed for efficiency, they provide a fast alternative to conventional transformers.
- Neural Codec: NVIDIA’s NanoCodec then decodes those tokens into 22kHz waveforms.
Using this structure, the model captures human-like prosody—the rhythm and pitch of speech—without the ‘robotic’ artifacts found in older TTS systems.
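To make the two-stage flow concrete, here is a minimal, self-contained sketch of a token-based TTS pipeline. Every function name, vocabulary size, and token rate below is an illustrative placeholder, not the actual Kani-TTS-2 API:

```python
# Minimal sketch of the two-stage 'sound-as-language' pipeline.
# All names, the codec vocabulary of 4096, and the ~50 tokens/sec
# rate are assumptions for illustration, not Kani-TTS-2's real API.

from typing import List

def lm_generate_audio_tokens(text: str, max_tokens: int = 50) -> List[int]:
    """Stage 1 (stand-in for the LFM2 backbone): autoregressively
    predict discrete audio-codec tokens conditioned on the text."""
    tokens: List[int] = []
    for step in range(max_tokens):
        # A real model samples the next token from a learned distribution;
        # here we fake it deterministically for illustration.
        tokens.append(hash((text, step)) % 4096)
    return tokens

def codec_decode(tokens: List[int], sample_rate: int = 22050) -> List[float]:
    """Stage 2 (stand-in for NVIDIA NanoCodec): decode discrete tokens
    back into a 22kHz waveform."""
    samples_per_token = sample_rate // 50  # assume ~50 tokens per second
    waveform: List[float] = []
    for t in tokens:
        waveform.extend([(t % 100) / 100.0] * samples_per_token)
    return waveform

audio_tokens = lm_generate_audio_tokens("Hello from a token-based TTS pipeline.")
waveform = codec_decode(audio_tokens)
print(f"{len(audio_tokens)} tokens -> {len(waveform)} samples "
      f"(~{len(waveform) / 22050:.1f}s of audio)")
```

Treating speech as just another token stream is what lets the language-model machinery, such as autoregressive decoding and batching, carry over directly to audio.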
Efficiency: 10,000 hours in 6 hours
Kani-TTS-2’s training metrics are a master class in efficiency. The English model was trained on 10,000 hours of high-quality speech data.
While that scale is impressive, the training speed is the real headline: the team trained the model in just 6 hours using 8 NVIDIA H100 GPUs. This shows that large datasets no longer require weeks of computation when paired with efficient architectures such as LFM2.
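For perspective, a quick back-of-envelope calculation from the reported figures:

```python
# Compute budget implied by the reported training run.
wall_clock_hours = 6
num_gpus = 8
dataset_hours = 10_000

gpu_hours = wall_clock_hours * num_gpus           # 48 H100 GPU-hours in total
speech_per_gpu_hour = dataset_hours / gpu_hours   # ~208 hours of speech per GPU-hour
print(gpu_hours, round(speech_per_gpu_hour))      # -> 48 208
```

That is roughly 48 H100 GPU-hours for the entire English run, a budget within reach of small teams renting cloud GPUs.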
Zero-Shot Voice Cloning and Performance
The standout feature for developers is zero-shot voice cloning. Unlike traditional models that require fine-tuning for each new voice, Kani-TTS-2 uses speaker embeddings.
- How it works: You provide a short reference audio clip.
- The result: The model extracts the unique characteristics of that voice and applies them to the generated speech on the fly, as the sketch after this list illustrates.
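Here is a minimal sketch of that flow. The speaker encoder, the embedding size of 128, and the synthesis call are all hypothetical stand-ins, not the real Kani-TTS-2 interface:

```python
# Illustrative zero-shot cloning flow; extract_speaker_embedding and
# synthesize are hypothetical placeholders, not Kani-TTS-2's real API.

from typing import List

def extract_speaker_embedding(reference_wav: List[float]) -> List[float]:
    # Stand-in: a real speaker encoder compresses the clip into a fixed
    # vector capturing timbre and style (the size of 128 is an assumption).
    mean = sum(reference_wav) / max(len(reference_wav), 1)
    return [mean] * 128

def synthesize(text: str, speaker_embedding: List[float]) -> str:
    # Stand-in: the real model conditions audio-token generation on the
    # speaker embedding; here we just report what would happen.
    return f"<speech of {text!r} in voice {speaker_embedding[0]:.3f}...>"

reference_clip = [0.01, 0.02, -0.01, 0.03]  # a few seconds of reference audio
embedding = extract_speaker_embedding(reference_clip)
print(synthesize("Cloned without any fine-tuning.", embedding))
```

The key point is that no gradient update happens anywhere: cloning a voice is a single forward pass, not a training job.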
From a deployment standpoint, the model is very accessible:
- Parameter count: 400M (0.4B) parameters.
- Speed: It achieves a Real-Time Factor (RTF) of 0.2, meaning it can produce 10 seconds of speech in about 2 seconds (see the quick calculation after this list).
- Hardware: It needs only 3GB of VRAM, making it compatible with consumer-grade GPUs like the RTX 3060 or RTX 4050.
- License: Released under the Apache 2.0 license, which allows commercial use.
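Real-Time Factor is simply synthesis time divided by the duration of the generated audio, so any RTF below 1.0 means faster-than-real-time generation. A quick check against the reported number:

```python
# RTF = time to generate / duration of generated audio.
RTF = 0.2  # reported for Kani-TTS-2

def generation_time(audio_seconds: float, rtf: float = RTF) -> float:
    """Estimated wall-clock seconds to synthesize a clip of given length."""
    return audio_seconds * rtf

for duration in (1, 10, 60):
    print(f"{duration:>3}s of speech -> ~{generation_time(duration):.1f}s to generate")
# 10s of speech -> ~2.0s, matching the article's figure
```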
Key Takeaways
- Efficient Architecture: The model uses a 400M-parameter backbone based on LiquidAI’s LFM2 (350M). This ‘sound-as-language’ approach treats speech as discrete tokens, allowing faster processing and more human-like pronunciation than traditional architectures.
- Fast Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours using 8 NVIDIA H100 GPUs.
- Instant Zero-Shot Cloning: There is no need to fine-tune the model for each new voice. Given a short reference audio clip, the model uses speaker embeddings to instantly render text in the target speaker’s voice.
- High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model can produce 10 seconds of audio in about 2 seconds. It needs only 3GB of VRAM, which makes it fully functional on consumer-grade GPUs like the RTX 3060.
- Developer-Friendly License: Released under the Apache License 2.0, Kani-TTS-2 is ready for commercial integration, providing a lean, low-latency alternative to expensive closed-source TTS APIs.
Check out the model weights on Hugging Face.
Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.