
Meet ‘Kani-TTS-2’: A 400M-Param Open-Source Text-to-Speech Model That Runs on 3GB of VRAM with Voice Cloning Support





The economics of speech synthesis are shifting toward efficiency. A new open-source contender, Kani-TTS-2, has been released by the team at nineninesix.ai. The model marks a departure from heavy, expensive TTS systems: instead, it treats audio like a language, delivering high-fidelity speech synthesis in a remarkably small footprint.

Kani-TTS-2 offers a leaner, more efficient alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions.

Architecture: LFM2 and NanoCodec

Kani-TTS-2 follows an ‘audio-as-language’ philosophy. Rather than using a traditional mel-spectrogram pipeline, the model converts raw audio into discrete tokens with a neural codec.

The pipeline relies on a two-stage process:

  1. Language Core: The model is built on LiquidAI’s LFM2 (350M) backbone. This core autoregressively predicts the next audio tokens. Because LFMs (Liquid Foundation Models) are designed for efficiency, they provide a fast alternative to conventional transformers.
  2. Neural Codec: NVIDIA’s NanoCodec then converts those tokens into 22kHz waveforms.

Using this structure, the model captures human-like prosody—the rhythm and pitch of speech—without the ‘robotic’ artifacts found in older TTS systems.
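
Conceptually, the two-stage generation loop can be sketched in a few lines of Python. This is a minimal illustration only: the names used here (synthesize, lm, codec, tokenizer) are assumed placeholders, not the actual kani-tts API.

```python
# Minimal sketch of the two-stage "audio-as-language" pipeline.
# All object names (lm, codec, tokenizer) are illustrative placeholders.
import torch

def synthesize(text: str, lm, codec, tokenizer,
               max_audio_tokens: int = 1500) -> torch.Tensor:
    # Stage 1: the LFM2-based language core autoregressively predicts
    # a sequence of discrete audio tokens conditioned on the input text.
    text_ids = tokenizer(text, return_tensors="pt").input_ids
    audio_tokens = lm.generate(text_ids, max_new_tokens=max_audio_tokens)

    # Stage 2: the neural codec (NVIDIA NanoCodec in Kani-TTS-2)
    # decodes the token sequence into a raw 22kHz waveform.
    return codec.decode(audio_tokens)  # shape: (1, num_samples)
```

The appeal of this split is that the language model only manipulates compact token sequences, while waveform reconstruction is delegated to the small codec.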

Efficiency: 10,000 hours in 6 hours

The training metrics behind Kani-TTS-2 are a masterclass in efficiency. The English model was trained on 10,000 hours of high-quality speech data.

While that scale is impressive, the training speed is the real story. The research team trained the model in just 6 hours on a cluster of 8 NVIDIA H100 GPUs. This shows that large datasets no longer require weeks of compute time when paired with efficient architectures such as LFM2.

Zero-Shot Voice Cloning and Performance

The standout feature for developers is zero-shot voice cloning. Unlike traditional models that require fine-tuning for each new voice, Kani-TTS-2 relies on speaker embeddings.

  • How it works: You provide a short reference audio clip.
  • The result: The model extracts the unique characteristics of that voice and applies them to the generated text on the fly, as sketched below.
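
As a rough illustration of that flow, assuming hypothetical speaker_encoder and tts components (the real kani-tts interface may differ), cloning reduces to computing one conditioning vector:

```python
# Hypothetical sketch of zero-shot cloning via speaker embeddings;
# speaker_encoder and tts are assumed stand-ins, not the real API.
import torchaudio

def clone_and_speak(text: str, ref_path: str, speaker_encoder, tts):
    # Extract a fixed-size speaker embedding from a short reference clip.
    ref_wav, sample_rate = torchaudio.load(ref_path)
    spk_emb = speaker_encoder(ref_wav, sample_rate)

    # Condition generation on the embedding: no fine-tuning pass is
    # needed, so a new voice can be applied on the fly.
    return tts.generate(text=text, speaker_embedding=spk_emb)
```

Because the speaker identity lives in a conditioning vector rather than in the model weights, adding a voice costs one forward pass instead of a training run.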

From a deployment standpoint, the model is very accessible:

  • Parameter count: 400M (0.4B) parameters.
  • Speed: It achieves a Real-Time Factor (RTF) of 0.2, meaning it can produce 10 seconds of speech in about 2 seconds (see the timing sketch after this list).
  • Hardware: It needs only 3GB of VRAM, making it compatible with consumer-grade GPUs like the RTX 3060 or 4050.
  • License: Released under the Apache 2.0 license, which allows commercial use.
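
Verifying that RTF figure on your own hardware is simple arithmetic: RTF = generation time ÷ audio duration. Here is a small timing sketch, reusing the hypothetical synthesize helper from the architecture section:

```python
import time

# RTF = wall-clock generation time / duration of the generated audio.
# RTF < 1.0 is faster than real time; 0.2 means 10 s of speech in ~2 s.
start = time.perf_counter()
waveform = synthesize("A benchmark sentence for timing.", lm, codec, tokenizer)
elapsed = time.perf_counter() - start

audio_seconds = waveform.shape[-1] / 22_050  # 22kHz output (assumed 22,050 Hz)
print(f"RTF = {elapsed / audio_seconds:.2f}")
```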

Key Takeaways

  • Efficient Architecture: The model uses a 400M parameter backbone based on LiquidAI’s LFM2 (350M). The ‘audio-as-language’ approach treats speech as discrete tokens, allowing faster processing and more human-like pronunciation than traditional architectures.
  • Fast Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours using 8 NVIDIA H100 GPUs.
  • Instant Zero-Shot Cloning: There is no need to fine-tune the model for a specific voice. Given a short reference audio clip, the model uses speaker embeddings to render text in the target speaker’s voice on the fly.
  • High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model can produce 10 seconds of speech in about 2 seconds. It needs only 3GB of VRAM, which makes it fully functional on consumer-grade GPUs like the RTX 3060.
  • Developer-Friendly License: Released under the Apache 2.0 license, Kani-TTS-2 is ready for commercial integration. It provides a lean, low-latency alternative to expensive closed-source TTS APIs.

Check out the model weights on Hugging Face.


Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.





