Miso Labs Releases MisoTTS: An 8B Open-Weights Text-to-Speech Model

Miso Labs has released MisoTTS, an open-weights 8-billion-parameter text-to-speech model. It produces expressive speech conditioned on both text and audio context. The model uses residual vector quantization (RVQ) to scale its audio vocabulary. This avoids growing a single flat vocabulary and keeps the parameter count constant.

What is MisoTTS

MisoTTS is an 8B-parameter text-to-dialogue RVQ Transformer. It is inspired by the architecture of Sesame CSM. It pairs a Llama 3.2-style backbone with a small audio decoder. It generates Mimi audio codes from text and optional audio context. The model conditions on both text and prior audio. That second input allows it to respond to the speaker's tone.

The text vocabulary is 128,256 tokens, and there are 32 audio codebooks. Mimi is the audio tokenizer, and the maximum sequence length is 2,048. Inference defaults to torch.bfloat16.

Miso Labs claims a latency of 110 ms. The company puts ElevenLabs at 700 ms and Sesame at 300 ms.

Vocabulary size problem

Standard transformers generate from a fixed vocabulary of discrete tokens. That works when a small vocabulary covers the target domain. Human speech does not fit that assumption. It varies in tone, rhythm, emphasis, emotion, and pronunciation.

Expanding the audio vocabulary is the obvious fix. But larger vocabularies require more parameters in a conventional transformer. Each token must be represented and predicted by the model. Miso Labs calls this the vocabulary size problem.
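To make the parameter cost concrete, here is a rough back-of-the-envelope sketch. It assumes one input embedding row and one output-head row per token with untied tables, and a hidden size of 4096. That hidden size and the helper name `flat_vocab_params` are illustrative assumptions, not published MisoTTS figures.

```python
def flat_vocab_params(vocab_size, hidden_size):
    """Parameters spent on one input embedding row and one output-head row
    per token, assuming the two tables are not tied."""
    return 2 * vocab_size * hidden_size


HIDDEN_SIZE = 4096  # assumed for illustration; not a published MisoTTS figure

# A flat vocabulary pays for every token separately, so cost grows linearly
# with vocabulary size and quickly dwarfs the rest of the network.
for vocab_size in (2_048, 128_256, 2_048 ** 2, 2_048 ** 3):
    params = flat_vocab_params(vocab_size, HIDDEN_SIZE)
    print(f"{vocab_size:>14,} tokens -> {params / 1e9:,.3f}B vocabulary parameters")
```

Under these assumptions, a flat vocabulary the size of just three combined 2048-way codebooks already costs far more parameters than the whole 8B model.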

The second problem is conditioning. Most TTS models condition on text only. They ignore the tone of voice of the person speaking. Miso Labs argues that this contributes to the “uncanny valley” effect.

Residual Vector Quantization: The Core Idea

MisoTTS addresses both problems with residual vector quantization (RVQ). Miso Labs traces RVQ to image generation research and to Sesame’s CSM for audio. Instead of a single token index, the model outputs a vector of indices.

Each audio token consists of 32 codebook indices, each drawn from a 2048-way codebook. The model maintains a separate codebook for each position in the vector. To reconstruct the audio, it sums the looked-up vectors. Each codebook adds some refinement to the signal.
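The sum-of-codebooks reconstruction can be illustrated with a toy residual quantizer. This is a minimal sketch, not Mimi's actual codec: the codebooks are random, the sizes are tiny (4 codebooks of 8 entries rather than 32 of 2048), and the names `make_codebooks`, `rvq_encode` and `rvq_decode` are made up for illustration.

```python
import random


def make_codebooks(depth, size, dim, seed=0):
    """Random toy codebooks: `depth` levels, each with `size` vectors of length `dim`.
    Later levels are scaled down so they act as finer corrections."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) / (2 ** level) for _ in range(dim)] for _ in range(size)]
        for level in range(depth)
    ]


def nearest(codebook, target):
    """Index of the codebook vector closest to `target` (squared Euclidean distance)."""
    def dist(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, target))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x, codebooks):
    """Quantize x level by level; each level encodes the residual left by earlier levels."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected vector from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


codebooks = make_codebooks(depth=4, size=8, dim=3)
x = [0.5, -1.2, 0.3]
codes = rvq_encode(x, codebooks)      # one index per codebook
x_hat = rvq_decode(codes, codebooks)  # sum of the looked-up vectors
```

Here `codes` plays the role of one audio token: a short vector of indices rather than a single vocabulary entry.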

This is what makes scaling work. The addressable vocabulary is the codebook size raised to the power of the depth. Increasing the depth does not add parameters to the model. So MisoTTS reaches about 2048³², or roughly 10¹⁰⁵, addressable tokens. Miso Labs notes that naive scaling would require a far larger network.
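The arithmetic behind that figure is easy to check. The sketch below uses only the two numbers quoted in the article (32 codebooks, 2048 entries each); the variable names are arbitrary.

```python
import math

CODEBOOK_SIZE = 2048  # entries per codebook (from the article)
DEPTH = 32            # codebooks per audio token (from the article)

# Distinct index vectors: one choice out of 2048 at each of 32 positions.
addressable = CODEBOOK_SIZE ** DEPTH

# Codebook rows the model actually has to store grow only linearly with depth.
stored_rows = CODEBOOK_SIZE * DEPTH

print(f"addressable tokens: {addressable:.2e}")
print(f"log10(addressable): {math.log10(addressable):.2f}")
print(f"stored codebook rows: {stored_rows:,}")
```

The exponent lands between 105 and 106, which matches the order of magnitude the article quotes, while the number of stored rows stays in the tens of thousands.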

Two-Transformer Architecture

The model is split into a backbone and a decoder. The backbone is a 7.7B-parameter transformer, autoregressive over time. It predicts the first codebook index and a final hidden state.

The 300M-parameter decoder then works autoregressively over depth. It predicts the remaining codebook indices, one position at a time. Each prediction is conditioned on the indices already selected within the frame. The same 300M parameters are reused at all positions.
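The control flow of this two-level generation can be sketched without any neural network. In the sketch below, `backbone_step` and `decoder_step` are hypothetical stand-ins that return pseudo-random values; only the loop structure (one backbone call per frame, then one decoder call per remaining codebook position) reflects the description above.

```python
import random

DEPTH = 32            # codebooks per audio frame (from the article)
CODEBOOK_SIZE = 2048  # entries per codebook (from the article)


def backbone_step(history):
    """Stand-in for the 7.7B backbone: returns a hidden state and the first index.
    A real model would run a transformer over `history`; this stub is seeded by its length."""
    rng = random.Random(len(history))
    hidden = [rng.random() for _ in range(4)]
    return hidden, rng.randrange(CODEBOOK_SIZE)


def decoder_step(hidden, chosen):
    """Stand-in for the 300M decoder: picks the next index given the hidden state
    and the indices already chosen in this frame. One function serves every depth."""
    rng = random.Random(hash((tuple(hidden), tuple(chosen))))
    return rng.randrange(CODEBOOK_SIZE)


def generate_frame(history):
    """Produce one audio frame: k1 from the backbone, k2..k32 from the decoder."""
    hidden, k1 = backbone_step(history)  # autoregressive over time
    frame = [k1]
    while len(frame) < DEPTH:            # autoregressive over depth
        frame.append(decoder_step(hidden, frame))
    return frame


def generate(num_frames):
    """Generate frames one after another, feeding each back as history."""
    history = []
    for _ in range(num_frames):
        history.append(generate_frame(history))
    return history


frames = generate(num_frames=3)
```

Each entry of `frames` is a 32-index vector, produced by one call to the large model and 31 calls to the small one.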

Embedding follows the same logic. Text tokens use a single lookup. An audio token's embedding is the sum of the codebook lookups, one per position. Interleaving text and audio allows the backbone to use the history of the conversation. That is how it picks up on the speaker's tone.
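A small sketch of that embedding scheme follows. The table sizes, the embedding width, and the names `embed_text` and `embed_audio` are illustrative assumptions; only the idea of summing one lookup per codebook position comes from the description above.

```python
import random

DEPTH, CODEBOOK_SIZE, DIM = 4, 16, 8  # toy sizes; the article's are 32 and 2048
TEXT_VOCAB = 32                       # toy size; the article's is 128,256
rng = random.Random(0)

# One embedding table per codebook position, plus a single table for text tokens.
audio_tables = [
    [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for _ in range(DEPTH)
]
text_table = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(TEXT_VOCAB)]


def embed_text(token_id):
    """Text tokens need a single table lookup."""
    return text_table[token_id]


def embed_audio(frame):
    """An audio frame is embedded as the sum of one lookup per codebook position."""
    out = [0.0] * DIM
    for position, index in enumerate(frame):
        row = audio_tables[position][index]
        out = [o + r for o, r in zip(out, row)]
    return out


# An interleaved sequence: two text tokens followed by one audio frame.
sequence = [embed_text(5), embed_text(9), embed_audio([3, 0, 15, 7])]
```

Because every element of `sequence` has the same width, text and audio can share one input stream to the backbone.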

Strengths and Challenges

Strengths:

  • Open weights on day one, under a modified MIT license.
  • RVQ scales the audio vocabulary without scaling the parameter count.
  • Conditions on audio context, not text alone.
  • Local deployment keeps sensitive audio data in-house.
  • Architecture and math are documented in a public blog post.

Challenges:

  • Half-duplex only, no turn-taking yet.
  • The large model requires a capable CUDA GPU.
  • API access has been announced but is not yet available.
  • Latency and quality claims still require third-party testing.

Marktechpost Visual Explainer

Marktechpost · Model Briefly

Open-Weights Release · June 3, 2026

MisoTTS

The 8B text-to-speech model from Miso Labs, built on residual vector quantization and conditioned on both text and audio.

8B parameters
RVQ Transformer
Mimi codes
Modified MIT

What is MisoTTS

Text-to-dialogue RVQ Transformer

  • An 8B-parameter model inspired by the Sesame CSM architecture.
  • Pairs a Llama 3.2-style backbone with a small audio decoder.
  • Generates Mimi audio codes from text and optional audio context.
  • Conditions on prior audio, so the output responds to the speaker's tone.

At a Glance

Published specifications

  • Parameters: 8B (7.7B + 300M)
  • Architecture: RVQ Transformer
  • Audio codebooks: 32 (2048-way)
  • Default precision: torch.bfloat16

Motivation

Vocabulary size problem

  • Transformers generate from a fixed vocabulary of discrete tokens.
  • Speech varies in pitch, rhythm, emphasis, emotion, and intonation.
  • A larger audio vocabulary requires more parameters in a standard transformer.
  • Most TTS models condition on text only and ignore tone, which feeds the “uncanny valley” effect.

The Core Idea

Residual vector quantization

  • The model outputs a vector of indices, not a single token index.
  • Each token is 32 codebook indices over 2048-way codebooks.
  • Summing the looked-up vectors reconstructs the audio.
  • Depth scales the addressable vocabulary to ~2048³² (≈10¹⁰⁵) without additional parameters.

Architecture

Two transformers, one vector token

  • Backbone (7.7B): autoregressive over time; predicts the first codebook index k₁ and the hidden state h₀.
  • Decoder (300M): autoregressive over depth; predicts k₂ through k₃₂.
  • The same 300M parameters are reused at every position.
  • Interleaved text and audio lets the backbone use the history of the conversation.

Run It Locally

Inference in a few lines

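from generator import load_miso_8b
import torchaudio

# Load the model onto the GPU; weights are fetched from Hugging Face.
gen = load_miso_8b(device="cuda",
    model_path_or_repo_id="MisoLabs/MisoTTS")

# Synthesize up to 10 seconds of speech with no prior conversation context.
audio = gen.generate(
    text="Hello from Miso.",
    speaker=0, context=[],
    max_audio_length_ms=10_000)

# torchaudio.save expects a (channels, samples) tensor on the CPU.
torchaudio.save("miso.wav",
    audio.unsqueeze(0).cpu(), gen.sample_rate)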

Setup uses uv with Python 3.10. Weights download from Hugging Face. Audio is automatically watermarked with SilentCipher. One-shot voice cloning works from a 10-second clip.

Limitations

Where it stops, for now

  • Handles single turns only; no turn-taking yet.
  • Generates half-duplex audio: it cannot speak while the other party is speaking.
  • Miso Labs frames full-duplex and turn-taking as future work.
  • API access is announced but not yet available.

Key Takeaways

Short version

  • Open-weights 8B TTS under a modified MIT license.
  • Conditions on text and audio, so the output tracks the tone of the speaker.
  • RVQ scales the vocabulary to ~2048³² without adding parameters.
  • 7.7B backbone over time, 300M decoder over depth.
  • Half-duplex and single-turn today; API access is pending.

Key Takeaways

  • Miso Labs open-sources MisoTTS, an 8B text-to-speech model, under a modified MIT license.
  • It conditions on both text and audio context, so generations respond to the speaker’s tone.
  • Residual vector quantization (32 × 2048-way codebooks) scales the vocabulary to ~2048³² without adding parameters.
  • The architecture splits into a 7.7B backbone (over time) and a 300M decoder (over depth).
  • It is half-duplex and single-turn only today; API access is pending.
