Zyphra Releases Zamba2-VL: Hybrid Mamba2–Transformer Vision-Language Models Cut Time-To-First-Token By About An Order Of Magnitude

Zyphra has released Zamba2-VL, a family of open-source vision-language models. The release includes three sizes: 1.2B, 2.7B, and 7B parameters. Each model is built on a Zamba2 hybrid SSM-Transformer backbone.
Vision-language models (VLMs) process images and text together. They answer questions about charts, documents, and pictures. Most open-source VLMs use a dense Transformer as the language model. Zamba2-VL replaces that with a hybrid state-space design. The goal is competitive accuracy at lower latency.
What is Zamba2-VL
Zamba2-VL follows the now-standard LLaVA-style VLM template. A pre-trained vision encoder transforms image patches into features. A lightweight MLP adapter projects those features into the language model's embedding space. The language model then processes the combined sequence of visual and text tokens. The models support single-image and multi-image understanding, as well as grounding.
Zyphra pairs each Zamba2 backbone with the Vision Transformer from Qwen2.5-VL. That encoder was chosen for two specific reasons: it uses 2D rotary position embeddings, and it supports dynamic-resolution processing. A two-layer MLP adapter connects the encoder to the backbone.
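To make the adapter step concrete, here is a minimal pure-Python sketch of a two-layer MLP that maps per-patch vision features into a language-model embedding width. The dimensions, the random weights, and the GELU activation are illustrative assumptions; the source only states that the adapter has two layers.

```python
import math
import random


def gelu(x: float) -> float:
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def make_linear(rng, in_dim, out_dim):
    # Returns (weight, bias); weight has out_dim rows of in_dim values each.
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(vec, weight, bias):
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def mlp_adapter(patch_features, layer1, layer2):
    # Project every patch feature vector independently: linear -> GELU -> linear.
    projected = []
    for feature in patch_features:
        hidden = [gelu(h) for h in linear(feature, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


rng = random.Random(0)
VISION_DIM, HIDDEN_DIM, LM_DIM = 8, 16, 12  # toy sizes, not the real model widths
layer1 = make_linear(rng, VISION_DIM, HIDDEN_DIM)
layer2 = make_linear(rng, HIDDEN_DIM, LM_DIM)

patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(5)]
visual_tokens = mlp_adapter(patches, layer1, layer2)
print(len(visual_tokens), len(visual_tokens[0]))  # 5 12
```

Each of the five patch vectors comes out with the language model's width, so the result can sit in the same sequence as text token embeddings.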

Architecture
The Zamba2 backbone is where the design differs from standard VLMs. It is a mix of Mamba2 state-space layers and shared Transformer blocks. Mamba2 layers run in linear time with a fixed-size state. A small number of shared attention layers are interleaved between them. Each shared block holds a separate LoRA adapter for each layer where it is applied.
Mamba2 layers handle most of the computation cheaply. Shared attention layers preserve the in-context retrieval that pure-SSM models struggle with. The hybrid trades the expressiveness of full attention against the efficiency of state-space layers.
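The interleaving described above can be sketched as a simple layer layout. The layer counts, the spacing between attention calls, and the round-robin reuse of shared blocks are hypothetical values chosen for illustration; they are not Zamba2's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    kind: str          # "mamba2" or "shared_attention"
    shared_block: int  # which shared block is reused here (-1 for Mamba2 layers)
    lora_id: int       # LoRA adapter private to this call site (-1 for Mamba2 layers)


def build_layout(num_mamba_layers, attention_every, num_shared_blocks):
    """Interleave shared attention call sites between Mamba2 layers (hypothetical pattern)."""
    layout = []
    call_site = 0
    for i in range(num_mamba_layers):
        if i > 0 and i % attention_every == 0:
            # The block weights are shared, but each call site gets its own LoRA adapter.
            layout.append(LayerSpec("shared_attention", call_site % num_shared_blocks, call_site))
            call_site += 1
        layout.append(LayerSpec("mamba2", -1, -1))
    return layout


layout = build_layout(num_mamba_layers=12, attention_every=3, num_shared_blocks=2)
attention_sites = [layer for layer in layout if layer.kind == "shared_attention"]
for site in attention_sites:
    print(f"shared block {site.shared_block} with LoRA adapter {site.lora_id}")
```

In this toy layout, shared block 0 is invoked twice but with different LoRA adapters, which is the idea behind sharing attention weights while still letting each position specialize.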
Zamba2-VL uses the Mistral v0.1 tokenizer. It was trained on 100B tokens of vision-text and text-only data. That data is drawn from open web datasets.


Model Quality and Benchmarks
The research team tested Zamba2-VL across 14 benchmarks. These span chart, diagram, and document understanding. They also include general perception, reasoning, and visual counting. All scores come from Zyphra's evaluation harness, which is based on VLMEvalKit. The report compares against the Molmo2, Qwen3-VL, and InternVL3.5 families.
| Eval | Zamba2-VL-2.7B | InternVL3.5-2B | Qwen3-VL-2B | Molmo2-4B | Qwen3-VL-4B |
|---|---|---|---|---|---|
| DocVQA (test) | 90.9 | 89.4 | 93.3 | 87.8 | 95.3 |
| ChartQA (test) | 79.6 | 81.6 | 78.7 | 86.1 | 81.8 |
| OCRBench | 73.6 | 83.4 | 84.1 | 62.0 | 84.1 |
| CountBenchQA | 87.5 | 70.0 | 87.9 | 91.2 | 87.3 |
| PixMoCount (test) | 82.5 | 32.8 | 55.7 | 87.0 | 89.2 |
| MMMU (val) | 37.7 | 49.9 | 40.9 | 48.8 | 51.4 |
| MathVista (mini) | 51.0 | 61.4 | 51.8 | 56.5 | 63.6 |
InternVL3.5-2B and Qwen3-VL-2B are similar in size. Molmo2-4B and Qwen3-VL-4B are larger.
The pattern is uneven and worth understanding. Counting is the strongest category. Zyphra reports Zamba2-VL-1.2B at 62.5 on PixMoCount, compared with 32.8 for InternVL3.5-1B and 17.7 for PerceptionLM-1B. Document understanding is also solid: DocVQA is at 90.9 for the 2.7B model. The model lags larger models on reasoning-heavy benchmarks such as MMMU and MathVista.
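One way to read the table is to compute the margin between Zamba2-VL-2.7B and the better of the two similarly sized baselines on each benchmark. The sketch below uses only the scores from the table above; the helper name is made up for illustration.

```python
ZAMBA = "Zamba2-VL-2.7B"
SIMILAR_SIZE = ("InternVL3.5-2B", "Qwen3-VL-2B")

# Scores copied from the table above (similar-size columns only).
SCORES = {
    "DocVQA":       {ZAMBA: 90.9, "InternVL3.5-2B": 89.4, "Qwen3-VL-2B": 93.3},
    "ChartQA":      {ZAMBA: 79.6, "InternVL3.5-2B": 81.6, "Qwen3-VL-2B": 78.7},
    "OCRBench":     {ZAMBA: 73.6, "InternVL3.5-2B": 83.4, "Qwen3-VL-2B": 84.1},
    "CountBenchQA": {ZAMBA: 87.5, "InternVL3.5-2B": 70.0, "Qwen3-VL-2B": 87.9},
    "PixMoCount":   {ZAMBA: 82.5, "InternVL3.5-2B": 32.8, "Qwen3-VL-2B": 55.7},
    "MMMU":         {ZAMBA: 37.7, "InternVL3.5-2B": 49.9, "Qwen3-VL-2B": 40.9},
    "MathVista":    {ZAMBA: 51.0, "InternVL3.5-2B": 61.4, "Qwen3-VL-2B": 51.8},
}


def margin_vs_best_peer(row):
    """Zamba2-VL-2.7B score minus the best similar-size baseline score."""
    best_peer = max(row[name] for name in SIMILAR_SIZE)
    return round(row[ZAMBA] - best_peer, 1)


margins = {bench: margin_vs_best_peer(row) for bench, row in SCORES.items()}
for bench, margin in sorted(margins.items(), key=lambda item: item[1], reverse=True):
    print(f"{bench:<13} {margin:+.1f}")
```

On this slice of the table, PixMoCount is the one benchmark where the 2.7B model beats both similarly sized baselines (by 26.8 points), CountBenchQA is close to a tie, and the widest gaps are on MMMU, OCRBench, and MathVista.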
Why Inference is Fast
Inference is where Zamba2-VL shows its biggest advantage. Transformer attention cost scales quadratically with sequence length. Multimodal input makes sequences long very quickly. A single high-resolution image can add several thousand visual tokens. A short video clip can generate tens of thousands of tokens.
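A rough estimate shows how quickly images turn into tokens. The quickstart code in this article expresses its pixel bounds as a token budget multiplied by 28 * 28, so the sketch below assumes one visual token per 28x28 pixel area and clamps to that budget. This is an approximation for intuition, not the processor's exact resizing logic.

```python
PIXELS_PER_TOKEN = 28 * 28  # assumption taken from the max_pixels arithmetic in the quickstart


def estimate_visual_tokens(width, height, max_tokens=3400, min_tokens=10):
    """Approximate visual token count for one image, clamped to the token budget."""
    raw = round(width * height / PIXELS_PER_TOKEN)
    return max(min_tokens, min(max_tokens, raw))


for width, height in [(224, 224), (1920, 1080), (4000, 3000)]:
    print(f"{width}x{height}: ~{estimate_visual_tokens(width, height)} visual tokens")
```

Under these assumptions, a 1080p screenshot already costs a few thousand tokens, and anything much larger is capped by the `max_pixels` budget.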
In its Mamba2 layers, Zamba2-VL avoids the growing KV cache that attention requires. It inherits near-linear-time prefill and a fixed-size recurrent state. At a 32k-token prefill, it leads the frontier of benchmark score versus time-to-first-token (TTFT). No Transformer VLM in the comparison matched its score at the same latency. The latency gap is about an order of magnitude.
The efficiency gains are larger at the 1.2B and 2.7B scales. That is the range aimed at on-device and edge use.
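The scaling argument can be checked with back-of-the-envelope arithmetic. The sketch below compares how a KV cache and attention prefill cost grow with sequence length against a recurrent state whose size has no dependence on it. The layer counts, head sizes, and state size are hypothetical and do not describe any model in the comparison.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for every attention layer (linear in seq_len)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def attention_prefill_flops(seq_len, num_layers, num_heads, head_dim):
    """Approximate FLOPs for QK^T plus the weighted sum over V (quadratic in seq_len)."""
    return 4 * num_layers * num_heads * head_dim * seq_len ** 2


def recurrent_state_bytes(num_layers, state_values_per_layer, bytes_per_value=2):
    """Bytes for a fixed-size recurrent state; there is deliberately no seq_len argument."""
    return num_layers * state_values_per_layer * bytes_per_value


# Hypothetical dense Transformer configuration in bf16 (2 bytes per value).
DENSE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for n in (4_096, 32_768):
    gib = kv_cache_bytes(n, **DENSE) / 1024 ** 3
    print(f"{n:>6} tokens: KV cache ~ {gib:.1f} GiB")
```

For this made-up configuration the cache grows from 0.5 GiB at 4k tokens to 4.0 GiB at 32k tokens, while the attention prefill cost grows 64 times over the same range. A fixed-size state stays constant, which is why the gap widens as visual inputs get longer.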
Use Cases with examples
The practical question is where this fits. Document and form extraction benefits from the strong DocVQA results. Consider invoice parsing or digitizing receipts at scale. Retail and inventory counting maps to the strength on PixMoCount and CountBenchQA. Grounding support allows pointing to objects in product or UI images. On-device assistants benefit from the low time-to-first-token. The 1.2B model targets phones and edge boxes. Long visual inputs, such as multi-page PDFs, benefit most from the near-linear prefill time.
Getting started
All three models live in Zyphra's Zamba2-VL collection on Hugging Face. Inference runs through Zyphra's transformers fork, which is based on transformers v4.57.1. The optimized Mamba2 kernels require a CUDA GPU for good latency.
Install the fork and its main dependencies:

    pip install "transformers @ git+
    pip install qwen-vl-utils==0.0.2
    pip install flash_attn

The optimized Mamba2 kernels require two additional packages:

    pip install --no-build-isolation "causal-conv1d @ git+
    pip install --no-build-isolation "mamba-ssm @ git+

Then load the model and run a single image query:

    from transformers import Zamba2_VLForConditionalGeneration, Zamba2_VLProcessor
    import torch
    from PIL import Image
    from qwen_vl_utils import process_vision_info
    import requests

    device = "cuda"

    # Load the processor and the model weights from the Hugging Face Hub.
    processor = Zamba2_VLProcessor.from_pretrained("Zyphra/Zamba2-VL-2.7B", temporal_patch_size=1)
    model = Zamba2_VLForConditionalGeneration.from_pretrained(
        "Zyphra/Zamba2-VL-2.7B",
        device_map=device,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    url = "
    image = Image.open(requests.get(url, stream=True).raw)
    question = "What do you see in the image? Give us some detail."

    # The pixel bounds are expressed as a visual token budget times 28 * 28 pixels.
    num_img_tokens = 3400
    conversation = [
        {"role": "user", "content": [
            {"type": "image", "image": image,
             "max_pixels": num_img_tokens * 28 * 28, "min_pixels": 10 * 28 * 28},
            {"type": "text", "text": question},
        ]},
    ]

    # Render the chat template, collect the images, and build the model inputs.
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    images, _ = process_vision_info(conversation)
    inputs = processor(text=prompt, images=images, add_special_tokens=True, return_tensors="pt")
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Generate, then decode only the tokens that follow the prompt.
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(processor.tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Change the model ID to Zamba2-VL-1.2B or Zamba2-VL-7B to change the scale.
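Since the models are described as supporting multi-image input, the message format from the quickstart can be wrapped in a small helper. This is a sketch under an assumption: that the processor accepts several image entries in a single user turn, as Qwen-style message formats commonly do. The source does not show a multi-image call, and `build_conversation` is a hypothetical helper, not part of the released code.

```python
def build_conversation(images, question, num_img_tokens=3400, min_img_tokens=10):
    """Build one user turn holding several images followed by a text question."""
    content = [
        {
            "type": "image",
            "image": image,
            # Same per-image pixel bounds as the single-image quickstart.
            "max_pixels": num_img_tokens * 28 * 28,
            "min_pixels": min_img_tokens * 28 * 28,
        }
        for image in images
    ]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# Placeholder strings stand in for PIL images so the sketch runs without a model.
conversation = build_conversation(["page_1", "page_2"], "Compare the two pages.")
print(len(conversation[0]["content"]))  # 3
```

The token budget applies per image, so the total visual token count scales with the number of images passed in.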
Strengths and Weaknesses
Strengths:
- The first fully open VLM family built on a hybrid SSM-Transformer LLM, according to Zyphra.
- Time-to-first-token about an order of magnitude lower than comparable Transformer baselines.
- Strong visual counting and competitive document understanding.
- Three sizes cover edge, mid-range, and 7B-class deployments.
- Apache 2.0 license with public weights and working reference code.
Weaknesses and challenges:
- Released as a research artifact.
- Lags behind larger models on reasoning-heavy benchmarks such as MMMU and MathVista.
- OCRBench score is lower than similarly sized Qwen3-VL and InternVL3.5 models.
- Optimized kernels require a CUDA GPU; CPU paths are slow.
- Deployment must be self-hosted from the released code.
Key Takeaways
- Zamba2-VL ships in 1.2B, 2.7B, and 7B parameter sizes under Apache 2.0.
- The backbone pairs Mamba2 state-space layers with a few shared Transformer blocks.
- Time-to-first-token is about an order of magnitude lower than comparable Transformer VLMs.
- Document understanding is a strength; reasoning-heavy tasks lag.
- The weights and working reference code are public on Hugging Face and GitHub.




