Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX On-Device Deployment Support

admin 3 hours ago

Liquid AI has released LFM2.5-230M, the company’s smallest model so far. The release targets a specific job: running agentic tasks on phones, robots, and automation devices. Both the base and instruction-tuned checkpoints are available as open weights on Hugging Face.

The scope is intentionally narrow. This is not a general-purpose reasoning model. It is designed for data extraction and tool use on edge hardware.

The TL;DR

Liquid AI’s LFM2.5-230M is its smallest model yet: 230M parameters, open weights, built on the LFM2 architecture.
It runs at 213 tok/s on the Galaxy S25 Ultra and 42 tok/s on the Raspberry Pi 5.
It beats larger models (Qwen3.5-0.8B, Gemma 3 1B) on instruction following and data extraction.
Optimized for tool use and extraction; not for math, coding, or creative writing.
Day-one support for llama.cpp, MLX, vLLM, SGLang, and ONNX, with a memory footprint of 293–375 MB.

What is LFM2.5-230M?

LFM2.5-230M is a 230-million-parameter, text-only model. It is built on the LFM2 architecture. The model has 14 layers in total. Eight are double-gated LIV convolution blocks. The remaining six are grouped-query attention (GQA) blocks. The hybrid architecture is aimed at fast CPU inference.

The context length is 32,768 tokens. The vocabulary size is 65,536. The knowledge cutoff is mid-2024. It supports ten languages, including English, Chinese, Arabic, and Japanese.

The Liquid AI team is releasing two checkpoints. LFM2.5-230M-Base is the pre-trained model, intended for fine-tuning. LFM2.5-230M is the instruction-tuned version. The license is lfm1.0.

Training and Post-Training

The model was pre-trained on 19 trillion tokens. That figure includes the 32K context-extension stage. The post-training recipe then goes through three stages.

First comes supervised fine-tuning with distillation from the larger LFM2.5-350M. The second is direct preference optimization (DPO). The third is multi-domain reinforcement learning. The recipe is meant to keep the model flexible for downstream specialization.

The distillation step is what keeps the 230M model competitive with larger baselines. It inherits behavior from the larger LFM2.5-350M on the target tasks.

Benchmarks

The Liquid AI team evaluated LFM2.5-230M on ten benchmarks. They cover knowledge, instruction following, data extraction, and tool use.

The instruction-following results stand out. On IFEval, LFM2.5-230M scores 71.71. That beats Qwen3.5-0.8B (59.94) and Gemma 3 1B IT (63.49). On IFBench it scores 38.40, ahead of both. On CaseReportBench, a clinical data extraction test, it scores 22.51.

Model	Parameters	IFEval	IFBench	CaseReportBench	BFCLv4	MMLU-Pro
LFM2.5-230M	230M	71.71	38.40	22.51	21.03	20.25
LFM2.5-350M	350M	76.96	40.69	32.45	21.86	20.01
Granite 4.0-H-350M	350M	61.27	17.22	12.44	13.28	13.14
Qwen3.5-0.8B (Instruct)	800M	59.94	22.87	13.83	18.70	37.42
Gemma 3 1B IT	1B	63.49	20.33	2.28	7.17	14.04

Against the larger third-party models, LFM2.5-230M leads on instruction following and data extraction. It trails on general knowledge: its MMLU-Pro score is 20.25, behind Qwen3.5-0.8B’s 37.42. It is also weak on agentic tool use. On τ²-Bench Telecom it scores only 5.26.

Liquid AI is straightforward about the limitations. It does not recommend the model for reasoning-heavy workloads. That means advanced math, code generation, and creative writing.
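The comparison can be checked directly against the table. The following is a minimal sketch with the scores copied from the table above; the helper name `rivals_ahead` is illustrative, not part of any Liquid AI tooling.

```python
# Scores copied from the benchmark table in this article.
BENCHMARKS = ["IFEval", "IFBench", "CaseReportBench", "BFCLv4", "MMLU-Pro"]
SCORES = {
    "LFM2.5-230M": [71.71, 38.40, 22.51, 21.03, 20.25],
    "LFM2.5-350M": [76.96, 40.69, 32.45, 21.86, 20.01],
    "Granite 4.0-H-350M": [61.27, 17.22, 12.44, 13.28, 13.14],
    "Qwen3.5-0.8B": [59.94, 22.87, 13.83, 18.70, 37.42],
    "Gemma 3 1B IT": [63.49, 20.33, 2.28, 7.17, 14.04],
}


def rivals_ahead(model, exclude=()):
    """Map each benchmark to the models that outscore `model` on it."""
    result = {}
    for i, bench in enumerate(BENCHMARKS):
        mine = SCORES[model][i]
        result[bench] = [
            name
            for name, scores in SCORES.items()
            if name != model and name not in exclude and scores[i] > mine
        ]
    return result


# Leave out the larger sibling so only third-party models are compared.
ahead = rivals_ahead("LFM2.5-230M", exclude=("LFM2.5-350M",))
for bench, names in ahead.items():
    print(f"{bench}: {', '.join(names) if names else 'none ahead'}")
```

With the sibling LFM2.5-350M excluded, the only benchmark in the table where another model is ahead is MMLU-Pro.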

Use Cases with Examples

The model fits two kinds of workloads well.

The first is large data extraction pipelines. Imagine a pipeline that sorts 100,000 clinical reports into structured fields. A 4-bit quantized build with a 293–375 MB memory footprint runs on commodity CPUs. You run inference locally, without a per-token API bill.

The second is lightweight agent workloads on the device. Imagine a home automation hub that turns speech into tool calls. Or a phone assistant that routes a request to the right function.

As an early signal, Liquid AI demonstrated the model on a Unitree G1 humanoid robot. It ran entirely on the robot’s NVIDIA Jetson Orin. There the model served as the skill-selection layer. It turned a single natural-language command into a sequence of tool calls. Those calls invoked low-level skills from NVIDIA’s SONIC framework.
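In a pipeline like this, each model response usually needs a validation step before it is stored. The sketch below assumes the model was prompted to answer with a JSON object; the field names in `REQUIRED_FIELDS` and the function `parse_extraction` are hypothetical and only illustrate the idea.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"patient_age": int, "diagnosis": str, "medications": list}


def parse_extraction(raw_output):
    """Return the validated record, or None if the output is unusable."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(record, dict):
        return None  # valid JSON, but not an object
    for field, expected_type in REQUIRED_FIELDS.items():
        # A missing field yields None, which fails the type check.
        if not isinstance(record.get(field), expected_type):
            return None
    return record


good = '{"patient_age": 54, "diagnosis": "pneumonia", "medications": ["amoxicillin"]}'
bad = '{"patient_age": "fifty-four", "diagnosis": "pneumonia"}'
print(parse_extraction(good))
print(parse_extraction(bad))
```

Records that come back as `None` can be retried or routed to manual review instead of silently entering the dataset.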

LFM2.5 supports function calling in four steps. You define tools as JSON in the system prompt. The model writes a Pythonic function call between special tokens. You execute the call and return the result. The model then writes a plain-text response.

By default the call is a Python list. It sits between the <|tool_call_start|> and <|tool_call_end|> tokens. Here’s the pattern as written, with an abbreviated JSON tool definition:

<|im_start|>system
List of tools: [{"name": "get_candidate_status",
  "parameters": {"candidate_id": {"type": "string"}}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>

You can also request JSON-formatted calls through the system prompt.
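On the application side, the default Pythonic format has to be parsed before the call can be executed. The following is a minimal sketch, assuming keyword-only arguments with literal values as in the example above. `parse_tool_calls`, the `TOOLS` registry, and the stand-in `get_candidate_status` implementation are hypothetical, not part of Liquid AI’s tooling. It uses Python’s `ast` module so the model output is never passed to `eval`.

```python
import ast

START, END = "<|tool_call_start|>", "<|tool_call_end|>"


def parse_tool_calls(text):
    """Extract (name, kwargs) pairs from the first tool-call span in text."""
    begin = text.find(START)
    if begin == -1:
        return []
    end = text.find(END, begin)
    if end == -1:
        return []
    payload = text[begin + len(START):end].strip()
    tree = ast.parse(payload, mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a list of calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected a plain function call")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are supported in this sketch")
        # literal_eval accepts only literals, so no code is executed here.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls


def get_candidate_status(candidate_id):
    """Hypothetical stand-in for a real tool implementation."""
    return {"candidate_id": candidate_id, "status": "interview scheduled"}


TOOLS = {"get_candidate_status": get_candidate_status}

reply = (
    '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]'
    "<|tool_call_end|>Checking the current status of candidate ID 12345."
)
for name, kwargs in parse_tool_calls(reply):
    print(name, TOOLS[name](**kwargs))
```

Looking names up in an explicit registry means the model can only trigger functions the application has chosen to expose.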

Running it: A Small Example

The model works with Transformers 5.0.0 and above. The recommended generation settings are temperature 0.1, top_k 50, and repetition_penalty 1.05. Note the do_sample=True flag, which is required for those sampling settings to take effect.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-230M"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is C. elegans?"}],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Liquid AI also publishes fine-tuning recipes. They cover SFT, DPO, and GRPO with LoRA, Unsloth, and TRL. Each ships as a Colab notebook.