Liquid AI’s New LFM2-24B-A2B Hybrid Architecture Combines Attention and Convolution to Solve Scaling Bottlenecks for Today’s LLMs

The race for generative AI has long been a game of ‘bigger is better.’ But as the industry runs into power-consumption limits and memory constraints, the conversation is shifting from raw parameter counts to architectural efficiency. The Liquid AI team is leading the charge with the release of LFM2-24B-A2B, a 24-billion-parameter model that redefines what we should expect from cutting-edge AI.

‘A2B’ Architecture: A 1:3 Efficiency Ratio
The ‘A2B’ in the model name stands for Attention-to-Base. In a traditional Transformer, every layer uses softmax attention, which scales quadratically (O(N²)) with sequence length. This produces a large KV (Key-Value) cache that eats up VRAM.
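To see why this matters, here is a rough back-of-the-envelope KV-cache calculation for a dense-attention model. The layer count, head dimensions, and dtype below are illustrative assumptions for the sketch, not LFM2’s published configuration:

```python
# Rough KV-cache estimate for an all-attention model (illustrative numbers).
n_layers   = 40        # hypothetical layer count
n_kv_heads = 8         # hypothetical number of KV heads
head_dim   = 128       # hypothetical head dimension
seq_len    = 32_768    # full context window in tokens
bytes_fp16 = 2         # bytes per value in FP16

# Factor of 2 accounts for storing both keys and values.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_fp16
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # -> 5.0 GiB here
```

Under these assumptions, a hybrid design that keeps attention in only 10 of 40 layers would shrink that cache by roughly 4x, since only the attention layers need to store keys and values.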
The Liquid AI team sidesteps this with a hybrid architecture: the ‘Base’ layers are gated short-convolution blocks, while the ‘Attention’ layers use Grouped Query Attention (GQA). A minimal sketch of such a gated convolution block follows.
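The article does not spell out the exact LFM2 convolution operator, so the kernel size, gating scheme, and projections below are illustrative assumptions only:

```python
import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    """Hypothetical gated short-convolution block (a sketch; the exact
    LFM2 operator is not specified in this article)."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)  # produces value + gate
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              groups=dim, padding=kernel_size - 1)  # depthwise
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (batch, seq, dim)
        v, g = self.in_proj(x).chunk(2, dim=-1)   # split into value and gate
        v = v * torch.sigmoid(g)                  # multiplicative input gating
        # Trim to the original length so the short conv stays causal.
        v = self.conv(v.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(v)
```

Because the convolution kernel is short and depthwise, each token’s cost is constant in sequence length, which is where the linear-scaling claim comes from.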
In the LFM2-24B-A2B configuration, the model uses a 1:3 ratio (one plausible layer schedule is sketched after this list):
- Total Layers: 40
- Convolution Blocks: 30
- Attention Blocks: 10
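The article gives only the totals, not the interleaving pattern, so the schedule below is just one plausible arrangement of that 1:3 ratio:

```python
# Hypothetical 40-layer schedule with a 1:3 attention-to-conv ratio.
# The exact interleaving is an assumption; the article only states the
# totals (30 convolution blocks, 10 attention blocks).
layers = []
for i in range(40):
    # One GQA layer after every three gated-conv layers.
    layers.append("GQA" if (i + 1) % 4 == 0 else "GatedConv")

assert layers.count("GQA") == 10 and layers.count("GatedConv") == 30
```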
By combining a small number of GQA blocks with dozens of gated convolution layers, the model retains the high-fidelity recall and in-context abilities of a Transformer while keeping the fast prefill and low memory footprint of a linear-complexity model.
Sparse MoE: 24B Intelligence on a 2B Budget
The defining feature of LFM2-24B-A2B is its sparse Mixture of Experts (MoE) design. While the model contains 24 billion parameters, it activates only 2.3 billion parameters per token.
This is a game changer for deployment. Because the active parameter path is so small, the model can fit within 32GB of RAM. This means it can run locally on high-end consumer laptops, desktops with integrated GPUs (iGPUs), and dedicated NPUs without requiring a data-center-grade A100. It effectively offers the information density of a 24B model with the computational speed and power efficiency of a 2B model.
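As an illustration of how sparse activation works, here is a minimal top-k routing sketch in PyTorch. The expert count, hidden sizes, and k are illustrative assumptions; the article only gives the total (24B) and active (2.3B) parameter figures:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal top-k MoE routing sketch. Expert count, hidden size, and k
    are illustrative assumptions, not LFM2-24B-A2B's actual config."""
    def __init__(self, dim: int, n_experts: int = 32, k: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                   # x: (tokens, dim)
        scores, idx = self.router(x).topk(self.k, dim=-1)   # k experts/token
        weights = F.softmax(scores, dim=-1)
        out = torch.zeros_like(x)
        # Only k of n_experts run per token, so compute (and energy) scales
        # with active parameters, not total parameters.
        for slot in range(self.k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out
```

With 4 of 32 experts firing per token, only about an eighth of the expert weights are touched on each step, which mirrors (in spirit) the roughly 2.3B-of-24B active ratio.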


Benchmarks: Punching Above Its Weight
The Liquid AI team reports that the LFM2 family follows a predictable, log-linear scaling behavior. Despite its small effective parameter count, the 24B-A2B model consistently outperforms major competitors.
- Logic and Reasoning: On benchmarks like GSM8K and MATH-500, it competes with compact models twice its size.
- Throughput: When benchmarked on a single NVIDIA H100 using vLLM, it reached 26.8K output tokens per second with 1,024 concurrent requests, far surpassing OpenAI’s gpt-oss-20b and Qwen3-30B-A3B (a hedged serving sketch follows this list).
- Long Context: The model ships with a 32k-token context window, optimized for privacy-sensitive RAG (Retrieval-Augmented Generation) pipelines and local document analysis.
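For readers who want to reproduce a throughput test, here is a minimal sketch using vLLM’s offline API. The repo id "LiquidAI/LFM2-24B-A2B" is assumed from the model’s name and has not been verified against the official model card:

```python
# Hedged sketch of offline serving with vLLM; the model id is an assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2-24B-A2B")  # assumed Hugging Face repo id
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize the trade-offs of hybrid conv-attention models."], params
)
print(outputs[0].outputs[0].text)
```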
Technical Cheat Sheet
| Property | Specification |
|---|---|
| Total Parameters | 24 billion |
| Active Parameters | 2.3 billion |
| Architecture | Hybrid (Gated Conv + GQA) |
| Layers | 40 (30 Base / 10 Attention) |
| Context Length | 32,768 tokens |
| Training Data | 17 trillion tokens |
| License | LFM Open License v1.0 |
| Native Support | llama.cpp, vLLM, SGLang, MLX |
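Given the listed runtime support, local inference via Hugging Face Transformers should follow the usual pattern. The repo id below is an assumption based on the model name; consult the official model card for the exact identifier and any extra requirements:

```python
# Hedged local-inference sketch; the model id is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-24B-A2B"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tok("Explain Grouped Query Attention in one sentence.",
             return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```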
Key Takeaways
- Hybrid ‘A2B’ Architecture: The model uses a 1:3 ratio of Grouped Query Attention (GQA) blocks to gated short convolutions. With gated-convolution ‘Base’ layers making up 30 of its 40 layers, the model achieves fast prefill and decode speeds with a significantly reduced memory footprint compared to conventional attention-only Transformers.
- MoE Efficiency: Despite having 24 billion total parameters, the model activates only 2.3 billion parameters per token. This sparse Mixture of Experts design lets it deliver the conceptual depth of a large model while maintaining the computational latency and power efficiency of a 2B-parameter model.
- True Edge Capability: Developed with hardware-in-the-loop architecture search, the model is designed to fit within 32GB of RAM. This makes it fully usable on consumer-grade hardware, including laptops with integrated GPUs and NPUs, without requiring expensive data center infrastructure.
- State-of-the-Art Throughput: LFM2-24B-A2B outperforms major competitors such as Qwen3-30B-A3B and OpenAI’s gpt-oss-20b in throughput. Benchmarks show it reaching approximately 26.8K tokens per second on a single H100, with near-linear scaling and high efficiency on long-context tasks up to its 32k-token window.
Check out the technical details and model weights.

