Z.AI Introduces GLM-5.1: A 754B-Parameter Open-Weights Agentic Model That Achieves SOTA on SWE-Bench Pro and Supports 8-Hour Autonomous Operation

Z.AI, the AI platform developed by the team behind the GLM model family, has released GLM-5.1 – its next-generation flagship model designed specifically for agentic engineering. Unlike models optimized for pure single-turn benchmarks, GLM-5.1 is built for agentic tasks, with stronger coding ability than its predecessors: it achieves state-of-the-art performance on SWE-Bench Pro while outperforming GLM-5 by a wide margin on NL2Repo (repository generation) and real-world terminal tasks (Terminal-Bench 2.0).
Architecture: DSA, MoE, and Asynchronous RL
Before getting into what GLM-5.1 can do, it’s worth understanding what it’s built on – because the architecture differs substantially from a standard dense transformer.
GLM-5.1 uses DSA to significantly reduce training and inference costs while maintaining long-context fidelity. The model uses a glm_moe_dsa architecture – a Mixture-of-Experts (MoE) design combined with DSA. For AI developers running their own inference, this matters: MoE models only activate a subset of their parameters on each forward pass, which can make inference more efficient than a comparably sized dense model, although it requires some infrastructure provisioning.
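The sparse-activation idea behind MoE can be sketched in a few lines. The routing scheme below (top-k gating over a small expert pool) is a generic illustration, not the actual glm_moe_dsa layer, whose internals are not described in this article:

```python
# Toy sketch of Mixture-of-Experts top-k routing (illustrative only; the
# real glm_moe_dsa layer's routing and expert shapes are not public here).
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route a token vector to its top-k experts and mix their outputs.

    x       : (d,) token hidden state
    gate_w  : (d, n_experts) router weights
    experts : list of (d, d) weight matrices, one per expert
    k       : number of experts activated per token
    """
    logits = x @ gate_w                       # router scores, shape (n_experts,)
    top = np.argsort(logits)[-k:]             # indices of the k highest-scoring experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                      # softmax over the selected experts only
    # Only k experts run a forward pass; the rest of the parameters stay idle.
    return sum(p * (x @ experts[i]) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
d, n = 8, 4
out = moe_forward(rng.normal(size=d),
                  rng.normal(size=(d, n)),
                  [rng.normal(size=(d, d)) for _ in range(n)],
                  k=2)
print(out.shape)  # (8,)
```

With k=2 of 4 experts active, only half the expert parameters participate in each token's forward pass – which is why a 754B MoE model can serve far more cheaply than a 754B dense one.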
On the training side, GLM-5.1 uses a new asynchronous reinforcement learning infrastructure that greatly improves post-training efficiency by decoupling generation from training. Novel asynchronous agentic RL algorithms also improve RL quality, enabling the model to learn complex, long-horizon interactions more effectively. This is what allows the model to handle agentic tasks with the kind of stable judgment that is difficult to produce with synchronous, single-turn RL training.
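The core of "decoupling generation from training" is a producer-consumer pattern: rollout workers keep generating trajectories while the trainer consumes them, so neither side blocks on the other. A minimal single-machine sketch (the real infrastructure distributes this across many workers and handles policy staleness, which this toy ignores):

```python
# Minimal sketch of decoupled rollout generation and training via a queue,
# in the spirit of the asynchronous RL setup described above. Illustrative
# only: a real system runs many distributed actors and corrects for staleness.
import queue
import threading

rollouts = queue.Queue(maxsize=8)   # bounded buffer between actors and trainer
N_ROLLOUTS = 20
trained = []

def actor():
    # Rollouts are produced continuously, never waiting for the trainer
    # to finish a gradient step.
    for i in range(N_ROLLOUTS):
        rollouts.put({"episode": i, "reward": i % 3})
    rollouts.put(None)               # sentinel: generation finished

def trainer():
    while True:
        batch = rollouts.get()
        if batch is None:
            break
        trained.append(batch["reward"])   # stand-in for a gradient update

t1, t2 = threading.Thread(target=actor), threading.Thread(target=trainer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(trained))  # 20
```

In a synchronous setup, generation and training alternate and each phase idles while the other runs; the queue removes that serialization, which is where the claimed post-training efficiency gain comes from.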
The Plateau Problem GLM-5.1 Solved
To understand what makes GLM-5.1 different at inference time, it helps to understand a specific failure mode of LLMs used as agents. Previous models – including GLM-5 – tend to exhaust their repertoire early: they apply familiar techniques to make early gains, then plateau. Giving them more time does not help.
This is a structural limitation for any developer trying to use an LLM as a coding agent. The model runs the same playbook it knows, hits a wall, and stops making progress no matter how long it works. GLM-5.1, in contrast, is designed to remain effective over very long agentic horizons. It tackles ambiguous problems with better judgment and sustains longer productive sessions: it breaks down complex problems, runs tests, studies the results, and identifies blockers with real accuracy. By iteratively revising its reasoning and adjusting its strategy, GLM-5.1 sustains development across hundreds of rounds and thousands of tool calls.
Continuous operation requires more than a large context window. It requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and unproductive trial and error – allowing truly autonomous execution of complex engineering tasks.
Benchmarks: Where GLM-5.1 Stands
On SWE-Bench Pro, GLM-5.1 scores 58.4, outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro and setting a new state-of-the-art result.
The broader benchmark profile shows a well-rounded model. GLM-5.1 scores 95.3 on AIME 2026, 94.0 on HMMT Nov. 2025, 82.6 on HMMT Feb. 2026, and 86.2 on GPQA-Diamond, a graduate-level science reasoning benchmark. On agentic and tool-use benchmarks, GLM-5.1 scores 68.7 on CyberGym (a big jump from GLM-5’s 48.3), 68.0 on BrowseComp, 70.6 on τ³-Bench, and 71.8 on MCP-Atlas – the last of which matters given how many production agent systems are now built on MCP. On Terminal-Bench 2.0, the model scores 63.5, rising to 66.5 when tested with Claude Code as the scaffold.
Across 12 representative benchmarks covering reasoning, coding, agents, tool use, and browsing, GLM-5.1 exhibits a comprehensive and well-balanced performance profile. GLM-5.1 is not a single-metric improvement – it is a simultaneous improvement across general intelligence, real-world coding, and complex task execution.
In terms of overall performance, GLM-5.1’s general ability and coding performance are on par with Claude Opus 4.6.
The 8-Hour Continuous Run: What It Really Means
The most distinctive capability of GLM-5.1 is its long-horizon performance. GLM-5.1 can work autonomously on a single task for up to 8 hours, completing the full process from planning and implementation through testing, maintenance, and delivery.
For developers building autonomous agents, this changes the scope of what’s possible. Rather than shepherding the model through a series of ad hoc tool calls, you can give GLM-5.1 a complex goal and let it run the test-analyze-improve cycle on its own.
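The test-analyze-improve cycle reduces to a simple control loop: propose a change, run the tests, analyze the failure, revise, repeat until the tests pass or the round budget runs out. The sketch below uses toy stand-ins – `run_tests` and `propose_fix` are hypothetical helpers, not GLM-5.1 APIs:

```python
# Hypothetical sketch of the test-analyze-improve loop the article describes.
# `run_tests` and `propose_fix` are toy stand-ins for the real agent's
# sandboxed test execution and code revision steps.

def run_tests(candidate):
    """Toy test harness: the 'correct' solution is the value 10."""
    return ("pass", None) if candidate == 10 else ("fail", 10 - candidate)

def propose_fix(candidate, feedback):
    # A real agent would revise code from the failure trace; the toy
    # version just moves toward the target by one step per round.
    return candidate + (1 if feedback > 0 else -1)

def agent_loop(candidate, max_rounds=100):
    for round_no in range(1, max_rounds + 1):
        status, feedback = run_tests(candidate)   # test
        if status == "pass":
            return candidate, round_no            # goal reached
        candidate = propose_fix(candidate, feedback)  # analyze + improve
    raise RuntimeError("plateaued: no progress within the round budget")

solution, rounds = agent_loop(candidate=3)
print(solution, rounds)  # 10 8
```

The plateau problem discussed earlier is precisely what happens when `propose_fix` stops producing genuinely new strategies: the loop keeps spinning without the candidate improving.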
Real engineering demonstrations make this concrete: GLM-5.1 can build a complete Linux desktop environment from scratch in 8 hours; run 178 rounds of autonomous iteration on a vector retrieval function, improving its performance to 1.5× the original version; and optimize a CUDA kernel, raising its speedup from 2.6× to 35.7× through continuous tuning.
The CUDA kernel result is particularly notable for ML developers: improving a kernel from a 2.6× to a 35.7× speedup through autonomous optimization represents an iteration depth that would take a skilled human developer significant time to replicate manually.
Model Specifications and Availability
GLM-5.1 is a 754-billion-parameter MoE model released under the MIT license on Hugging Face. It offers a 200K context window and supports output lengths of up to 128K tokens – both important for long-horizon tasks that need to hold large codebases or extended reasoning chains in context.
GLM-5.1 supports multiple inference modes for different scenarios, streaming output, function calling, context caching, structured output, and MCP for integrating external tools and data sources.
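Since the platform is described as OpenAI-SDK-compatible, function calling presumably follows the familiar OpenAI-style `tools` schema. The request below is a sketch of that convention; the `run_shell` tool name and its parameters are made up for illustration:

```python
# Sketch of an OpenAI-style function-calling request payload, following the
# OpenAI SDK compatibility the article mentions. The tool name and schema
# are hypothetical, made up for illustration.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",                  # hypothetical agent tool
        "description": "Run a shell command in the sandbox and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

request = {
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "List the repo's test files."}],
    "tools": tools,
    "stream": False,                          # set True for streaming output
}
payload = json.dumps(request)                 # body sent to the chat endpoint
```

The model would respond with a `tool_calls` entry naming `run_shell` and its arguments; the agent scaffold executes the command and feeds the result back as a `tool` message.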
For local deployment, the following open-source frameworks support GLM-5.1: SGLang (v0.5.10+), vLLM (v0.19.0+), xLLM (v0.8.0+), Transformers (v0.5.3+), and KTransformers (v0.5.3+).
For API access, the model is available through the Z.AI API platform. Getting started requires installing the zai-sdk package with pip and initializing a ZaiClient with your API key.
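A minimal sketch of that setup follows. The `zai` import path and the OpenAI-style `chat.completions.create` call shape are assumptions based on the article's description of zai-sdk and OpenAI SDK compatibility – verify both against the official SDK docs:

```python
# Hedged sketch of calling GLM-5.1 via the Z.AI platform with zai-sdk.
# The import path and call shape are assumed from the article's description
# of an OpenAI-compatible SDK; check the official docs before relying on it.
import os

def build_chat_request(prompt, model="glm-5.1"):
    """Assemble the chat request body; kept separate so it can be tested offline."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

if __name__ == "__main__" and os.environ.get("ZAI_API_KEY"):
    from zai import ZaiClient                      # pip install zai-sdk
    client = ZaiClient(api_key=os.environ["ZAI_API_KEY"])
    req = build_chat_request("Summarize this repo's failing tests.")
    resp = client.chat.completions.create(**req)   # assumed OpenAI-style method
    print(resp.choices[0].message.content)
```

The network call is gated behind the `ZAI_API_KEY` environment variable so the snippet stays harmless to run without credentials.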
Key Takeaways
- GLM-5.1 sets the standard on SWE-Bench Pro with a score of 58.4, outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro — making it the strongest publicly benchmarked model for real-world software engineering tasks at the time of release.
- The model is designed for long-horizon autonomous operation, capable of working on a single complex task for up to 8 hours – running tests, updating strategies, and iterating across hundreds of rounds and thousands of tool calls without human intervention.
- GLM-5.1 uses a MoE + DSA architecture trained with asynchronous reinforcement learning, which reduces training and inference costs compared to dense transformers while maintaining long-context reliability – a practical consideration for self-hosting teams.
- It is open source under the MIT license (754B parameters, 200K context window, 128K max output tokens) and supports local deployment with SGLang, vLLM, xLLM, Transformers, and KTransformers, as well as API access via the Z.AI platform, which is compatible with the OpenAI SDK.
- GLM-5.1 goes beyond coding – it also shows strong results in front-end prototyping, artifact creation, and office document generation (Word, Excel, PowerPoint, PDF), positioning it as a general-purpose foundation for both agentic programs and high-quality content workflows.
Check out the model weights, API, and technical details.



