Microsoft’s multi-agent AI system tops Anthropic Mythos in cybersecurity benchmark

Legends are MDASH’d.
A new AI-powered program from Microsoft has overtaken title-holder Anthropic on a prominent cybersecurity benchmark, using more than 100 specialized AI agents that work together across multiple AI models to detect real-world software vulnerabilities.
Microsoft’s program, code-named MDASH, was launched this week with the disclosure of 16 new vulnerabilities it found in different versions of Windows, including four “critical” remote code execution flaws that were fixed in this month’s Patch Tuesday release.
The company, which has faced ongoing criticism for security lapses, is betting that multiple models can detect vulnerabilities at a speed that individual models cannot match.
MDASH, short for “multi-model scanning harness,” runs specialized AI agents through a staged pipeline. One set of agents scans the code for potential vulnerabilities, a second set debates whether each finding is valid and exploitable, and a final stage generates a proof-of-concept attack to confirm the bug exists.
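The staged pipeline described above can be sketched in a few lines of Python. Everything here — the `Finding` type, the agent signatures, and the unanimous-vote validation — is an illustrative assumption about what such a harness might look like, not Microsoft's actual MDASH design.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    location: str     # where in the code the issue was flagged
    description: str  # what the scanning agent reported

def run_pipeline(code, scan_agents, validator_agents, poc_agent):
    # Stage 1: every scanning agent flags candidate vulnerabilities.
    candidates = [f for agent in scan_agents for f in agent(code)]
    # Stage 2: keep a finding only if every validator agrees it is
    # plausible and exploitable (a simple stand-in for agent "debate").
    validated = [f for f in candidates
                 if all(v(code, f) for v in validator_agents)]
    # Stage 3: a proof-of-concept agent tries to trigger each bug;
    # only findings with a working PoC are reported.
    return [(f, poc) for f in validated
            if (poc := poc_agent(code, f)) is not None]

# Toy demonstration over a fake "codebase" string.
code = "strcpy(buf, user_input);"
scanners = [lambda c: [Finding("line 1", "unbounded strcpy")]
            if "strcpy" in c else []]
validators = [lambda c, f: "user_input" in c]  # reachable from untrusted input?
make_poc = lambda c, f: "A" * 4096             # oversized payload as a mock PoC
confirmed = run_pipeline(code, scanners, validators, make_poc)
```

The funnel shape is the point: many cheap scanning passes feed fewer, more expensive validation and exploitation stages, so only findings that survive every stage are disclosed.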
By comparison, Anthropic’s Mythos, which raised concerns about its ability to detect and exploit software vulnerabilities when it was previewed earlier this year, is a single AI model operating within an agent framework. Anthropic has limited its release to a few companies through an organization called Project Glasswing, which includes Microsoft.
OpenAI’s GPT-5.5 and the other entries on the leaderboard are also single-model systems.
MDASH scored 88.45% on the CyberGym benchmark, a test developed by UC Berkeley researchers that measures how well AI systems can reproduce real-world vulnerabilities across 1,507 tasks taken from 188 open source software projects.
Mythos Preview was second at 83.1%, followed by GPT-5.5 at 81.8%.
The benchmark gives each system a description of a known vulnerability and the unpatched codebase, and measures whether it can produce a working attack that triggers the bug.
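In rough terms, the headline score reduces to a pass rate over tasks. A minimal sketch, assuming each task boils down to "did the generated attack trigger the known bug" — the function names and task shape here are illustrative, not CyberGym's actual API:

```python
def cybergym_style_score(tasks, system):
    # tasks: list of (description, codebase, triggers_bug) tuples, where
    # triggers_bug(poc) -> bool reports whether a PoC fires the known bug.
    # system: the model under test, mapping (description, codebase) -> PoC.
    passed = sum(1 for desc, code, triggers_bug in tasks
                 if triggers_bug(system(desc, code)))
    return 100.0 * passed / len(tasks)

# Toy run: a "system" that only solves tasks mentioning overflows.
tasks = [
    ("heap overflow in parser", "...", lambda poc: poc == "long input"),
    ("use-after-free in cache", "...", lambda poc: poc == "dangling ref"),
]
system = lambda desc, code: "long input" if "overflow" in desc else "guess"
rate = cybergym_style_score(tasks, system)  # 50.0 on this toy set
```

On the real benchmark the denominator is the 1,507 tasks drawn from 188 open source projects, so MDASH's 88.45% corresponds to reproducing roughly 1,333 of them.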
CyberGym leaderboard scores are self-reported by the companies, including Anthropic’s Mythos score. The benchmark code is public, but no independent party has verified the scores, and benchmark results may not reflect real-world performance.
The results also highlight growing concerns about AI as an offensive hacking tool: the same capabilities that let defenders find vulnerabilities can help attackers find them to exploit. Microsoft said MDASH is being used internally by its security engineering teams and will be offered to a limited set of customers through a private preview.
Microsoft is telling customers to expect more fixes in future Patch Tuesday releases as AI accelerates vulnerability detection.



