Microsoft’s new tool allows devs to test AI behavior using text descriptions

AI researchers and labs have made leaps and bounds in testing AI models for everything from safety and compliance to compliance and alignment. But it seems that companies and developers are facing a new, special need: to make sure that their AI system behaves as intended for their specific product or service.
In an effort to simplify that testing process, Microsoft on Tuesday released ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.
The open-source framework, Microsoft says, makes application-specific AI behavior testing easier by using AI to turn high-level, natural-language descriptions of goals, policies, or intended behaviors into comprehensive, point-based tests that can be investigated.
ASSERT takes simple language descriptions of expected AI model behavior and policies, converts them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against a target system, and obtains results. It can also record the steps taken by the AI system, including intermediate actions and tool calls, so engineers can examine where failures occur.
Devs can provide system context, tools, and limitations, too, if they want to further customize what the test covers.
For example, a developer can specify that an AI agent researching a document should not send emails to people outside the company, and should limit confidential information to C-level executives and provide brief summaries with prior context in mind. ASSERT will use those rules to generate test cases that check whether the system consistently follows those rules.
The framework, according to Microsoft, fills a wide gap, which conventional testing would not be possible when AI models are intended to behave in a way that is created by the application or product context, policies, and tools.
“One of the things we’ve learned is that testing is critical to making good decisions,” said Sarah Bird, Microsoft’s responsible AI product manager. “Because if you don’t understand the behavior of an AI system, it’s really hard to know if it meets the bar for your organization … What we found is that if you really want to have a reliable system, you have to evaluate many dimensions that are relevant to a specific application.”
Bird said ASERT can be used to test systems during construction, after deployment, and even for ongoing monitoring.
The release comes amid a gradual but broader shift in the AI industry. As models grow more capable, researchers focus on iterative testing and regression testing, with Stanford’s HELM, MLCommons’ AILuminate, and testing groups like METR releasing benchmarks to measure how models behave under different conditions.
If you shop through links in our articles, we may earn a small commission. This does not affect our editorial independence.



