Unexpected Experiments: Strategies for AI-Infused Applications

The rise of AI-infused applications, especially those using large language models (LLMs), has presented a major challenge to traditional software testing: non-determinism. Unlike traditional applications that produce consistent, predictable results, AI-based systems can produce varied, but equally correct, responses to the same input. That variability makes ensuring the reliability and validity of tests a difficult task.
The latest SD Times Live! Supercast, featuring Parasoft evangelist Arthur Hicken and Senior Director of Development Nathan Jakubiak, sheds light on practical solutions for maintaining a stable test environment for these unpredictable applications. Their approach centers on combining service virtualization with next-generation AI-based validation techniques.
Stabilizing LLM Chaos with Virtualization
The core problem stems from what Hicken calls the LLM's capriciousness, which leads to noisy tests that often fail because of small differences in wording or sentence structure. The proposed solution is to isolate the non-deterministic LLM behavior using proxies and service virtualization.
“One of the things we like to recommend to people is to first stabilize the testing environment by virtualizing the non-deterministic services within it,” explained Hicken. “So the way we do that: we have an application that is under test, and obviously, because it’s an AI-infused application, we get variations in the answers. We don’t really know what answer we’re going to get, or if it’s correct. So what we do is we take your application and attach a Parasoft virtual proxy between you and the LLM. We can capture the traffic between you and the LLM and automatically create virtual assets from it, so that we can disconnect you from the LLM in the test environment. And the nice thing is that we also learn from this, so that if your answers start to change or your questions start to change, we can adapt the virtual assets with what we call our learning mode.”
Hicken said Parasoft’s approach involves placing a virtual proxy between the application under test and the LLM. The proxy captures request-response pairs; once a pair is learned, the proxy returns the same recorded response every time that specific request is made. By cutting the live LLM out of the loop and replacing it with a virtual service, the testing environment becomes stable and deterministic.
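The record-and-replay idea described above can be sketched in a few lines. This is a minimal illustration, not Parasoft's implementation: the `VirtualProxy` class, its learning mode, and the `live_llm` stand-in are all hypothetical names invented for this example.

```python
import json
import random

class VirtualProxy:
    """Sketch of a record/replay virtual service keyed by request payload."""

    def __init__(self, live_service):
        self.live_service = live_service   # the real, non-deterministic service
        self.recordings = {}               # request -> first recorded response
        self.learning = True               # "learning mode": capture unseen pairs

    def call(self, request: dict) -> str:
        key = json.dumps(request, sort_keys=True)
        if key not in self.recordings:
            if not self.learning:
                raise KeyError(f"no recording for request: {key}")
            # First time we see this request: pass through and record it.
            self.recordings[key] = self.live_service(request)
        # Replay the recorded response: identical input, identical output.
        return self.recordings[key]

def live_llm(request):
    # Stand-in for a non-deterministic LLM: wording varies on every call.
    return random.choice(["A fine tent.", "A great tent.", "A nice tent."])

proxy = VirtualProxy(live_llm)
first = proxy.call({"q": "recommend a tent"})
assert proxy.call({"q": "recommend a tent"}) == first  # now deterministic
```

Because the cache key is the serialized request, the same question always replays the same recorded answer, which is what makes exact-match assertions viable again downstream.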
This stability is important because it allows testers to return to traditional, deterministic assertions, he said. If the LLM's recorded output matches reliably, testers can confidently verify that a second component, such as a Model Context Protocol (MCP) server, is returning its data in the right place and in the right format. This separation ensures that downstream assertions are reliable and fast.
Agentic Workflow Control with MCP Virtualization
Beyond the LLM itself, modern AI applications often rely on backend components like MCP servers for agent interactions and workflows, managing tasks like checking inventory or making purchases in a demo app. The challenge here is twofold: testing the application's interaction with the MCP server, and testing the MCP server itself.
Service virtualization reaches this layer as well. By replacing the live MCP server with a virtual service, testers can control specific outputs, including error conditions and edge cases, or simulate an unavailable environment. This ability to precisely control backend behavior allows for comprehensive, independent testing of the main application's logic. “We have a lot of control over what’s going on, so we can make sure that the whole system is performing as we would expect and test in a reasonable way, which allows for complete stability of your test environment, even when using MCPs.”
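A virtualized MCP-style backend with scripted fault injection might look like the sketch below. The class name, tool shape, and fault mechanism are invented for illustration; real MCP servers expose tools over JSON-RPC, which this deliberately simplifies.

```python
class VirtualMCPServer:
    """Hypothetical virtual stand-in for an MCP inventory backend.

    Lets a test script exact responses, including faults the live
    server would rarely produce on demand.
    """

    def __init__(self):
        self.inventory = {"tent-2p": 5, "tent-4p": 0}
        self.fail_next = None  # scripted fault, e.g. "service_unavailable"

    def check_inventory(self, sku: str) -> dict:
        if self.fail_next:
            fault, self.fail_next = self.fail_next, None
            return {"error": fault}  # simulate an outage or error condition
        return {"sku": sku, "available": self.inventory.get(sku, 0)}

mcp = VirtualMCPServer()
assert mcp.check_inventory("tent-2p") == {"sku": "tent-2p", "available": 5}

# Script an error condition for the next call, then verify recovery.
mcp.fail_next = "service_unavailable"
assert mcp.check_inventory("tent-2p") == {"error": "service_unavailable"}
assert mcp.check_inventory("tent-4p")["available"] == 0  # edge case: out of stock
```

The point is controllability: a test can force the unavailable-service path or the out-of-stock edge case deterministically, rather than waiting for the live backend to happen to produce it.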
In the Supercast, Jakubiak demonstrated these techniques with a camping store application.
The application relies on two external components: an LLM for natural-language query processing and answering, and an MCP server responsible for tasks like providing available inventory and product information or making purchases.
“Let’s say I want to go on a backpacking trip, so I need a backpacking tent. So I ask the store, please check the available options, and then recommend one for me,” said Jakubiak. The MCP server finds available tents for purchase and the LLM offers suggestions, such as a lightweight two-person tent for this trip. But, he said, “since this is an LLM-based program, if I were to run this question again, I would get a slightly different result.”
He noted that because the LLM's output is non-deterministic, traditional exact-match validation will not work, and this is where the virtual service comes in.
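The failure mode is easy to demonstrate: two runs of the same question can word a correct answer differently, so comparing strings verbatim breaks. One hedged alternative, sketched below with made-up answers and a simple keyword check standing in for richer AI-based validation, is to assert on the facts an answer must contain rather than its exact wording.

```python
# Two hypothetical runs of the same question: both correct, worded differently.
run1 = "I recommend the Trailblazer 2P, a lightweight two-person tent."
run2 = "For your trip, the lightweight two-person Trailblazer 2P is a great fit."

assert run1 != run2  # an exact-match assertion would fail here

def validate(answer: str, required_facts) -> bool:
    """Pass if every required fact appears in the answer, ignoring case."""
    return all(fact.lower() in answer.lower() for fact in required_facts)

facts = ["Trailblazer 2P", "lightweight", "two-person"]
assert validate(run1, facts)
assert validate(run2, facts)
```

Keyword containment is the crudest form of content-based validation; in practice teams layer on semantic similarity or model-assisted checks, but the principle of asserting on meaning rather than wording is the same.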
After showing how AI can be used to test complex applications, Hicken emphasized that humans will continue to play a role. “Maybe you don’t create test scripts and spend a lot of time creating these test cases. But you do the validation; you make sure that everything works as it should. And of course, with all the complexity built into all these things, you always monitor to make sure that the tests keep passing when the application changes or conditions change.”
At some level, he argued, testers will always be involved, because someone needs to look at the application to see whether it meets the business case and satisfies the user. “What we are saying is: embrace AI as a partner, keep your eye on it, and set up guardrails that give you a good view of how things are going versus how they should be. And this should help you build better, more usable applications for people.”



