What best practices help fine-tune AI tools to align with real testing needs?
How do you define success metrics for AI micro-deployments when business outcomes take months to show impact?
How do you create escalation paths for AI failures without overloading SMEs with low-value interventions?
How do you differentiate between harmless AI quirks and critical failures that impact business outcomes?
What architectural guardrails can stop AI from hallucinating or overfitting in test automation contexts?
How do you distinguish between AI being “innovative” vs. just being “weird”?
Could AI learn to recognize its own weird behavior and auto-correct?
Do you think “AI weirdness” will disappear with maturity, or is it something we’ll always need to manage?
How can we tackle AI hallucinations?
Can you run multiple customized agents during a test, and can they all remain accurate?
One of the most common mistakes teams make when adopting AI is moving too quickly from a demo to production. In the session, Dona Sarkar highlighted that teams often skip foundational steps like cleaning and validating datasets, maintaining proper data and model versioning, and setting up observability to understand how the model behaves over time.
Another frequent issue is relying on a single success metric, such as accuracy, and assuming it reflects real-world performance. Without testing AI systems in realistic, user-facing scenarios, unusual or unreliable behaviors remain hidden until they affect actual users. These gaps collectively lead to AI outputs that feel inconsistent or “weird,” reducing trust and usefulness.
To ensure AI delivers consistent and valuable results in real-world testing, teams should focus on iterative, domain-grounded validation. This starts with using representative datasets that reflect real user behavior, not ideal or synthetic scenarios.
Running shadow deployments alongside live systems helps compare AI outputs without impacting users. Setting quality gates ensures models only move forward when they meet defined performance standards. Once deployed, continuous monitoring is essential to detect drift, bias, or unexpected behavior early, with the ability to roll back quickly if issues arise.
Most importantly, AI should be evaluated against business-level KPIs, not just technical metrics. This ensures the system is delivering measurable value, not just accurate predictions.
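As a rough illustration of what a quality gate on top of a shadow deployment can look like, here is a minimal Python sketch. The input format, function name, and thresholds are assumptions for illustration, not a specific tool's API.

```python
# Minimal sketch of a pre-promotion quality gate built on shadow-deployment
# results, assuming you log (candidate_output, live_output, expected_output)
# per case. Metric names and thresholds are illustrative, not prescriptive.

def passes_quality_gate(shadow_results, min_accuracy=0.95, min_agreement=0.90):
    """shadow_results: list of (candidate_output, live_output, expected_output)."""
    total = len(shadow_results)
    accuracy = sum(1 for c, _, e in shadow_results if c == e) / total
    agreement = sum(1 for c, l, _ in shadow_results if c == l) / total
    return accuracy >= min_accuracy and agreement >= min_agreement

# Example: only promote the candidate model when both gates pass.
results = [("pass", "pass", "pass"), ("fail", "pass", "pass")]
if not passes_quality_gate(results):
    print("Candidate stays in shadow mode; investigate the disagreements.")
```

The same gate function can be extended with business-level KPIs once those are instrumented, so promotion decisions reflect value delivered rather than accuracy alone.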
The most important first step is to capture and closely review recent input–output examples along with the model’s decision trail. Start by checking whether anything changed around the system, such as new input data, a model or version update, or changes in external dependencies like APIs or rate limits. Reviewing logs and replaying a few real cases usually makes it clear whether the issue stems from data drift, a model change, or infrastructure problems.
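A simple way to make that replay concrete is sketched below. The `load`-style logging format and `model.predict` call are assumptions standing in for your own logging and inference interfaces, not a particular library's API.

```python
# Hypothetical sketch: replay recently logged cases against the current model
# and flag outputs that no longer match what was served at the time.

def replay_and_diff(model, logged_cases):
    """Each logged case is (input_payload, output_served_at_the_time)."""
    regressions = []
    for payload, original_output in logged_cases:
        current_output = model.predict(payload)  # stand-in for your inference call
        if current_output != original_output:
            regressions.append((payload, original_output, current_output))
    return regressions

# If most replays diverge, suspect a model or version change; if only cases
# with unusual inputs diverge, suspect data drift; if calls error out, look at
# infrastructure or upstream APIs.
```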
The first and most important step is to reproduce the issue in a controlled sandbox environment. Lock down the inputs, model version, and environment so the behavior can be repeated reliably. Once the failure is reproducible, you can change one factor at a time to pinpoint whether the problem comes from the code, data, configuration, or infrastructure.
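As a sketch of what "locking down" those factors can look like in practice, the snippet below pins the model version, sampling settings, and failing input. All field names and values here are hypothetical; adapt them to whatever client or framework you actually use.

```python
# Illustrative sketch of pinning the factors that make a failure reproducible
# in a sandbox, so you can then vary exactly one factor at a time.

import json
import random

REPRO_CONFIG = {
    "model_version": "my-model-2025-01-15",  # hypothetical pinned version
    "temperature": 0.0,                       # remove sampling randomness
    "seed": 42,                               # fix any remaining stochastic steps
    "prompt_file": "failing_case_0173.json",  # the exact input that failed
}

def run_repro(call_model, config=REPRO_CONFIG):
    random.seed(config["seed"])
    with open(config["prompt_file"]) as f:
        payload = json.load(f)
    # Between runs, change exactly one of: code path, data, config, or
    # infrastructure, and keep everything else from this config fixed.
    return call_model(payload,
                      model=config["model_version"],
                      temperature=config["temperature"])
```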
Yes, definitely. Trust grows when AI is clear about its limits. An AI that says “I don’t know” or shows low confidence is far more reliable than one that confidently gives wrong answers. When systems are designed to flag uncertainty and hand off to a human when needed, teams can make better decisions and avoid costly mistakes. This kind of transparency makes AI safer, more practical, and easier to adopt in real-world workflows.
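The hand-off itself can be very small. Here is a minimal sketch assuming the model, or a wrapper around it, exposes a confidence score between 0 and 1; the threshold and the escalation function are illustrative.

```python
# Minimal sketch of "say I don't know and hand off to a human".

CONFIDENCE_THRESHOLD = 0.7  # tune per use case and per the cost of being wrong

def answer_or_escalate(question, model_answer, confidence, escalate_to_human):
    if confidence < CONFIDENCE_THRESHOLD:
        escalate_to_human(question, model_answer, confidence)  # route to a reviewer
        return "I'm not confident enough to answer this; a reviewer will follow up."
    return model_answer
```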
When an AI agent starts behaving unpredictably, the first and most important step is to contain the problem. This means immediately stopping the agent’s live actions or switching it to a read-only or shadow mode.
By quarantining the agent, the team prevents any further impact on users or systems. At the same time, they can begin collecting logs, traces, and decision paths to understand what the agent was doing and why it went wrong. This controlled pause makes it easier to diagnose the root cause of the “weird” behavior before making fixes or redeploying the agent.
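One lightweight way to implement that quarantine is a wrapper that keeps the agent reasoning and logging but never lets it execute real actions. The `Agent`-style interface below is an assumption standing in for your own agent framework.

```python
# Hypothetical containment switch: flip the agent into shadow mode so it keeps
# producing (and logging) decisions without touching live systems.

import logging

logger = logging.getLogger("agent-quarantine")

class QuarantinedAgent:
    def __init__(self, agent):
        self._agent = agent

    def act(self, observation):
        decision = self._agent.decide(observation)   # reasoning still runs
        logger.info("SHADOW decision (not executed): %r", decision)
        return None                                  # no live side effects
```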
One of the hardest AI behaviors to solve in the coming years will be subtle, context-dependent hallucinations. These happen when an AI produces information that sounds completely reasonable but is factually wrong within a specific domain.
What makes this especially challenging is that these mistakes don’t look like obvious errors. The output often aligns with general knowledge and language patterns, so basic tests may pass without raising any red flags. Detecting these issues requires deep domain understanding, stronger grounding in trusted data, and smarter verification mechanisms.
At scale, building systems that can consistently validate context-specific facts across different industries and use cases remains complex. Until AI can reliably understand and verify domain nuances, these quiet hallucinations will continue to be one of the toughest problems to tackle.
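To make "verification mechanisms" a little more concrete, here is a rough sketch of one such check: requiring that each domain-specific claim can be matched to a trusted source before the answer is published. Real systems would use retrieval and entailment models rather than the exact-substring match shown here, which only illustrates the shape of the idea.

```python
# Rough sketch: ground claims against a trusted document set before release.

def verify_claims(claims, trusted_documents):
    """Return the claims that could not be grounded in any trusted document."""
    unsupported = []
    for claim in claims:
        if not any(claim.lower() in doc.lower() for doc in trusted_documents):
            unsupported.append(claim)
    return unsupported

# If unsupported claims remain, route the answer to a domain expert instead of
# publishing it automatically.
```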
Teams should track both technical and business metrics to see whether an AI system is delivering value in real-world use.
From a technical perspective, metrics like model calibration, confidence scores, and false positive or false negative rates show how accurate and reliable the AI is. On the business side, KPIs such as user escalation rate, human override frequency, time to resolution, error cost avoided, and user satisfaction help measure real impact.
When these numbers improve over time, it indicates the AI is moving from unexpected behavior to delivering consistent value.
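A couple of these metrics are cheap to compute once outcomes are labeled. The sketch below assumes decisions and outcomes are captured as simple lists; that input format is an assumption for illustration.

```python
# Minimal sketch of tracking a technical metric and a business metric per release.

def false_positive_rate(predicted_positive, actually_positive):
    fp = sum(1 for p, a in zip(predicted_positive, actually_positive) if p and not a)
    negatives = sum(1 for a in actually_positive if not a)
    return fp / negatives if negatives else 0.0

def human_override_rate(total_ai_decisions, overridden_decisions):
    return overridden_decisions / total_ai_decisions if total_ai_decisions else 0.0

# Trend these per release: falling false positives and override rates alongside
# rising satisfaction scores is the signal that "weird" behavior is receding.
```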
When using AI at scale, companies should account for data protection laws like GDPR and CCPA, which govern how user data is collected, stored, and used. Industry-specific regulations also matter, such as PCI standards in finance and HIPAA in healthcare. In addition, emerging laws like the EU AI Act introduce requirements around transparency, risk classification, and human oversight. Together, these regulations influence how AI systems are trained, explained, monitored, and responsibly deployed.