Join Rashi, Head of AI Engineering at GoodLeap, as she shares behind-the-scenes lessons from building and scaling AI systems in production.
Learn why traditional QA falls short, how to detect hallucinations, and strategies for testing probabilistic AI outputs safely.
Discover risk-based testing, observability frameworks, and team-aligned strategies that ensure AI behaves reliably while minimizing operational risk in real user journeys.
Don’t miss out: book your free spot now.
If a language model ‘hallucinates’ a wrong answer that still passes unit tests, how should QA redefine what counts as a bug?
What guardrails, beyond testing, are essential to manage AI hallucinations in live systems?
What monitoring strategies can detect hallucinations in real time once AI is deployed?
Since hallucinations often stem from training data flaws, how can testers validate data quality and coverage to reduce hallucination risks?
How do you define a “hallucination” in AI systems, and how can testers identify them effectively?
Is there a proven way to test for hallucinations, or is that only feasible in the PoC phase? Manually testing this is also a nightmare; any insights on how to deal with it?
From your experience, what tools or frameworks are best suited for hallucination testing in a fintech AI stack?
From the fintech perspective, how should hallucinations be defined and detected in AI models, especially when outputs could be financial recommendations or loan eligibility decisions?
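For readers wondering what an automated hallucination check can look like in practice, here is a minimal sketch of one common approach: asking a judge model to flag any claim in an answer that is not grounded in the retrieved source documents. The `call_llm` helper and the prompt wording are illustrative placeholders, not a description of GoodLeap's tooling or the speaker's recommendation.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model client you use (OpenAI, Bedrock, a local model, ...)."""
    raise NotImplementedError

JUDGE_PROMPT = """You are a strict fact checker.
Source documents:
{sources}

Answer under review:
{answer}

List every claim in the answer that is NOT supported by the sources.
Respond as JSON: {{"unsupported_claims": ["..."]}}"""

def groundedness_check(answer: str, sources: list[str]) -> dict:
    """Flag an answer as a potential hallucination if it makes claims
    that cannot be traced back to the retrieved sources."""
    prompt = JUDGE_PROMPT.format(sources="\n---\n".join(sources), answer=answer)
    verdict = json.loads(call_llm(prompt))
    return {
        "unsupported_claims": verdict["unsupported_claims"],
        "is_grounded": len(verdict["unsupported_claims"]) == 0,
    }
```

The same check can run offline against a regression suite or online as a sampled production audit; the threshold for "grounded enough" is a product decision, especially when outputs touch financial recommendations or eligibility decisions.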
If correctness keeps shifting in AI systems, should testing evolve from verifying outputs to anticipating consequences, and how do we practically test for consequences?
Is it possible to guarantee reliability, or should we shift toward resilience and damage control instead?
What metrics are most useful to monitor AI systems for drift or silent failures?
What role should monitoring, feedback loops, and guardrails play once AI is deployed in production?
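One concrete way to put a number on "drift or silent failures" is to compare the distribution of a logged per-response score (groundedness, confidence, refusal rate, response length) in a recent window against a baseline window. The sketch below uses SciPy's two-sample Kolmogorov–Smirnov test; the score source, window sizes, and threshold are assumptions for illustration, not a prescribed setup.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(baseline_scores: np.ndarray,
                 recent_scores: np.ndarray,
                 p_threshold: float = 0.01) -> dict:
    """Compare a recent window of per-response scores against a baseline window.

    A small p-value means the two distributions differ more than chance would
    explain, which is worth an alert even if no individual request "failed".
    """
    result = ks_2samp(baseline_scores, recent_scores)
    return {
        "ks_statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "drift_detected": result.pvalue < p_threshold,
    }

# Example: groundedness scores from last month vs. the last 24 hours (synthetic data).
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.92, scale=0.04, size=5000)  # historical scores
recent = rng.normal(loc=0.85, scale=0.07, size=500)     # recent scores, slightly worse
print(detect_drift(baseline, recent))
```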
How do you balance shipping speed with the risk of hallucinations that evade standard testing?
How do you prioritize which AI outputs or features to risk-test first?
What were the biggest surprises your team encountered when moving AI from prototype to production?
Should hallucinations be approached as bugs or as model limitations (or both)?
Do you trust AI to auto-triage bugs?
What tools or frameworks are currently most effective for automating LLM application testing?
What lessons from software testing (e.g., fuzzing, chaos testing) apply, or don’t apply, to AI systems?
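On the automation and fuzzing questions: classic property-based testing transfers reasonably well once you stop asserting exact outputs and start asserting invariants (parseable JSON, no invented decision states, no fabricated figures). The sketch below uses pytest with Hypothesis; `generate_loan_summary` is a hypothetical stand-in for whatever LLM-backed function is under test, not a real API.

```python
import json
from hypothesis import given, settings, strategies as st

def generate_loan_summary(applicant_name: str, requested_amount: int) -> str:
    """Stand-in for the real LLM-backed call; replace with your application code."""
    return json.dumps({"applicant": applicant_name, "amount": requested_amount,
                       "decision": "needs_review"})

ALLOWED_DECISIONS = {"approved", "declined", "needs_review"}

@settings(max_examples=50, deadline=None)
@given(name=st.text(min_size=1, max_size=40),
       amount=st.integers(min_value=1, max_value=10_000_000))
def test_summary_invariants(name: str, amount: int):
    """Fuzz the inputs, but assert invariants rather than exact wording."""
    raw = generate_loan_summary(name, amount)
    payload = json.loads(raw)                         # output must be parseable JSON
    assert payload["decision"] in ALLOWED_DECISIONS   # no invented decision states
    assert payload["amount"] == amount                # no hallucinated figures
```

Chaos-style experiments (degraded retrieval, truncated context, adversarial phrasing) can reuse the same invariant assertions, which is where the fuzzing analogy tends to hold and the exact-output analogy tends to break.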