Join Viktoria Semaan, Principal Technical Evangelist and AI Engineer at Databricks, for How to Build Enterprise-Grade AI Agents Using Robust Evaluation.
Discover why vague benchmarks hold teams back, and how to replace them with reliable, actionable metrics.
Viktoria will share a framework for designing custom evaluations, calibrating LLM judges for scalable assessments, and tracking results over time. Plus, see a live demo of building an evaluation workflow with MLflow 3.
Reserve your free spot now and learn how to make GenAI systems measurable, transparent, and enterprise-ready!
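As a taste of what the demo covers, here is a minimal sketch of an evaluation workflow using MLflow's `mlflow.evaluate` API with its built-in LLM-judge metrics. The agent stub, sample data, and default judge model are illustrative assumptions only; the live demo uses MLflow 3 and Databricks tooling, whose GenAI-specific evaluation APIs may differ.

```python
# Minimal sketch of an LLM-judged evaluation run with MLflow.
# `my_agent` and the sample data are hypothetical placeholders; the judge
# metrics below default to an OpenAI model, so an API key (or another
# configured judge model) is assumed.
import mlflow
import pandas as pd
from mlflow.metrics.genai import answer_correctness, answer_relevance


def my_agent(questions: pd.Series) -> list[str]:
    # Placeholder: call your deployed agent or chain here.
    return [f"Stub answer for: {q}" for q in questions]


eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow used for?"],
        "ground_truth": [
            "MLflow is an open-source platform for managing the ML lifecycle, "
            "including experiment tracking, model packaging, and evaluation."
        ],
    }
)

with mlflow.start_run(run_name="agent-eval"):
    results = mlflow.evaluate(
        model=lambda df: my_agent(df["inputs"]),  # callable over the eval DataFrame
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[answer_correctness(), answer_relevance()],  # LLM-judge metrics
    )
    print(results.metrics)                        # aggregate scores
    print(results.tables["eval_results_table"])   # per-row judge output
```

Because each run is logged to MLflow, the aggregate and per-row judge scores can be compared across runs, which is one way to track evaluation results over time.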
When calibrating LLM judges for evaluation, what safeguards should QA put in place to avoid bias?
What’s the most practical way to test if an AI agent is really ready for enterprise use?
What is the best way to measure the effectiveness of an AI agent?
What evaluation frameworks and metrics are most critical to ensure AI agents not only perform accurately in lab settings but also remain reliable, safe, and scalable in real-world enterprise environments?
If we need reference data every time we evaluate the agents, isn’t that double the work compared to just having an automation framework?
What strategies ensure robust reproducibility of GenAI evaluation workflows?
How can continuous evaluation and monitoring be implemented in QA workflows to maintain AI agent performance over time?
What are the current biggest challenges in building and evaluating enterprise-grade AI agents, and how can they be overcome now - or a year or two from now?
Do you think in the next 5 years, AI will shift QA from ‘test execution’ to ‘risk prediction’? What might that transition look like?
How can AI agents be used to test an enterprise’s AI models, and does this generate biased results?
With the potential for AI agents to generate hallucinations, can Databricks’ evaluation framework effectively detect and mitigate these issues?
How do you decide between building domain-specific AI agents vs. general-purpose agents for enterprises?
How can evaluation frameworks adapt to evolving GenAI models?
How do you ensure evaluation results reflect real-world usage scenarios?
Can vibe testing be effective on certain functions of enterprise-grade AI?
How do you calibrate LLM judges to reduce evaluation bias?
If AI inside IDEs starts auto-suggesting test cases during development, how do we ensure they are meaningful and not just boilerplate?