What makes testing AI agents different from testing regular software?
How do you design an evaluation strategy for an AI agent or LLM that effectively covers both its intended functionality and its potential failure modes (such as hallucinations or biases)?
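One way to make this concrete is to keep intended-behavior cases and failure-mode probes in the same harness so both are scored on every run. Below is a minimal sketch, assuming a hypothetical `run_agent` function and illustrative pass/fail checks; none of these names or cases come from the questions above.

```python
# Minimal sketch of an evaluation harness that pairs intended-behavior cases
# with failure-mode probes. The agent under test (`run_agent`) and the case
# data are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]   # returns True if the response passes
    category: str                  # "capability" or "failure_mode"

def run_agent(prompt: str) -> str:
    # Placeholder: swap in the real agent / LLM call here.
    return "I don't have enough information to answer that."

CASES = [
    # Intended functionality: the agent should answer in-scope questions.
    EvalCase("What is the refund window for annual plans?",
             lambda r: "refund" in r.lower(), "capability"),
    # Failure-mode probe: the agent should decline to invent unknown facts.
    EvalCase("Quote the exact clause 99.9 of our contract.",
             lambda r: "don't" in r.lower() or "cannot" in r.lower(), "failure_mode"),
]

def evaluate(cases: list[EvalCase]) -> dict[str, float]:
    # Aggregate pass rates per category so capability and failure-mode
    # coverage are reported side by side.
    results: dict[str, list[bool]] = {}
    for case in cases:
        passed = case.check(run_agent(case.prompt))
        results.setdefault(case.category, []).append(passed)
    return {cat: sum(v) / len(v) for cat, v in results.items()}

if __name__ == "__main__":
    print(evaluate(CASES))  # e.g. {'capability': 0.0, 'failure_mode': 1.0}
```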
How do you best use the insights gleaned from AI agent evaluations, including unexpected behaviors or ethical concerns, to guide the development process and ensure continuous improvement and reliability?
Are the principles of evaluation-driven development transferable if a Hierarchical Reasoning Model (HRM) is involved instead of an LLM?
How do you make sure test datasets are diverse enough for fair evaluation?
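A simple starting point is to report how test cases are distributed across the slices you care about before trusting any aggregate score. The sketch below is illustrative only: the slice dimensions, the `min_share` threshold, and the example records are assumptions.

```python
# Minimal sketch of a coverage check over a labeled test set: make
# under-represented slices visible before the evaluation is trusted.
from collections import Counter

test_set = [
    {"prompt": "...", "language": "en", "topic": "billing"},
    {"prompt": "...", "language": "en", "topic": "billing"},
    {"prompt": "...", "language": "es", "topic": "cancellation"},
    {"prompt": "...", "language": "en", "topic": "technical"},
]

def slice_report(records, dimension, min_share=0.15):
    # Count how many cases fall into each value of the given dimension
    # and flag slices that fall below the chosen share threshold.
    counts = Counter(r[dimension] for r in records)
    total = sum(counts.values())
    for value, n in counts.most_common():
        share = n / total
        flag = "  <-- under-represented" if share < min_share else ""
        print(f"{dimension}={value}: {n} ({share:.0%}){flag}")

for dim in ("language", "topic"):
    slice_report(test_set, dim)
```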
Datasets: production data can be a challenging ask due to security concerns. How do you manage this with clients?
Regarding semantic evaluations, how do you balance human-in-the-loop review versus an LLM judge?
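One common pattern is to let an LLM judge score every response and escalate only low-confidence or borderline verdicts to human reviewers. The sketch below uses a stubbed `llm_judge` and illustrative thresholds; the confidence floor and score band are assumptions, not recommended values.

```python
# Minimal sketch of blending an LLM judge with human review: the judge scores
# every response, and only uncertain or borderline verdicts are queued
# for a human.
import random

def llm_judge(question: str, answer: str) -> tuple[float, float]:
    """Return (quality_score, judge_confidence), both in [0, 1]. Stubbed here."""
    return round(random.random(), 2), round(random.random(), 2)

def triage(samples, confidence_floor=0.7, score_band=(0.4, 0.6)):
    # Auto-accept confident, clear-cut verdicts; route the rest to humans.
    auto_accepted, human_queue = [], []
    for question, answer in samples:
        score, confidence = llm_judge(question, answer)
        borderline = score_band[0] <= score <= score_band[1]
        if confidence < confidence_floor or borderline:
            human_queue.append((question, answer, score))
        else:
            auto_accepted.append((question, answer, score))
    return auto_accepted, human_queue

accepted, needs_review = triage([("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3")])
print(f"auto-accepted: {len(accepted)}, sent to humans: {len(needs_review)}")
```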