Testing AI Agents: A Practical Framework for Reliability, Performance, and Governance

Friday, April 17, 2026 · 11:30 AM – 12:00 PM · Violet Crown Charlottesville

With AI agents built on large language models being rapidly adopted in production environments, ensuring their reliability, safety, and accountability has become increasingly critical. Unlike traditional software, agentic systems are dynamic and non-deterministic: they can drift, hallucinate, and behave unpredictably. Conventional testing, which typically assumes that the same input always produces the same output, doesn’t capture these risks.

This talk introduces a framework for testing and governing AI agents, uniting principles of engineering reliability, ethical assurance, and compliance readiness. It emphasizes testing not only for functional correctness but also for responsible operation across the full AI lifecycle.

We’ll explore key layers of the framework:

Unit & Integration Testing: Validating performance at the tool and workflow level to ensure consistency and correctness.
Functional & Regression Testing: Using curated evaluation sets to benchmark reasoning and accuracy across versions.
Adversarial & Robustness Testing: Stress-testing agents under malformed or ambiguous inputs to expose hidden failure modes.
Governance & Ethical Testing: Incorporating checks for bias, fairness, explainability, privacy, and regulatory compliance (e.g., EU AI Act–aligned assessments).
Automated Evaluation & Observability Pipelines: Using open-source tools to monitor drift, log incidents, and maintain transparent audit trails for accountability.
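
To make the regression and robustness layers concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the framework presented in the talk: the names EvalCase, pass_rate, has_regressed, and survives_malformed_inputs are hypothetical, the two toy agents stand in for real LLM-backed agents, and the substring check is a deliberately simple placeholder for whatever scoring a real evaluation set would use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type: an "agent" here is any function from prompt to answer.
Agent = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # text the answer must contain (a simplistic scoring rule)


def pass_rate(agent: Agent, cases: Sequence[EvalCase], runs: int = 3) -> float:
    """Fraction of (case, run) pairs that pass.

    Each case is run several times because real agents are non-deterministic;
    an exception counts as a failed run.
    """
    passed = 0
    total = 0
    for case in cases:
        for _ in range(runs):
            total += 1
            try:
                answer = agent(case.prompt)
            except Exception:
                continue
            if case.expected.lower() in answer.lower():
                passed += 1
    return passed / total if total else 0.0


def has_regressed(baseline: float, candidate: float, tolerance: float = 0.05) -> bool:
    """True if the candidate's pass rate falls below the baseline by more than the tolerance."""
    return candidate < baseline - tolerance


def survives_malformed_inputs(agent: Agent, inputs: Sequence[str]) -> bool:
    """True if the agent returns a string for every input without raising."""
    for text in inputs:
        try:
            result = agent(text)
        except Exception:
            return False
        if not isinstance(result, str):
            return False
    return True


# Toy stand-ins for two versions of an agent (assumptions for illustration only).
def toy_agent_v1(prompt: str) -> str:
    if "capital of France" in prompt:
        return "The capital of France is Paris."
    if "2 + 2" in prompt:
        return "2 + 2 = 4"
    return "I am not sure."


def toy_agent_v2(prompt: str) -> str:
    if "capital of France" in prompt:
        return "The capital of France is Paris."
    return "I am not sure."


CASES = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is 2 + 2?", "4"),
]
MALFORMED = ["", "   ", "{not: json", "\x00\x00", "a" * 10_000]

if __name__ == "__main__":
    baseline = pass_rate(toy_agent_v1, CASES)
    candidate = pass_rate(toy_agent_v2, CASES)
    print("regressed:", has_regressed(baseline, candidate))
    print("robust:", survives_malformed_inputs(toy_agent_v2, MALFORMED))
```

In practice the pass rate for each agent version would be logged alongside the evaluation set version, so that a drop between releases can be traced in an audit trail; the tolerance is a judgment call that depends on how noisy the agent is across repeated runs.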

Attendees will leave with a practical blueprint for integrating AI governance and testing into their operational workflows. The framework helps teams build not just reliable and scalable AI systems, but responsible ones — balancing innovation and compliance to create reproducible, trustworthy agentic applications.

About the Speaker

Nidhi Gupta

PagerDuty - Applied Scientist

Nidhi Gupta is an Applied Scientist at PagerDuty, where she designs and implements production-grade AI systems, including contributions to the company’s agentic Shift Agent platform. Her work focuses on AI reliability, evaluation frameworks, and multilingual enablement, most recently delivering Japanese language support for global customers. Passionate about operationalizing trustworthy AI, she builds frameworks that balance performance, observability, and governance to make enterprise AI systems reliable, explainable, and impactful.