Context. Clinicians rely on accurate AI-generated notes; hallucinations and omissions pose patient safety risks.
Approach. Designed an evaluation framework combining reference-free model judges, reference-based metrics, and compliance-oriented checks. Orchestrated evaluation pipelines with Azure tooling and reproducible scripts.
Impact. Enabled measurable quality and safety gates for healthcare deployments, reducing manual-review overhead and increasing confidence in production readiness.
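
As a rough illustration of how the three kinds of checks could feed a single pass/fail gate, here is a minimal Python sketch. It is not the project's actual implementation. The unigram-overlap F1 stands in for whichever reference-based metrics were used, the required section names and thresholds are hypothetical, and the model judge is passed in as a plain callable so that no particular Azure or LLM API is assumed.

```python
from __future__ import annotations

import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable

# Hypothetical compliance rule: section headings every note must contain.
REQUIRED_SECTIONS = ("assessment", "plan")


def token_f1(candidate: str, reference: str) -> float:
    """Reference-based metric: unigram-overlap F1 against a reference note."""
    cand = re.findall(r"\w+", candidate.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def missing_sections(note: str) -> list[str]:
    """Compliance-oriented check: list required sections absent from the note."""
    lowered = note.lower()
    return [s for s in REQUIRED_SECTIONS if s not in lowered]


@dataclass
class GateResult:
    passed: bool
    f1: float
    judge_score: float
    missing: list[str]


def evaluate_note(
    note: str,
    reference: str,
    judge: Callable[[str], float],  # reference-free judge, assumed to return 0..1
    min_f1: float = 0.5,            # illustrative threshold, not a recommendation
    min_judge: float = 0.7,         # illustrative threshold, not a recommendation
) -> GateResult:
    """Combine the three checks; the gate passes only if all of them pass."""
    f1 = token_f1(note, reference)
    judge_score = judge(note)
    missing = missing_sections(note)
    passed = f1 >= min_f1 and judge_score >= min_judge and not missing
    return GateResult(passed, f1, judge_score, missing)
```

In use, `evaluate_note(note, reference, judge=my_judge)` returns a `GateResult` whose `passed` flag could block a deployment step, while the individual scores remain available for a reviewer to inspect; `my_judge` here is a placeholder for whatever model-based scorer is plugged in.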