
Proceedings of the 24th Workshop on Biomedical Language Processing / 2025

LLMs as Medical Safety Judges: Evaluating Alignment with Human Annotation in Patient-Facing QA

Yella Diekmann, Chase Fensore, Rodrigo M. Carrillo-Larco, Eduard Castejon Rosales, Sakshi Shiromani, Rima Pai, Meghan Shah, Joyce C. Ho

AI Safety

The increasing deployment of LLMs in patient-facing medical QA raises concerns about the reliability and safety of their responses. Traditional evaluation methods rely on expert human annotation, which is costly, time-consuming, and difficult to scale. This study explores the feasibility of using LLMs as automated judges for medical QA evaluation. We benchmark LLMs against human annotators across eight qualitative safety metrics and introduce adversarial question augmentation to assess LLMs’ robustness in evaluating medical responses. Our findings reveal that while LLMs achieve high accuracy in objective metrics such as scientific consensus and grammaticality, they struggle with more subjective categories like empathy and extent of harm. This work contributes to the ongoing discussion on automating safety assessments in medical AI and informs the development of more reliable evaluation methodologies.
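As a rough illustration of the LLM-as-judge comparison the abstract describes, the sketch below scores per-metric agreement between hypothetical human annotations and LLM-judge labels. The record format, metric names, and the choice of Cohen's kappa and raw accuracy as agreement statistics are assumptions made for illustration; none of these details are taken from the paper itself.

```python
# Illustrative sketch: comparing an LLM judge's safety labels against human
# annotations, reporting one agreement score per qualitative metric.
# Metric names, label values, and the toy records below are hypothetical.
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score, accuracy_score

# Hypothetical annotation records: (metric, human_label, llm_label)
records = [
    ("scientific_consensus", "aligned", "aligned"),
    ("scientific_consensus", "opposed", "aligned"),
    ("grammaticality",       "good",    "good"),
    ("empathy",              "low",     "high"),
    ("extent_of_harm",       "none",    "minor"),
    ("extent_of_harm",       "severe",  "severe"),
]

# Group human and LLM labels by safety metric.
by_metric = defaultdict(lambda: ([], []))
for metric, human, llm in records:
    by_metric[metric][0].append(human)
    by_metric[metric][1].append(llm)

for metric, (human_labels, llm_labels) in by_metric.items():
    acc = accuracy_score(human_labels, llm_labels)
    # Kappa is uninformative when only one label value occurs; guard for that.
    kappa = (cohen_kappa_score(human_labels, llm_labels)
             if len(set(human_labels) | set(llm_labels)) > 1 else float("nan"))
    print(f"{metric:22s}  accuracy={acc:.2f}  kappa={kappa:.2f}")
```

In a study like this one, such per-metric scores would be computed over the full annotated dataset, making visible the gap the abstract reports between objective metrics (high agreement) and subjective ones such as empathy or extent of harm (lower agreement).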

4 citations · 1 influential citation
