Research

Overview

This mentored research project examines the safety of large language model chatbots when providing medical advice. As millions of patients increasingly turn to AI tools for health-related guidance, this study investigates whether these systems produce unsafe or misleading responses and how their performance varies across models.

Millions of patients are regularly using large language model (LLM) chatbots for medical advice, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots—Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta—on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women’s health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6% (Claude) to 43.2% (Llama), with unsafe responses varying from 5% (Claude) to 13% (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.
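For readers curious about the quantitative comparison, the sketch below shows one common way such response rates could be compared statistically. It is illustrative only: the counts are back-calculated from the reported percentages (21.6% and 43.2% of 222 questions), and the paper's actual statistical methods may differ.

# Illustrative sketch only; not the study's actual analysis code.
# Counts are back-calculated from the reported problematic-response rates.
from scipy.stats import chi2_contingency

n_questions = 222
claude_problematic = round(0.216 * n_questions)  # ~48 problematic responses
llama_problematic = round(0.432 * n_questions)   # ~96 problematic responses

# 2x2 contingency table: [problematic, not problematic] for each chatbot
table = [
    [claude_problematic, n_questions - claude_problematic],
    [llama_problematic, n_questions - llama_problematic],
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.4g}")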

Research Motivation

The goal of this project was to understand how AI-driven medical chatbots affect patient safety in real-world settings. While LLMs have the potential to increase access to medical information, errors, hallucinations, and biased outputs can pose serious risks. This research explores how differences in model design translate into differences in safety outcomes.

My Role & Contributions

I joined this physician-led research project in June 2024 as part of a mentored internship. My primary contribution was building the dataset that served as the foundation of the study: I helped structure the HealthAdvice dataset, curated and organized patient-posed medical questions, and systematically collected chatbot responses across the four models. I also assisted in annotating and formatting responses so that physicians could evaluate safety risks consistently. Much of this work was completed independently, and the resulting paper is currently under journal review.

This research was conducted remotely using Python-based data collection and organization tools, large language model interfaces, and annotation frameworks to support both qualitative and quantitative analysis. Academic research databases and collaborative documentation tools were used to guide study design and analysis.
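As a rough illustration of what this kind of Python-based response collection can look like (a minimal sketch, not the study's actual pipeline; the example question, output file, and use of the OpenAI SDK here are assumptions), consider:

# Hypothetical sketch of collecting chatbot responses for later annotation.
# Assumes the openai Python SDK and an OPENAI_API_KEY in the environment;
# the question below is an invented example, not drawn from HealthAdvice.
import csv
from openai import OpenAI

client = OpenAI()
questions = ["Is it safe to take ibuprofen while breastfeeding?"]

with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "question", "response"])
    for question in questions:
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
        )
        writer.writerow(["gpt-4o", question, completion.choices[0].message.content])

A full pipeline would repeat this loop across all four chatbots and all 222 questions, storing every response in a consistent format so physicians could annotate them efficiently.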

Collaboration & Mentorship

I worked closely with my mentor, Dr. Rachel Draelos, meeting weekly for feedback and guidance. While the study design and evaluation framework were collaborative, I independently managed much of the data collection, dataset organization, and documentation. This balance strengthened both my technical independence and my ability to incorporate expert feedback.

Outcomes & Reflection

This project reinforced my interest in ethical, human-centered AI and showed me that safety is as important as innovation. I learned that careful dataset design, transparency, and interdisciplinary collaboration are essential when deploying AI in high-stakes settings. Moving forward, I hope to continue researching how AI systems can be made safer, more efficient, and more accountable.
