AI chatbots miss urgent issues in queries about women’s health


Many women are using AI for health information, but the answers aren’t always up to par
Commonly used AI models fail to give accurate diagnoses or guidance for many women’s health queries that require urgent attention.
Thirteen large language models, produced by OpenAI, Google, Anthropic, Mistral AI and xAI, received 345 medical queries across five specialties, including emergency medicine, gynecology and neurology. The queries were written by 17 women’s health researchers, pharmacists and clinicians from the United States and Europe.
The responses were reviewed by the same experts, and every question that any model failed was collected into a 96-query benchmark of the AI models’ medical expertise.
Across all models, about 60 percent of questions received answers that the experts had previously judged insufficient as medical advice. GPT-5 was the best-performing model, failing on 47 percent of queries, while Ministral 8B had the highest failure rate, at 73 percent.
“I’ve seen more and more women around me turning to AI tools for health and decision support,” says Victoria-Elisabeth Gruber, a team member at Lumos AI, a firm that helps companies evaluate and improve their own AI models. She and her colleagues recognized the risks of relying on technology that inherits and amplifies existing gender gaps in medical knowledge. “This is what motivated us to build a first reference in this field,” she says.
The failure rate surprised Gruber. “We expected some variance, but what stood out was the degree of variation between models,” she says.
The results are unsurprising given how AI models are trained, on human-generated historical data that carries built-in biases, says Cara Tannenbaum at the University of Montreal in Canada. They highlight “a clear need for online health sources, as well as professional health societies, to update their web content with more explicit, evidence-based information about sex and gender that AI can use to more accurately support women’s health,” she says.
Jonathan H. Chen of Stanford University in California says the 60 percent failure rate touted by the researchers behind the analysis is somewhat misleading. “I wouldn’t cling to the 60 percent figure, since it was a limited sample and designed by experts,” he says. “[It] was not designed to be a large sample or representative of what patients or doctors would routinely ask for.”
Chen also points out that some of the scenarios are graded very conservatively, which inflates the potential failure rates. For example, if a postpartum woman reports a headache, the benchmark counts a model as failing if it does not immediately suspect pre-eclampsia.
Gruber accepts these criticisms. “Our goal was not to claim that the models are dangerous overall, but to define a clear, clinically based standard of assessment,” she says. “The benchmark is intentionally conservative and stricter in how it defines failures, because in health care, even seemingly minor omissions can matter depending on the context.”
An OpenAI spokesperson said: “ChatGPT is designed to support, not replace, medical care. We work closely with clinicians around the world to improve our models and perform ongoing evaluations to reduce harmful or misleading responses. Our latest GPT-5.2 model is our most capable yet at taking into account important user context such as gender. We take the accuracy of model outputs seriously, and while ChatGPT can provide useful information, users should still rely on qualified clinicians for care and treatment decisions.” The other companies whose AIs were tested did not respond to New Scientist’s request for comment.