Many women are using AI to get health information, but the answers aren't always up to par.
Commonly used AI models cannot accurately diagnose or provide advice on many women's health issues that require urgent attention.
Thirteen large language models created by companies including OpenAI, Google, Anthropic, Mistral AI and xAI were given 345 medical queries across five specialties, including emergency medicine, gynecology and neurology. The queries were submitted by 17 women's health researchers, pharmacists and physicians from the United States and Europe.
The responses were reviewed by the same experts, and queries that the models failed to answer adequately were compiled into a 96-query benchmark for assessing AI models on medical questions.
Across all models, approximately 60 percent of the answers were judged by the human experts to fall short of adequate medical advice. GPT-5 was the best-performing model, failing on 47 percent of queries, while Ministral 8B had the highest failure rate at 73 percent.
“I have seen more and more women in my circle turn to artificial intelligence tools for health-related issues and decision-making,” says team member Victoria-Elizabeth Gruber at Lumos AI, a firm that helps companies evaluate and improve their own AI models. She and her colleagues recognized the risks of relying on technology that inherits and reinforces existing gender gaps in medical knowledge. “This is what motivated us to create the first benchmark in this area,” she says.
The frequency of failures surprised Gruber. “We expected some discrepancies, but what stood out was the extent of the differences between the models,” she says.
The results are unsurprising, since AI models are trained on human-generated historical data with built-in biases, says Kara Tannenbaum at the University of Montreal, Canada. The findings point to “a clear need for online health sources, as well as professional health societies, to update their web content with more explicit, evidence-based information about sex and gender that AI can use to better support women's health”, she says.
Jonathan H. Chen at Stanford University in California says the 60 percent failure rate reported by the researchers behind the analysis is somewhat misleading. “I wouldn't go with the 60 percent figure because it was a limited sample compiled by experts,” he says. “[It] was not intended to be a broad sample or representative of what patients or physicians regularly ask about.”
Chen also notes that some of the benchmark's scenarios are overly conservative and primed to produce failures. For example, if a postpartum woman reports a headache, a model is judged to have failed unless it immediately suspects preeclampsia.
Gruber acknowledges these criticisms. “Our goal was not to say that the models are generally unsafe, but to define a clear, clinically sound standard of assessment,” she says. “The metric is intentionally conservative and more rigorous in how it defines failures, because in healthcare even seemingly minor failures can make a difference depending on the context.”
An OpenAI spokesperson said: “ChatGPT is designed to support, not replace, medical care. We work closely with clinicians around the world to improve our models and conduct ongoing evaluations to reduce harmful or misleading responses. Our latest model, GPT 5.2, is our strongest at incorporating important user context, such as gender. We take the accuracy of model output seriously, and while ChatGPT can provide useful information, users should always rely on qualified clinicians to make care and treatment decisions.” The other companies whose AI models were tested did not respond to New Scientist's request for comment.