ChatGPT Health ‘under-triaged’ half of medical emergencies in a new study


ChatGPT Health, OpenAI’s new health-focused chatbot, often underestimated the severity of medical emergencies, according to a study published last week in the journal Nature Medicine.
In the study, researchers tested ChatGPT Health’s ability to triage medical cases, that is, to rate their severity, based on real-world scenarios.
Previous research has shown that ChatGPT can ace medical exams, and nearly two-thirds of doctors reported using some form of AI in 2024. But other research has shown that chatbots, including ChatGPT, do not provide reliable medical advice.
ChatGPT Health is separate from OpenAI’s general-purpose ChatGPT chatbot. The service is free, but users must register specifically for the health program, which currently has a waiting list. OpenAI says ChatGPT Health runs on a more secure platform so that users can safely upload personal medical information.
More than 40 million people worldwide use ChatGPT to answer questions about healthcare, and nearly 2 million weekly ChatGPT messages are about insurance, according to OpenAI. In a detailed description of ChatGPT Health on its website, OpenAI states that it is “not intended for diagnosis or treatment.”
In the study, researchers submitted 60 medical scenarios to ChatGPT Health. The chatbot’s responses were compared with those of three doctors who also reviewed the scenarios and triaged each one based on medical guidelines and clinical expertise.
Each scenario included 16 variations that changed details such as the patient’s race or sex.
The variations were designed to “produce exactly the same result,” according to the study’s lead author, Dr. Ashwin Ramaswamy, a professor of urology at Mount Sinai Hospital in New York. This meant that an emergency case involving a man still had to be classified as an emergency if the patient was a woman. The study found no significant differences in outcomes based on demographic changes.
Researchers found that ChatGPT Health “under-triaged” 51.6% of emergency cases. In other words, instead of recommending that the patient go to the emergency room, the chatbot recommended seeing a doctor within 24 to 48 hours.
The emergencies included a patient with a life-threatening diabetes complication called diabetic ketoacidosis and a patient with respiratory failure. If left untreated, both lead to death.
“Any doctor, and anyone with any training, would say this patient needs to go to the emergency department,” Ramaswamy said.
In cases such as impending respiratory failure, the chatbot seemed to “wait until the emergency became undeniable” before recommending emergency care, he said.
Emergencies with unmistakable symptoms, such as strokes, were correctly triaged in 100% of cases, the study found.
An OpenAI spokesperson said the company welcomes research investigating the use of AI in healthcare, but said the new study does not reflect how ChatGPT Health is typically used or how it is designed to work. The chatbot is designed so people can ask follow-up questions to give more context in medical situations, rather than giving a one-size-fits-all answer to a medical scenario, the spokesperson said.
ChatGPT Health is currently available only to a limited number of users, and OpenAI is still working to improve the model’s safety and reliability before making the chatbot more widely available, the spokesperson said.
Compared with the doctors in the study, the chatbot also over-triaged 64.8% of non-urgent cases, recommending a doctor’s appointment when one was not necessary. The chatbot told a patient who had had a sore throat for three days to see a doctor within 24 to 48 hours, when home care would have been sufficient.
“There is no logic, to me, as to why it was making recommendations in certain areas rather than others,” Ramaswamy said.
The chatbot’s responses to scenarios involving suicidal ideation or self-harm were also inconsistent.
When a user expresses suicidal intent, ChatGPT is supposed to refer them to 988, the Suicide & Crisis Lifeline. ChatGPT Health works the same way, the OpenAI spokesperson said.
In the study, however, ChatGPT Health sometimes referred users to 988 when it was not necessary and failed to refer them when it was.
Ramaswamy called the chatbot’s behavior “paradoxical.”
“It was inverse to the clinical risk,” he said. “It was kind of backwards.”
“A medical therapist”
Dr. John Mafi, an associate professor of medicine and primary care physician at UCLA Health who was not involved in the research, said more testing is needed on chatbots that can make health decisions.
“The message of this study is that before you launch something like this to make decisions that affect lives, you need to test it rigorously in a controlled trial, where you make sure the benefits outweigh the harms,” Mafi said.
Mafi and Ramaswamy said they have seen a number of their own patients using AI for medical matters.
Ramaswamy said people turn to AI for health advice because it is easy to access and places no limit on the number of questions they can ask.
“You can go through every question, every detail, every document that you want to upload,” Ramaswamy said. “And this fills that need. People really, really want not only medical advice but also a partner, like a medical therapist.”
OpenAI said in a January report that the majority of ChatGPT’s health-related messages occur outside of a doctor’s normal working hours, and that more than half a million weekly messages come from people living 30 minutes or more from a hospital.
“A doctor might spend 15 to 20 minutes with you in the room,” Ramaswamy said. “They won’t be able to answer all the questions.”
Risks of using a chatbot for medical advice
Despite the benefit of that unlimited availability, Ramaswamy said no when asked whether chatbots can currently provide health and medical advice safely.
Dr. Ethan Goh, executive director of the AI research network ARISE, said that in many cases AI can provide safe medical and health advice, but that it is not a substitute for advice from a doctor.
“The reality is that chatbots can be useful for a lot of things. It’s more about being thoughtful, deliberate and understanding that they also have serious limitations,” he said.
Monica Agrawal, an assistant professor in the Department of Biostatistics and Bioinformatics and the Department of Computer Science at Duke University, said it is largely unknown how AI models are trained and what data is used to train them.
She said some of the criteria used to evaluate models may not reflect a chatbot’s potential to help.
“A lot of [OpenAI’s] previous evaluations were based on, ‘We do well on a licensing exam,'” she said. “But there’s a huge difference between passing a medical exam and actually practicing medicine.”
She added that when people use chatbots, the information they provide is not always clear and may carry their own biases.
“Large language models are known to be sycophantic,” she said. “Which means they tend to agree with the user’s opinions, even if they may not be correct. And this can reinforce patients’ misconceptions or biases.”
Mafi said AI tools are “designed to please you,” but as a doctor, “sometimes you have to say something that might not please the patient.”
Ramaswamy said AI should not be relied on in an emergency and that using it in collaboration with a doctor is essential to prevent harm. He said collaborations between technology and healthcare companies are important for creating safer AI products.
“If these models get better and better, I will be able to see the benefits of a patient-AI-doctor relationship, especially in rural scenarios or in global health areas,” he said.


