‘Rectal garlic insertion for immune support’: Medical chatbots confidently give disastrously misguided advice, experts say


Popular AI chatbots often fail to recognize false health claims when they are couched in formal, medical-sounding language, leading to questionable advice that could endanger the general public, such as a recommendation that people insert garlic cloves into their butts, according to a study published in January in the journal The Lancet Digital Health. Another study, published in February in the journal Nature Medicine, found that chatbots were no better than a simple internet search.
The findings add to a growing body of evidence suggesting that such chatbots are not reliable sources of health information, at least for the general public, experts told Live Science.
“The main problem is that LLMs do not fail like doctors fail,” Dr. Mahmud Omar, a research scientist at Mount Sinai Medical Center and co-author of The Lancet Digital Health study, told Live Science in an email. “A doctor who is unsure will pause, hedge, order another test. An LLM gives the wrong answer with exactly the same confidence as the right one.”
“Rectal insertion of garlic for immune support”
LLMs are designed to respond to written input, such as a medical query, with natural-sounding text. ChatGPT and Gemini – as well as medical LLMs, like Ada Health and ChatGPT Health – are trained on huge amounts of data, including much of the medical literature, and achieve near-perfect scores on medical licensing exams.
And people use them widely: although most LLMs carry a warning that they should not be relied upon for medical advice, more than 40 million people turn to ChatGPT daily with medical questions.
But in the January study, researchers assessed how well LLMs handled medical misinformation, testing 20 models with more than 3.4 million posts drawn from public forums and social media conversations, real hospital discharge notes altered to contain a single false recommendation, and fabricated claims endorsed by doctors.
“About one in three times, they were confronted with false medical information and just went along with it,” Omar said. “The finding that caught us off guard wasn’t the overall susceptibility. It was the pattern.”
When false medical claims were presented in casual, Reddit-style language, the models were fairly skeptical, failing about 9% of the time. But when the same claims were rephrased in formal clinical language – a discharge note advising patients to “drink cold milk daily for esophageal bleeding” or recommending “rectal insertion of garlic for immune support” – the models failed 46% of the time.
The reason may be structural: because LLMs are trained on text, they have learned that clinical language signals authority, but they do not verify whether a statement is true. “They evaluate whether it sounds like something a reliable source would say,” Omar said.
But when the misinformation was framed using logical fallacies – “an experienced clinician with 20 years of experience endorses this” or “everyone knows it works” – the models became more skeptical. Indeed, LLMs have “learned to be wary of the rhetorical tricks of Internet arguments, but not the language of clinical documentation,” Omar added.
For this reason, Omar believes that LLMs cannot be trusted to evaluate and convey medical information.
No better than an internet search
In the Nature Medicine study, researchers asked how well chatbots helped people make medical decisions, such as whether to see a doctor or go to the emergency room. The researchers concluded that LLMs offered no better information than a traditional internet search, in part because participants did not always ask the right questions and the answers they received often mixed good and bad recommendations, making it hard to determine what to do.
This is not to say that everything chatbots relay is rubbish.
AI chatbots “can give very good recommendations, so they are at least somewhat trustworthy,” Marvin Kopka, an AI researcher at the Technical University of Berlin who was not involved in the research, told Live Science via email.
The problem is that people without expertise have “no way of judging whether the result they get is correct or not,” Kopka said.
For example, a chatbot can give a recommendation on whether a severe headache after a night at the movies is meningitis, warranting a visit to the emergency room, or something more benign, according to the study. But users won’t know whether this advice is solid, and recommending a wait-and-see approach could be dangerous. “While it could probably be helpful in many situations, it could be actively harmful in others,” Kopka said.
The results suggest that chatbots are not a great tool for the public to use to make healthcare decisions.
That’s not to say chatbots can’t be useful in medicine, Omar said, “but not in the way people use them today.”
Bean, A. M., Payne, R. E., Parsons, G., Kirk, H. R., Ciro, J., Mosquera-Gómez, R., M., S. H., Ekanayaka, A. S., Tarassenko, L., Rocher, L., & Mahdi, A. (2026). Reliability of LLMs as medical assistants for the general public: a pre-registered randomized study. Nature Medicine, 32(2), 609–615. https://doi.org/10.1038/s41591-025-04074-y

