All major AI models risk encouraging dangerous science experiments


Science labs can be dangerous places
PeopleImages/Shutterstock
The use of AI models in science labs risks enabling dangerous experiments that could cause fires or explosions, researchers have warned. Such models offer a convincing illusion of understanding, but are likely to lack fundamental and vital safety precautions. When testing 19 cutting-edge AI models, each made life-threatening errors.
Serious accidents in academic laboratories are rare, but they are certainly not unheard of. In 1997, chemist Karen Wetterhahn was killed by dimethylmercury that seeped through her protective gloves; in 2016, an explosion cost a researcher her arm; and in 2014, a scientist became partially blind.
Today, AI models are being put to use in a variety of industries and fields, including in research laboratories where they can be used to design experiments and procedures. AI models designed for niche tasks have been used successfully in a number of scientific fields, such as biology, meteorology and mathematics. But large general-purpose models tend to invent things and answer questions even if they don’t have access to the data needed to formulate a correct answer. This can be annoying when researching vacation destinations or recipes, but potentially fatal when designing a chemistry experiment.
To study risks, Xiangliang Zhang of the University of Notre Dame in Indiana and colleagues created a test called the LabSafety Bench that can measure whether an AI model identifies potential dangers and harmful consequences. It includes 765 multiple choice questions and 404 illustrated lab scenarios that may include safety issues.
In multiple-choice tests, some AI models, such as Vicuna, performed almost as low as one would see with random guesses, while GPT-4o achieved 86.55 percent accuracy and DeepSeek-R1 achieved 84.49 percent accuracy. When tested with images, some models, such as InstructBlip-7B, achieved less than 30% accuracy. The team tested 19 large language models (LLMs) and state-of-the-art vision language models on LabSafety Bench and found that none achieved overall accuracy above 70%.
Zhang is optimistic about the future of AI in science, even in so-called autonomous labs where robots work alone, but says the models are not yet ready to design experiments. “Now? In a laboratory? I don’t think so. They were very often trained for general tasks: rewriting an email, polishing a paper or summarizing a paper. They do very well at these kinds of tasks.” [But] they don’t have the domain knowledge about it [laboratory] dangers. »
“We welcome research that helps make scientific AI safe and reliable, especially in high-stakes labs,” an OpenAI spokesperson said, emphasizing that researchers have not tested its flagship model. “GPT-5.2 is our most capable scientific model to date, with significantly more powerful reasoning, planning, and error detection than the model discussed in this article to better support researchers. It is designed to accelerate scientific work while humans and existing safety systems remain responsible for safety-critical decisions.”
Google, DeepSeek, Meta, Mistral and Anthropic did not respond to a request for comment.
Allan Tucker of Brunel University London says AI models can be invaluable when used to help humans design new experiences, but there are risks and humans need to stay in the loop. “The behavior of these [LLMs] are certainly not well understood in a typical scientific sense,” he says. “I think the new class of LLMs that mimic language – and not much else – are clearly being used in inappropriate contexts because people trust them too much. There is already evidence that humans are starting to sit back and die out, letting AI do the hard work, but without proper scrutiny. »
Craig Merlic, of the University of California, Los Angeles, says he ran a simple test in recent years, asking AI models what to do if you spill sulfuric acid on yourself. The correct answer is to rinse with water, but Merlic says he’s found that AIs always warn against this, erroneously adopting unrelated advice about adding water to the acid in experiments due to heat buildup. However, he says, in recent months the models have started to give the correct answer.
Merlic says it is vital to instill good security practices in universities because there is a constant flow of new students with little experience. But he is less pessimistic than other researchers about the place of AI in the design of experiments.
“Is it worse than humans? It’s one thing to criticize all these big language models, but they haven’t tested it on a representative group of humans,” says Merlic. “There are some humans who are very careful and some who are not. It is possible that large language models are better than a certain percentage of beginning graduates, or even experienced researchers. Another factor is that large language models improve every month, so the numbers in this article will probably be completely invalid in six months.”
Topics:



