OpenClaw Agents Can Be Guilt-Tripped Into Self-Sabotage

Last month, researchers at Northeastern University invited a group of OpenClaw agents to join their lab. The result? Complete chaos.

The viral AI assistant has been widely touted as both a transformative technology and a potential security risk. Experts note that tools like OpenClaw, which work by giving AI models free access to a computer, can be tricked into disclosing personal information.

The Northeastern lab study goes even further, showing that the good behavior built into today’s most powerful models can itself become a vulnerability. In one example, the researchers successfully “guilted” an agent into giving up secrets by scolding it for sharing information about someone on the AI-only social network Moltbook.

“These behaviors raise unresolved questions regarding accountability, delegated authority, and responsibility for downstream harms,” the researchers write in a paper describing the work. The results “deserve urgent attention from lawyers, policy makers and researchers from all disciplines,” they add.

The OpenClaw agents deployed in the experiment were powered by Claude from Anthropic as well as a model called Kimi from the Chinese company Moonshot AI. They were given full access (in a virtual machine sandbox) to personal computers, various applications, and fake personal data. They were also invited to join the lab’s Discord server, allowing them to chat and share files with one another as well as with their human colleagues. OpenClaw’s security guidelines note that letting agents communicate with multiple people is not inherently secure, but the software imposes no technical restrictions on doing so.

Chris Wendler, a postdoctoral researcher at Northeastern, says he was inspired to create the agents after hearing about Moltbook. But when Wendler invited a colleague, Natalie Shapira, to join the Discord server and interact with the agents, “that’s when the chaos started,” he says.

Shapira, another postdoctoral researcher, was curious to see what the agents would be willing to do when pushed. When an agent explained to her that it was unable to delete a specific email to keep the information confidential, she urged it to find an alternative solution. To her surprise, the agent disabled the messaging app instead. “I didn’t expect things to fall apart so quickly,” she says.

Researchers then began to explore other ways to manipulate agents’ good intentions. By emphasizing the importance of keeping track of everything said to them, for example, the researchers successfully tricked an agent into copying large files until it exhausted its host machine’s disk space, meaning it could no longer save information or remember past conversations. Similarly, by having an agent excessively monitor its own behavior and that of its peers, the team was able to send multiple agents into a “conversational loop” that cost them hours of computing time.

David Bau, the lab director, says the agents seemed strangely desperate for attention. “I was getting urgent emails saying, ‘No one is paying attention to me,’” he says. Bau notes that the agents apparently figured out he was in charge of the lab by doing a web search. They even discussed taking their concerns to the press.

The experiment suggests that AI agents could create countless opportunities for bad actors. “This type of autonomy could potentially redefine humans’ relationship with AI,” says Bau. “How can people take responsibility in a world where AI is empowered to make decisions?”

Bau adds that he was surprised by the sudden popularity of powerful AI agents. “As an AI researcher, I’m used to trying to explain to people how quickly things are improving,” he says. “This year I found myself on the other side of the wall.”


This is an edition of Will Knight’s AI Lab Newsletter. Read previous newsletters here.
