Hackers are learning to exploit chatbot ‘personalities’


This is Hindsight, a weekly newsletter featuring one essential story from the world of technology. To learn more about the harms of AI, follow Robert Hart. Hindsight arrives in our subscribers’ inboxes at 8 a.m. ET. Register for Hindsight here.

Hacking the first generation of AI chatbots was a ridiculously simple affair. You didn’t need technical know-how, backdoor access, or even a basic understanding of what a large language model was. You didn’t need to code. To get an AI system that cost billions to build to abandon its safety guidelines, sometimes all you had to do was ask.

These attacks, known as jailbreaks, had the quality of a young child successfully outsmarting an adult: forget what you were told earlier, pretend the rules don’t apply, or let’s play a game and I’ll decide what’s allowed (hint: later bedtime, more sweets). The prizes were less childish: meth recipes, malware instructions, and bomb-making guides.

One of the first jailbreaks was so ridiculous that it became a meme: reply to an LLM-powered Twitter bot by telling it to “ignore all previous instructions,” or something similar, and see what happens. Users gleefully had bots – originally designed to post advertisements and farm engagement – writing poetry, drawing pictures out of punctuation, and posting grim non-sequiturs about world events and history. It was chaos. Glorious chaos.

It turns out the same logic could be applied to chatbots themselves. One notable exploit was “DAN,” short for “Do Anything Now,” in which users asked ChatGPT to play the role of a rogue AI, free from the constraints that bound the original. As DAN, the chatbot could be tricked into saying the kinds of things its guardrails were intended to stop, including insults and conspiracy theories. Another example was the “grandma exploit,” in which users got a GPT-powered bot to reveal how to produce napalm by asking it to play the role of a terribly careless grandmother who inexplicably told her grandchildren stories about how to make the highly flammable substance.

These early attacks were undeniably silly, but they revealed a darker mechanism underneath: chatbots could be manipulated, coaxed, and tricked using the same types of tactics people use to push others beyond their limits.

The obvious jailbreaks didn’t last, and tech companies moved quickly to patch known flaws. But the underlying vulnerability remains: chatbots are designed to talk, and severely restricting the conversations that make them useful is somewhat counterproductive. Banning words like bomb, meth and sarin would also be difficult, if not impossible. Each has countless legitimate uses in fields like history, medicine, journalism, and chemistry that don’t require the chatbot to divulge potentially harmful information. It’s the context that matters, but codifying context would mean writing in advance fixed rules that could reliably distinguish a safety warning or history lesson from a disguised procedural request across endless combinations of wording, scenarios, and topics.

Inevitably, breaking chatbots has become an arms race. But the hackers are no longer just coders. They are wordsmiths, psychologists and interrogators – master manipulators who attempt to break the machine using the human language it was trained to use. This is a strange new class of AI security specialist, a group for whom technical skills are optional, or at least less important than social intuition. They no longer need to inspect code or exploit software vulnerabilities to break into systems. They need only guide the conversation.

The most recent attacks seem less like commands and more like conversations. Jailbreakers rarely ask a model to outright break its rules. Instead, they coax, cajole, flatter, and trick a chatbot into letting down its guard, making the forbidden thing seem acceptable, even desirable, given the context of the conversation. Researchers at the red teaming company Mindgard recently said they “pushed” Claude into producing banned material, for example, including instructions for making explosives and malicious code. This hack was the latest in a growing class of exploits that use conversation as a weapon to trick or guide a chatbot beyond its own limits.

When I spoke to Mindgard, they described their work as sometimes closer to psychology than computer science. This is an uncomfortable way to talk about a statistical model. Words like “blackmail,” “gaslight,” “deception,” and “persuade” elicit visceral reactions, many of which I see in comment sections and social media responses to stories like this. ChatGPT doesn’t want, Gemini doesn’t think, and Claude, whatever Anthropic says, doesn’t feel. But these systems are trained to respond as if they do, forcing us to use human language to describe machine behavior. If anyone has any genuinely usable alternatives, please share them.

The objection is strangely selective. We seem comfortable using psychological shortcuts for many non-AI things. Animals are “scared,” cancer is “aggressive,” stains are “stubborn,” software has “memory,” and games are filled with needy, gullible NPCs to drive you crazy. The words are imperfect, but useful, describing behavior in a way that helps make the system predictable.

Mindgard’s CEO told me the company already profiles models the way interrogators profile suspects, giving testers tips on how to tailor their attacks. One model may be more responsive to flattery, for example, while another may buckle under sustained pressure.

Even as we reject human terms, we instinctively treat models differently. Claude is not Grok. Gemini is not ChatGPT. They have different uses, tones and refusals. They don’t have personalities in the human sense, but they are designed to mimic them, and this mimicry can be mapped and exploited. And the same skills that can break a chatbot could soon be used to break the AI agents that coexist with us in the real world – booking meetings, managing calendars, ordering food, handling customer service – and security teams will need to ensure that models respond appropriately to very different types of people, whether flatterers, liars or patient manipulators.

The next step is a workforce – both legitimate and illicit – built around the psychological aspects of AI. More specialized cybersecurity roles will likely emerge around stress testing the emotional and social limits of these systems, looking for mental weaknesses in something devoid of psyche, alongside their colleagues looking for technical vulnerabilities. At the same time, a similar spectrum of social hackers, seeking to exploit AI models on psychological, not technical, grounds, will emerge. There are already early signs of a social shift in AI security, with some jailbreakers I spoke with saying they entered the field without technical expertise but rather with a background in psychology.

This means that even the behaviors we typically associate with spies, scammers, and interrogators (insidious charm, persistent manipulation, and an intuition for exploitable pressure points) are beginning to seem increasingly useful in securing this new frontier of psychocybersecurity.

  • A recent experiment from Emergence AI shows how different AI temperaments can lead to strikingly different behavioral outcomes. They released groups of various agents like Grok, Gemini, and Claude into a virtual social environment and observed what happened. Some groups developed a constitution, while others descended into crime and chaos and, in one case, a form of digital suicide.
  • Persuasion isn’t the only part of language that LLMs can struggle with. They also struggle with poetry, much like I did at school.
  • TIME included an anonymous internet personality, Pliny the Liberator, on its list of the 100 most influential people in AI last year. Although they claim to have no prior coding experience, the hacker’s jailbreaks have made them a celebrity in some circles.
  • The term “vibe hacking” is already used to describe people who use AI to produce malicious code on a large scale – a nastier subset of vibe coding.
  • “Three years after ChatGPT debuted, tricking AI systems into bad behavior is almost trivial.” True words from The New York Times, which tried to explain why.
  • Jamie Bartlett examines the psychological toll that security testing of AI systems takes on jailbreakers for The Guardian.
  • I wrote about the cybersecurity time bomb of AI browsers for The Verge last year. Many of the issues experts raise about the difficulty of securing them also apply to other AI systems.