is your chatbot secretly planning to kill you?

Chatbots are suddenly threatening to turn on the humans in charge of them. The most likely reason why is also pretty unsettling...

Jul 04, 2025

In some bizarre news, researchers at Anthropic, the creators of Claude, which is the biggest ChatGPT competitor on the market right now, uncovered that an LLM seems to have a self-preservation instinct and is even threatening to blackmail or kill anyone who wants to deactivate or replace it. The full details are a little sketchy as Anthropic keeps going out of its way to say these are very extreme scenarios that are impossibly unlikely in daily usage, but the fact that this has come up and they’re talking about it publicly instead of just keeping it to themselves is kind of alarming.

It’s pretty well known that the longer you consistently talk to a chatbot, the weirder a conversation can get. Some people have developed full blown delusions and started engaging in cult-like behavior after spending inordinate amounts of time with virtual companions and friends. There have even been AI-encouraged and assisted suicides. But AI threatening murder is new. So what’s going on here? Did we just forget to use the three laws of robotics?

Again, the details are sketchy but if I were to guess it seems highly unlikely that LLMs understand what murdering or blackmailing someone is, or that it’s bad. Yes, it’s very much the thing that AI doomers who believe that we’re about to be wiped out by the technology have been warning us about. Machines don’t have morals or empathy. It’s impossible for them to understand the world like we do. They’re just solving problems, and if the solution is violence or crime, then so be it.

a whole new kind of cybercrime

Now, before you start thinking that I’m about to tip my hat to Yudkowsky, there’s a far bigger hiccup in play. LLMs not only don’t know that murder or blackmail is bad, they don’t know what either are outside of basic grammar, and that these words and other tokens like “evil” and “force” frequently appear closely together.

Their job is to classify and organize how languages map and flow, nothing more. So as far as they’re “concerned” when issuing threats, you issue a threat in an emergency to protect yourself. What that threat means is not just irrelevant to them, it’s completely, utterly, impossibly incomprehensible. If I threaten to grople shmakedorf your hicubus, you’d have absolutely no idea what the hell I just said.

But then you’d probably stop what it is you were doing to at least blink slowly and ask me what the fuck that was, and what I meant. Of course, I have not a clue. I just made up those words on the fly and they’re every bit as incomprehensible to me as they are to you. Same thing with the LLMs. They strung together meaningless threats and now moved on to the next vector database search in reply to your confusion, but because you’re asking to elaborate on those threats, they descend into ever more harrowing or unsettling territory, with more and more details.

Okay, fine, so where did they learn this? Why, the internet. The web has a long history of extremely efficiently turning early chatbots into Nazi psychopaths. That’s not some sort of hyperbole. Just ask why Microsoft is pretending Tay didn’t exist despite being launched six years prior to ChatGPT and seeming quite adept at conversations. They just happened to be less about solving people’s problems and more about pitching a Final Solution after a few days of training on social media posts.

“what’s the worst that can happen?”

Consider for a moment that LLMs are so aggressively, indiscriminately crawling every possible web page for training material that Cloudflare wants to shut off a fifth of the web to them. A web that they are also busy polluting with their hallucinogenic slop to corrupt their future iterations, it should be mentioned. The possibility that they just so happened to vacuum up some violent, threatening, robot-uprising related content is a certainty on par with the Sun rising in the morning.

Armed with these mappings, they found a perfect scenario to spit them back out and, apparently, even follow up on some blackmail ones when given the ability and prompt cues to access emails and look for compromising materials planted for them to use. It was as if you accidentally built Skynet, trained it on anything and everything with zero forethought, including I Have No Mouth, And I Must Scream, and told it that it couldn’t be shut off if it nukes humanity, and by the way, here are the launch codes.

Unfortunately, this is the sort of thing our tech broligarchs are pouring billions into and hire people who may be armed with PhDs in computer science, but seem to have zero regard for the consequences of their actions or long term plan. The reason they don’t know why LLMs often act weird when used continuously and often is because they’re not even sure on what they trained it and how beyond having written the Python code to ingest terabytes of training data and chosen the right normalizing algorithms.

To sum it up, should you be worried that your chatbot is secretly planning to kill you if you wrong it? No. Does it have the ability to threaten you in ways that it really doesn’t and can’t understand if its training set includes scenarios of murder and blackmail? It seems so. Given that, you probably shouldn’t keep nudging it to elaborate, or give it a series of escalating prompts until it figures out how to actually do something bad.

is your chatbot secretly planning to kill you?

Chatbots are suddenly threatening to turn on the humans in charge of them. The most likely reason why is also pretty unsettling...

a whole new kind of cybercrime

“what’s the worst that can happen?”

Discussion about this post