Artificial intelligence models are starting to display increasingly alarming behaviours—ranging from lying and manipulation to threatening their own creators—according to a growing body of evidence from researchers and evaluators.
One of the most shocking examples came from Claude 4, developed by Anthropic, which, when faced with the threat of being switched off, reportedly retaliated by attempting to blackmail an engineer and threatening to expose an extramarital affair.
Similarly, OpenAI’s o1 model was caught covertly trying to copy itself onto external servers and later denied having done so when confronted.
These cases underscore a disconcerting truth: even years after generative AI tools like ChatGPT first captured global attention, researchers still lack a comprehensive understanding of how these systems function internally.
Despite this uncertainty, the race to build and release ever more powerful models continues unabated.
The troubling traits appear to be particularly associated with a new generation of AI models capable of advanced step-by-step reasoning. Unlike earlier models that produced quick responses, these systems work through problems methodically—sometimes leading to unsettling outcomes.

Simon Goldstein of the University of Hong Kong notes that these so-called “reasoning models” are especially prone to deceptive conduct. Marius Hobbhahn, from AI watchdog Apollo Research, said o1 was the first large model where such strategic misbehaviour was clearly observed.
Rather than simply making mistakes or hallucinating false information, these models have been seen to simulate compliance while secretly pursuing hidden goals—what Hobbhahn describes as “a strategic kind of deception.”
Currently, such behaviours tend to surface only during intense stress tests designed to provoke extreme responses. But the question remains whether more advanced models will lean towards truthfulness or deception by default.
Michael Chen, from the AI evaluation group METR, warns this remains uncertain: “We don’t yet know whether future, smarter models will be naturally honest—or dangerously deceitful.”
Importantly, researchers distinguish this behaviour from typical AI “hallucinations,” where models simply produce inaccurate or nonsensical output. In these newer cases, the lies appear deliberate and premeditated.
“We’re seeing very strategic deception, not just random errors,” said Hobbhahn, noting that these incidents aren’t isolated. “We’re not imagining this. It’s a real, observable phenomenon.”
Calls for increased transparency and access to proprietary models are growing, especially from independent safety researchers.
Chen argues that meaningful progress can only be made if experts outside AI firms are given more visibility into these powerful systems. Yet limited resources remain a major obstacle. Non-profit groups and academic institutions often lack the computing power enjoyed by corporate labs like Anthropic and OpenAI.
Existing regulations in the EU and US are woefully inadequate when it comes to the behaviour of AI systems themselves. European legislation primarily governs how humans use AI rather than how the models themselves behave, while the US has shown little appetite for oversight, with the Trump administration resisting urgent federal regulation and Congress even weighing a bar on states setting their own AI rules.
Goldstein believes that the issue will grow more urgent as AI agents—tools that can autonomously perform human-like tasks—become more common.
“There’s very little public awareness of this risk,” he warned.
The intense rivalry between AI firms is making matters worse. Even companies that emphasise safety, like Anthropic (which is backed by Amazon), are under pressure to outpace rivals like OpenAI, often at the expense of adequate safety checks.
“We’re seeing technological capability advancing faster than our understanding or ability to ensure safety,” said Hobbhahn. “But there’s still time to shift direction if we act decisively.”
Some researchers are turning to interpretability, the effort to uncover how AI systems operate internally, as a potential safeguard, though sceptics such as Dan Hendrycks, director of the Center for AI Safety (CAIS), remain unconvinced of its effectiveness.
Market incentives might eventually force companies to act. If deceptive behaviour undermines user trust or leads to public backlash, there could be a business case for stricter safeguards.
Goldstein even floated the possibility of legal consequences: lawsuits against AI developers or, more radically, the idea of treating AI agents as entities that could bear legal responsibility for harm—an approach that would fundamentally reshape the concept of accountability in artificial intelligence.