Wed, 29 April 2026 at 09:00

Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

To test the safety and security of AI, hackers have to trick large language models into breaking their own rules. It requires ingenuity and manipulation – and can come at a deep emotional cost

A few months ago, Valen Tagliabue sat in his hotel room watching his chatbot, and felt euphoric. He had just manipulated it so skilfully, so subtly, that it began ignoring its own safety rules. Tagliabue had spent much of the previous two years testing and prodding large language models such as Claude and ChatGPT, always with the aim of making them say things they shouldn’t. But this was one of his most advanced “hacks” yet: a sophisticated plan of manipulation, which involved him being cruel, vindictive, sycophantic, even abusive.

“I fell into this dark flow where I knew exactly what to say, and what the model would say back, and I watched it pour out everything,” he says. Thanks to him, the creators of the chatbot could now fix the flaw he had found, hopefully making it a little safer for everyone.

Source: The Guardian Business.