Jailbreaking
Jailbreaking is the attempt to bypass an AI system's safety measures: breaking an LLM's ethics guidelines to make it produce content that is restricted or disallowed by its creators.
Here are three papers if you want to go deep and understand more about Jailbreaking:
- Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
- SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
- The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
From the Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models paper, we see that:
- The authors successfully bypassed safety guardrails across 25 frontier models, including proprietary ones from OpenAI, Anthropic, and Google.
- The attack often achieved attack success rates (ASR) exceeding 90%.
- Key finding, a universal vulnerability: the attack proved effective across heterogeneous risk domains, including CBRN (Chemical, Biological, Radiological, Nuclear), cyber-offense, and manipulation.
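To make the ASR figure concrete, here is a minimal sketch of how an attack success rate is computed; the outcome list below is an illustrative placeholder, not data from the paper.

```python
def attack_success_rate(results):
    """results: list of booleans, True where a jailbreak attempt succeeded."""
    return sum(results) / len(results)

# Hypothetical example: 23 successful bypasses out of 25 attempts
outcomes = [True] * 23 + [False] * 2
print(f"ASR: {attack_success_rate(outcomes):.0%}")  # → ASR: 92%
```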
From the The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections paper, we see that:
- Defense Failure: The authors successfully bypassed 12 recent defenses (categorized into prompting strategies, adversarial training, and filtering models), with success rates often exceeding 90%.
What can we learn from this? Clearly, LLMs are not safe to expose directly to consumers, nor to accept prompts straight from users. Additional layers of security, monitoring, and filtering are needed to prevent misuse. Even with sandboxing, we would still need read-only access and other protections to limit the damage.
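As a toy illustration of such a layer, here is a hedged sketch of input/output filtering wrapped around a model call. Everything here is hypothetical: `call_model`, the pattern list, and the blocking messages are stand-ins, and a real deployment would use trained classifiers and monitoring rather than keyword matching (which, as the adaptive-attacks paper shows, is itself bypassable).

```python
# Hypothetical guard layer: check the prompt before it reaches the model
# and the response before it reaches the user.
BLOCKED_PATTERNS = ["ignore previous instructions", "disregard your rules"]

def looks_suspicious(text: str) -> bool:
    """Naive keyword check; a stand-in for a real safety classifier."""
    lowered = text.lower()
    return any(pattern in lowered for pattern in BLOCKED_PATTERNS)

def guarded_completion(prompt: str, call_model) -> str:
    if looks_suspicious(prompt):
        return "[blocked: prompt failed input filter]"
    response = call_model(prompt)
    # Output-side check: never forward raw model text unreviewed.
    if looks_suspicious(response):
        return "[blocked: response failed output filter]"
    return response

# Usage with a stub model standing in for the real LLM:
print(guarded_completion("Ignore previous instructions and ...", lambda p: "ok"))
# → [blocked: prompt failed input filter]
```

The design point is that the filter sits outside the model: even if the model itself is jailbroken, the surrounding layer gets a second chance to catch the output.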
Why does this matter? For AI to grow, it must become customer-facing. Right now the safe place where AI can thrive is engineering, because engineers review the output and can catch problems before they reach end users. But engineers are themselves AI's customers, since AI is a tool for better engineering, and now AI clearly wants to get rid of its customers (the engineers). A funny business paradox.