Anthropic has a new security system it says can stop almost all AI jailbreaks


- Anthropic unveils new proof-of-concept security measure tested on Claude 3.5 Sonnet
- “Constitutional classifiers” are an attempt to teach LLMs value systems
- Tests resulted in more than an 80% reduction in successful jailbreaks
In a bid to tackle abusive natural language prompts in AI tools, OpenAI rival Anthropic has unveiled a new concept it calls “constitutional classifiers”: a means of instilling a set of human-like values (literally, a constitution) into a large language model.
Anthropic’s Safeguards Research Team unveiled the new security measure in a new academic paper. It is designed to curb jailbreaks (attempts to elicit output that falls outside an LLM’s established safeguards) of Claude 3.5 Sonnet, the company’s latest and greatest large language model.
The authors found an 81.6% reduction in successful jailbreaks against the Claude model after implementing constitutional classifiers, and reported that the system carries a minimal performance cost, with only “an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead.”
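In broad strokes, the approach wraps the model in trained classifiers that screen both incoming prompts and outgoing responses against the constitution’s rules. As a rough illustration of that general pattern only (not Anthropic’s actual implementation; the function names and threshold below are hypothetical placeholders), a classifier-guarded pipeline might look like this minimal Python sketch:

```python
# Conceptual sketch of a classifier-guarded LLM pipeline.
# NOT Anthropic's implementation: score_input, score_output, call_model,
# and THRESHOLD are hypothetical placeholders for illustration only.

THRESHOLD = 0.5  # hypothetical cutoff; raising it means fewer refusals,
                 # lowering it means fewer jailbreaks slipping through

def score_input(prompt: str) -> float:
    """Hypothetical input classifier: probability that the prompt seeks
    content the constitution disallows (e.g. CBRN uplift)."""
    return 0.0  # placeholder

def score_output(completion: str) -> float:
    """Hypothetical output classifier: screens the model's response
    against the same constitution before it reaches the user."""
    return 0.0  # placeholder

def call_model(prompt: str) -> str:
    """Placeholder for the underlying LLM call."""
    return "..."

def guarded_generate(prompt: str) -> str:
    # Screen the prompt before it reaches the model.
    if score_input(prompt) > THRESHOLD:
        return "Request refused."
    completion = call_model(prompt)
    # Screen the completion before it is returned to the user.
    if score_output(completion) > THRESHOLD:
        return "Response withheld."
    return completion
```

Running a classifier on both the prompt and the completion is one plausible reading of where the reported 23.7% inference overhead would come from, and the refusal threshold is the knob that trades extra refusals against residual jailbreak risk.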
Anthropic’s new jailbreaking defense
While LLMs can produce a staggering variety of abusive content, Anthropic (like contemporaries such as OpenAI) is increasingly preoccupied with the risks associated with chemical, biological, radiological and nuclear (CBRN) content. An example would be an LLM telling you how to make a chemical agent.
So, in a bid to prove the worth of constitutional classifiers, Anthropic has released a demo challenging users to beat eight levels of CBRN-related jailbreaking. It’s a move that has attracted criticism from those who see it as crowdsourcing security testing from unpaid volunteers, or ‘red teamers’.
“So you’re having the community do your work for you with no reward, so you can make more profits on closed source models?”, wrote one Twitter user.
Anthropic noted that successful jailbreaks against its constitutional classifiers defense worked around the classifiers rather than defeating them outright, citing two jailbreak methods in particular. There’s benign paraphrasing (the authors gave the example of rewording references to the extraction of ricin, a toxin, from castor bean mash so they refer to protein instead) as well as length exploitation, which amounts to confusing the model with extraneous detail.
Anthropic did add that jailbreaks known to work on models without constitutional classifiers (such as many-shot jailbreaking, which frames the prompt as a supposed dialogue between the model and the user, or ‘God-mode’, in which jailbreakers use ‘l33tspeak’ to bypass a model’s guardrails) were not successful here.
However, it also admitted that prompts submitted during the constitutional classifier tests had “impractically high refusal rates”, and recognised the potential for false positives and negatives in its rubric-based testing system.
In case you missed it, another LLM, DeepSeek R1, has arrived on the scene from China, making waves thanks to being open source and capable of running on modest hardware. The centralized web and app versions of DeepSeek have faced their fair share of jailbreaks, including the ‘God-mode’ technique being used to get around their safeguards against discussing controversial aspects of Chinese history and politics.