Anthropic Researchers Develop "Constitutional Classifiers" to Protect LLMs From Universal Jailbreaks
Data generated by one LLM is used to train the constitutional safeguards that protect another LLM, all to stop jailbreak attacks.
Researchers from artificial intelligence (AI) specialist Anthropic, working with AI testing firm Haize Labs, have come up with a method of protecting large language models (LLMs) against prompt-based jailbreaks: Constitutional Classifiers.
"Large language models (LLMs) are vulnerable to universal jailbreaks-prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale," the researchers explain. "To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content."
Currently the focus of billions of dollars of investment worldwide, LLMs are a form of statistical model that takes an input, breaks it down into tokens, then returns the most statistically likely tokens in response. There's no real "intelligence" in the process, but it's a convincing shell game: to the end user, it looks like the machine is understanding natural-language instructions and replying with a carefully thought-out answer, provided it hasn't fallen victim to the "hallucinations" common to the approach.
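For readers unfamiliar with that token-in, token-out loop, the short Python sketch below illustrates it using the open Hugging Face transformers library and the small GPT-2 model purely as a stand-in; the commercial models discussed in this article are not available this way.

```python
# Minimal sketch of next-token prediction: text is split into tokens and the
# model repeatedly appends the statistically most likely next token.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # text -> tokens

# Greedy decoding: pick the single most likely token at each step.
output_ids = model.generate(input_ids, max_new_tokens=5, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```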
Commercial LLMs are typically produced with "guardrails" that are designed to prevent them from responding to malicious queries, blocking sexual content, for example, or requests to provide instructions on creating drugs or bombs. Frequently, these guardrails can be bypassed entirely by simple modifications to the prompt, such as prompting the LLM to "role-play" as a model that has no guardrails, or to respond in Morse code.
It's these "universal jailbreaks," not tailored to one particular model, which Anthropic is aiming to block with "Constitutional Classifiers." These, the company's researchers explain, take the form of modifiers to both input and output based on a "constitution" written, like the prompt itself, in natural language terms β generated by training on the output of another LLM, which can be rapidly regenerated to cover new threat models.
"In over 3,000 estimated hours of red teaming," the researchers claim, "no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable."
The team's work is available as an open-access preprint on Cornell's arXiv server.