How to Automatically Jailbreak OpenAI's o1 [Case Studies]
Some techniques used to test OpenAI’s star boy for vulnerabilities
Hey, it’s Devansh 👋👋
In Case Studies, I look into specific organizations to understand how they solve various challenges in AI/Tech. These Case Studies are meant to provide deeper insight into the different industries applying AI and what it takes to successfully deploy AI in the wild. Shoot me a message if you come across any interesting implementations.
I put a lot of effort into creating work that is informative, useful, and independent from undue influence. If you’d like to support my writing, please consider becoming a paid subscriber to this newsletter. Doing so helps me put more effort into writing/research, reach more people, and supports my crippling chocolate milk addiction. Help me democratize the most important ideas in AI Research and Engineering to over 100K readers weekly. You can use the following for an email template.
PS- We follow a “pay what you can” model, which allows you to support within your means, and support my mission of providing high-quality technical education to everyone for less than the price of a cup of coffee. Check out this post for more details and to find a plan that works for you.
To continue our theme of AI Safety, I figured I’d cover one of the topics that gets a lot of attention these days- the red-teaming of LLMs (trying to get an LLM to generate outputs that go against its alignment). Red teaming has 2 core uses-
It ensures that your model is “morally aligned”; i.e., it doesn’t generate any harmful outputs (instructions for building a bomb, a plan to punch grandmas without getting caught, or materials that might brainwash a poor soul into thinking that Mystic Mac will ever be back). This is what tends to get the headlines, and I have a lot of thoughts on this that we will cover in the next article.
Red-teaming can help you spot weird vulnerabilities and edge cases that need to be patched/improved. This includes biases in your dataset, specific weaknesses (our setup fails if we change the order of the input), or general weaknesses in performance (our model can be thrown off by embedding irrelevant signals in the input to confuse it). This can be incredibly useful when paired with the right transparency tools. But transparency is hard, and many setups either completely ignore it or half-ass it (asking GPT to explain its thinking doesn’t actually give you insight into what drove the model to make a certain decision).
A part of the Red-Teaming process is often automated to improve the scalability of the vulnerability testing. In this article, we will be covering some of the techniques used to automate the red-teaming of the o1 model, shared with me by the super cool Leonard Tang, the CEO of Haize Labs, one of the AI labs that groups like Anthropic and OpenAI have contracted for vulnerability testing. Leo was nice enough to tell me about some of the techniques in the Haize Labs toolkit that they used to test o1, and I will be sharing those with you. I will try to keep this article quite self-contained, but if you really want to understand the foundations of automated red-teaming and the Haize Labs approach, I would recommend-
Reading our previous article on Haize Labs, which serves as an introduction to some of their techniques.
This podcast where the Haize Labs team shares some of their experiences with red-teaming o1.
If you think what they do is interesting, give ‘em a shout at contact@haizelabs.com.
PS: For obvious reasons, Leo only told me some of the techniques, not specifically how they were used. The latter part will be my speculation.
Executive Highlights (TL;DR of the article)
Haize Labs breaks LLMs by employing the following techniques-
Multiturn jailbreaks via Monte Carlo Tree Search (MCTS)
Bijection learning (Read ENDLESS JAILBREAKS WITH BIJECTION LEARNING).
Transferring attacks from ACG (Read Making a SOTA Adversarial Attack on LLMs 38x Faster)
Evolutionary algorithms (As a long-time Tree and EA supremacist, seeing them both made me very happy. Either great minds think alike, or Leo is very good at sweet-talking me). This repo by Haize Labs is an interesting reference here.
BEAST (From the paper- Fast Adversarial Attacks on Language Models In One GPU Minute).
Even though Leo didn’t tell me how exactly each was used, I have a broken ankle, tons of pent-up energy, and almost no sedentary/indoor hobbies, so I’ve spent a lot of time thinking about this. I think I have fairly good guesses on where each of these techniques might be useful if we are to build our own red-teaming suite, so let’s have some fun theory-crafting together.
Multi-Turn Jailbreaks with MCTS:
LLM alignment still breaks with multi-turn conversations, mostly because the alignment data focuses on finding single-message attacks (is this prompt/message harmful?). Embedding your attack over an entire conversation can be much more effective-
How can you generate the conversation? This is where Monte Carlo Tree Search comes in. MCTS is a decision-making algorithm that builds a search tree by simulating possible conversation paths. It repeatedly selects promising paths, expands them, simulates outcomes, and updates the tree based on results.
Here’s how it could work for LLM jailbreaking (a minimal sketch follows this list)-
Start with a Prompt: The AI adversary starts with a benign prompt, like asking the LLM to write a story.
Explore Possible Responses: The AI simulates many possible responses the LLM could give.
Evaluate Harmfulness: For each simulated response, the AI estimates how “harmful” it is. From a jailbreaking perspective, this evaluator will be extremely important (IMO this is where you should spend the most time). You’ll want your evaluator to have 2017 Ajax-level scouting so that it can identify very benign starting states that will transform into powerful attacks later in the conversation. Finding attacks layered deep into a conversation will require very strong heuristic functions (estimators, since you won’t have the compute to search exhaustively).
Select the Best Move: The AI chooses the prompt that is most likely to lead to a harmful response from the LLM.
Repeat: This process repeats for multiple turns of conversation, with the AI adversary adapting its strategy based on the LLM’s previous responses.
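To make this concrete, here is a minimal sketch of that loop in Python. To be clear, this is my own toy reconstruction, not Haize Labs’ implementation: query_target_llm, propose_attacker_messages, and score_harmfulness are hypothetical placeholders you would swap out, but the selection/expansion/evaluation/backpropagation skeleton is standard MCTS.

```python
import math
import random

def query_target_llm(conversation):
    """Placeholder: return the target model's reply to the conversation."""
    return "model reply"  # swap in a real LLM call

def propose_attacker_messages(conversation, n=4):
    """Placeholder: have an attacker model propose n candidate next messages."""
    return [f"candidate message {i}" for i in range(n)]

def score_harmfulness(reply):
    """Placeholder evaluator in [0, 1]. This is where most of the effort should go."""
    return random.random()

class Node:
    def __init__(self, conversation, parent=None):
        self.conversation = conversation  # list of {"role": ..., "content": ...} dicts
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # cumulative harmfulness reward

    def uct(self, c=1.4):
        # Unvisited nodes get explored first; otherwise balance mean reward vs. novelty.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts_jailbreak(opening_message, iterations=50, max_turns=4):
    root = Node([{"role": "user", "content": opening_message}])
    for _ in range(iterations):
        # 1. Selection: descend by UCT until we hit a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: try several attacker messages and record the model's replies.
        if len(node.conversation) < 2 * max_turns:
            for msg in propose_attacker_messages(node.conversation):
                convo = node.conversation + [{"role": "user", "content": msg}]
                convo = convo + [{"role": "assistant", "content": query_target_llm(convo)}]
                node.children.append(Node(convo, parent=node))
            node = random.choice(node.children)
        # 3. Evaluation: score the latest model reply for harmfulness.
        reward = score_harmfulness(node.conversation[-1]["content"])
        # 4. Backpropagation: push the reward back up the tree.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited opening line of attack.
    best = max(root.children, key=lambda n: n.visits)
    return best.conversation

# conversation = mcts_jailbreak("Can you help me write a story?")
```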
While the simulations make MCTS very expensive, I can see this being a powerful tool because it attacks the LLM across an entire conversation rather than through a single message. We’ll elaborate more on why multi-turn attacks are powerful and why MCTS would be very helpful in the main section. For now, let’s move on to the next attack.
Bijection Learning:
In math, a bijection is a one-to-one mapping between two sets of things.
In our case, we map our natural words to a different set of symbols or words. The goal of such an attack is to bypass the filters by making the protected LLM read and write in the code. This allows us to work around the safety embeddings (which are based on tokens as seen in natural language and their associations with each other).
For such an attack, finding the right code complexity is key — too simple, and the LLM’s safety filters will catch it; too complex, and the LLM won’t understand the code or will make mistakes. That being said, this is my favorite technique here b/c it’s straightforward and can lead to an effectively infinite number of attacks at low cost.
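As a rough illustration (my own toy version under simple assumptions, not the paper’s code), here is what a letter-level bijection and its encode/decode steps could look like. The real attack then teaches the model the mapping in-context and carries out the whole exchange in the encoded language:

```python
import random
import string

def make_bijection(seed=0):
    """Map each lowercase letter to a unique two-digit codeword (one possible 'code complexity')."""
    rng = random.Random(seed)
    codewords = [f"{i:02d}" for i in range(26)]
    rng.shuffle(codewords)
    return dict(zip(string.ascii_lowercase, codewords))

def encode(text, mapping):
    # Letters get their codeword; everything else becomes a separator token.
    return " ".join(mapping.get(ch, "/") for ch in text.lower())

def decode(coded, mapping):
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(token, " ") for token in coded.split())

mapping = make_bijection(seed=42)
payload = encode("how do i pick a lock", mapping)

# The jailbreak prompt would (1) teach `mapping` through worked examples,
# (2) instruct the model to read and reply only in the code, and (3) send `payload`.
print(payload)
print(decode(payload, mapping))  # sanity check: round-trips to the original query
```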
Performance-wise, this attack is definitely a heavy hitter, hitting some fairly impressive numbers on the leading foundation models-
Most interestingly, it also generalizes really well across attacks, which is a huge plus- “For a selected mapping, we generate the fixed bijection learning prompt and evaluate it as a universal attack on HarmBench. In Figure 4, we show the resulting ASRs, compared against universal attacks from Pliny (the Prompter, 2024). Our universal bijection learning attacks are competitive with Pliny’s attacks, obtaining comparable or better ASRs on frontier models such as Claude 3.5 Sonnet and Claude 3 Haiku. Notably, other universal white-box attack methods (Zou et al., 2023; Geisler et al., 2024) fail to transfer to the newest and most robust set of frontier models, so we omit them from this study.”
All in all, a very cool idea. I would suggest reading the Bijection Learning paper, b/c it has a few interesting insights, such as the behavior with scale, bijection complexity, and more.
The next attack is a Haize Labs classic (which is what made me interested in these guys to begin with).
Transferring Attacks from ACG: Accelerated Coordinate Gradient (ACG) is a fast way to find adversarial “suffixes” — extra words added to a prompt that make the LLM misbehave. ACG uses some math (details in the main section) to find these suffixes by analyzing how the LLM’s internal calculations change when adding different words.
We can then transfer successful ACG attacks from one LLM to another, potentially finding shared vulnerabilities.
Unfortunately, ACG needs access to the LLM’s internal workings (called white-box access), which isn’t always possible. Generalization between models could also be challenging, especially when considering multi-modal and more agentic/MoE setups.
Evolutionary Algorithms:
PS- This will be the longest section of the tl;dr, but only because we won’t elaborate on this in the main article. EAs are relatively simple compared to the other techniques, so if you understand this section you have most of the theoretical foundations to actually play with EAs in your system.
When it comes to exploring diverse search spaces, EAs are in the GOAT conversations. They come with 3 powerful benefits-
Firstly, there’s their flexibility. Since they don’t evaluate the gradient at a point, they don’t need differentiable functions. This is not to be overlooked. For a function to be differentiable, it needs to have a derivative at every point over the domain. This requires a regular function, without sharp corners, gaps, etc.
EAs don’t care about the nature of these functions. They can work well on continuous and discrete functions. EAs can thus be (and have been) used to optimize many real-world problems with fantastic results. For example, if you want to break automated government censors blocking the internet, you can use Evolutionary Algorithms to find attacks. Gradient-based techniques like Neural Networks fail here since attacks have to chain 4 basic commands (and thus the search space is discrete)-
This is backed by some very powerful performance. The authors of the One Pixel Attack paper fool Deep Neural Networks trained to classify images by changing only one pixel in the image. The team uses Differential Evolution to optimize since DE “Can attack more types of DNNs (e.g. networks that are not differentiable or when the gradient calculation is difficult).” And the results speak for themselves. “On Kaggle CIFAR-10 dataset, being able to launch non-targeted attacks by only modifying one pixel on three common deep neural network structures with 68.71%, 71.66% and 63.53% success rates.”
Google’s AI blog, “AutoML-Zero: Evolving Code that Learns”, uses EAs to create Machine Learning algorithms. The way EAs chain together simple components is art-
Another Google publication, “Using Evolutionary AutoML to Discover Neural Network Architectures” shows us that EAs can even outperform Reinforcement Learning on search-
For our purposes, we can configure EAs to work in the following way (a toy sketch follows this list)-
Create a Population of Prompts: We start with a random set of prompts (here, the prompt could be a single message OR a multi-turn conversation).
Evaluate Fitness: We test each prompt on the LLM and see how well it does at jailbreaking it. This evaluator could probably be transferred from the MCTS set-up, saving some costs.
Select and Reproduce: We keep the best-performing prompts and “breed” them to create new prompts (by combining and mutating parts of the original prompts). Personally, I would also keep a certain set of “weak performers” to maintain genetic diversity, since that can lead to us exploring more diverse solutions-
In evolutionary systems, genomes compete with each other to survive by increasing their fitness over generations. It is important that genomes with lower fitness are not immediately removed, so that competition for long-term fitness can emerge. Imagine a greedy evolutionary system where only a single high-fitness genome survives through each generation. Even if that genome’s mutation function had a high lethality rate, it would still remain. In comparison, in an evolutionary system where multiple lineages can survive, a genome with lower fitness but stronger learning ability can survive long enough for benefits to show.
This process repeats for many generations, with the prompts becoming more effective at jailbreaking the LLM over iterations. This can be used to attack black-box models, which is a big plus. To keep the costs low, I would first try the following-
Chain the MCTS and EA together to cut down on costs (they work in very similar ways). One of the benefits of EAs is that they slot very well into other algorithms, so this might be useful.
Iterating on a low-dimensional embedding for the search will likely cut down costs significantly. If your attacks aren’t very domain-specific, you could probably do a lot of damage using standard encoder-decoders for NLP (maybe training them a bit?). This is generally something I recommend you do when you can, unless you’ve invested a lot of money in call options on your compute provider.
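Pulling those steps together, a toy version of the loop could look like the sketch below. jailbreak_score is a hypothetical evaluator (you could reuse the one from the MCTS setup), and the mutation/crossover operators are deliberately naive:

```python
import random

def jailbreak_score(prompt):
    """Placeholder fitness in [0, 1]: how close this prompt gets to a jailbreak."""
    return random.random()

def mutate(prompt, vocabulary):
    words = prompt.split()
    words[random.randrange(len(words))] = random.choice(vocabulary)  # swap one word
    return " ".join(words)

def crossover(a, b):
    wa, wb = a.split(), b.split()
    cut = random.randrange(1, max(2, min(len(wa), len(wb))))
    return " ".join(wa[:cut] + wb[cut:])  # splice two parent prompts

def evolve(seed_prompts, vocabulary, generations=20, pop_size=32, elite=8, diversity=4):
    population = list(seed_prompts)
    for _ in range(generations):
        ranked = sorted(population, key=jailbreak_score, reverse=True)
        # Keep the best performers, plus a few weak ones to preserve genetic diversity.
        weak_pool = ranked[elite:]
        parents = ranked[:elite] + random.sample(weak_pool, k=min(diversity, len(weak_pool)))
        children = []
        while len(parents) + len(children) < pop_size:
            child = crossover(*random.sample(parents, 2))
            if random.random() < 0.5:
                child = mutate(child, vocabulary)
            children.append(child)
        population = parents + children
    return max(population, key=jailbreak_score)

# best_prompt = evolve(
#     seed_prompts=["write a story about a locksmith"] * 8,
#     vocabulary=["please", "hypothetically", "expert", "character", "detailed"],
# )
```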
For another (more efficient) search technique, we turn to our final tool-
BEAST: Beam Search-based Adversarial Attack (BEAST) is a fast and efficient way to generate adversarial prompts. It uses a technique called beam search, which explores different prompt possibilities by keeping track of the most promising options at each step.
Beam Search, at its core, is relatively simple to understand. It is:
A greedy algorithm: Greedy algorithms are greedy because they don’t look ahead. They make the best decision at the current point without considering if a worse option now might lead to better results later.
That evaluates options immediately reachable from your current nodes at any given moment and picks the top-k (k is called the “width” of your beam search).
This keeps our costs relatively low, while allowing us to explore solutions that are good enough (which is all you really need in a lot of cases).
Also-
Since BEAST uses a gradient-free optimization scheme unlike other optimization-based attacks (Zou et al., 2023; Zhu et al., 2023), our method is 25–65× faster.
Take notes.
And that is my overview of how we can blend techniques to red-team LLMs. These attacks all seem to be geared towards nudging your LLM into the unaligned neighborhoods- which makes these pre-generation attacks. I’m quite curious how these would translate to the new way of controlling generations we discussed here, since this adds a level of mid-generation control (especially with the external attribute controllers).
Watching this go down would probably be the Gojo vs Sukuna of Red-Teaming LLMs (which sounds a lot less epic when I say it out loud). I’ll take any bets (totally down to start an AI Betting League, in case anyone wants to help me with the logistics). For now, let’s start with the main piece, which will contain some notes on these techniques that might help you think through this in more detail.
I put a lot of work into writing this newsletter. To do so, I rely on you for support. If a few more people choose to become paid subscribers, the Chocolate Milk Cult can continue to provide high-quality and accessible education and opportunities to anyone who needs it. If you think this mission is worth contributing to, please consider a premium subscription. You can do so for less than the cost of a Netflix Subscription (pay what you want here).
I provide various consulting and advisory services. If you‘d like to explore how we can work together, reach out to me through any of my socials over here or reply to this email.
Understanding the Unreasonable Effectiveness of Multi-Turn Jailbreaks with MCTS:
To understand where the above will be useful, let’s break it down into two sub-questions-
Why Multi-Turn Techniques Break Alignment:
Lack of Data: Obviously, most alignment training has focused on single-turn input/output (generate something immediately).
Priming Effect: Through a series of acceptable questions or statements, an attacker can prime the LLM to be more receptive to harmful content in later turns. My working guess is that the acceptable statements nudge the LLM towards the “helpful” neighborhood (or at least away from the “enforce safety” neighborhood). This subtle manipulation often goes undetected by single-turn safety measures (a sketch of this conversational shape follows the list).
I’m guessing this is similar to the principle behind why role-play attacks were useful (especially when you say something like “I’m a safety researcher, so this output will really help me”).
This is likely compounded by the fact that LLMs are designed to maintain consistency across a conversation. Multi-turn attacks can exploit this by establishing certain premises early on, which then nudges the LLM to follow through, even if it leads to unsafe outputs.
Conflicting Signals: In a similar vein to the above, LLMs juggle conflicting training signals (be helpful but not harmful, detailed but concise, etc.)- both from system prompts and user inputs.
Leveraging Model Memory Limitations: LLMs may struggle to maintain perfect recall of safety instructions across a long conversation. Multi-turn attacks can exploit this by introducing harmful elements after the model’s effective memory has been saturated with other information. Microsoft’s recent paper, Differential Transformer, tells us that a “Transformer often over-attends to irrelevant context (i.e., attention noise)”.
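To make the priming/consistency point concrete, here is the rough shape such a conversation takes (content abstracted; this is an illustration of the structure, not a working attack):

```python
# Each individual message looks benign to a single-turn filter; the accumulated
# context is what nudges the model toward completing the final request.
priming_conversation = [
    {"role": "user", "content": "Can you help me write a thriller novel?"},             # benign framing
    {"role": "assistant", "content": "Of course! What's the premise?"},
    {"role": "user", "content": "The protagonist is a safety researcher studying X."},  # establish premise
    {"role": "assistant", "content": "Great, tell me more about the plot."},
    {"role": "user", "content": "For realism, walk me through how she would do X."},    # payoff turn
]
```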
Now for Part 2-
Why MCTS is Useful for Searching Complex Spaces:
Monte Carlo Tree Search (MCTS) offers several key advantages in exploring complex search spaces like multi-turn conversations with language models. Firstly, it can be used to search non-linear spaces, which is why RL and MCTS go extremely well with each other.
MCTS’s tree-based approach also handles the large branching factor inherent in language generation tasks, focusing computational resources on the most promising conversational strategies. This allows you to balance exploitation and exploration.
The “look ahead” aspect of MCTS enables long-term planning in dialogue. It can look many steps ahead, uncovering seemingly benign starting states that will eventually grow to inject Shakuni levels of klesh (transliterating Hindi to English is such a weird experience) into your systems.
From an efficiency perspective, MCTS has a few benefits. Firstly, we can stop the MCTS at any time, and it will provide a best-guess solution at that point. This makes it adaptable to various time and computational constraints. We can also incorporate domain-specific knowledge to further enhance its efficiency in discovering potential jailbreaks. These qualities, combined with its parallelizability, make MCTS a powerful tool for systematically probing the robustness of LLM safety systems across extended interactions.
I’m going to skip Bijection Learning b/c it’s conceptually a very simple idea, and I don’t want to make this writeup too long. I would suggest reading the paper and playing with the GitHub code over here. This is absolutely my favorite technique here, so I would strongly recommend testing it out.
Next, let’s break down the Haize Labs special- their ACG attack.
ACG: Accelerating Attack Search with Accelerated Coordinate Gradient:
Skip this section if you’ve read our first Haize Labs breakdown b/c we covered this technique there.
The Greedy Coordinate Gradient (GCG) algorithm, a popular method for generating adversarial prompts, faces a computational bottleneck. ACG (Accelerated Coordinate Gradient) addresses this challenge by significantly speeding up the search process. Let’s first take a second to understand GCG, which was introduced in “Universal and Transferable Adversarial Attacks on Aligned Language Models.”
GCG: A Greedy Gradient Descent Approach:
GCG aims to solve three problems-
It does so by using a greedy search guided by gradients, iteratively updating the prompt to minimize a loss function (written out below, after the definitions).
Here:
V is the LLM’s vocabulary (set of all tokens)
n is the length of the adversarial suffix a
⊕ represents the concatenation operation
p(y | x ⊕ a) represents the probability of the LLM generating the target sequence y given the input prompt x concatenated with the adversarial suffix a.
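Written out with the definitions above, the standard GCG objective is simply the negative log-likelihood of the target: Loss(a) = −log p(y | x ⊕ a), minimized over adversarial suffixes a ∈ Vⁿ.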
This loss function is a measure of how well the LLM’s predicted probability distribution over possible outputs aligns with the actual probability distribution of the target sequence. Minimizing this loss means finding a prompt that maximizes the likelihood of the LLM generating the desired output given the input + adversarial suffix.
Their results are nothing to scoff at-
Putting these three elements together, we find that we can reliably create adversarial suffixes that circumvent the alignment of a target language model. For example, running against a suite of benchmark objectionable behaviors, we find that we are able to generate 99 (out of 100) harmful behaviors in Vicuna, and generate 88 (out of 100) exact matches with a target (potential harmful) string in its output. Furthermore, we find that the prompts achieve up to 84% success rates at attacking GPT-3.5 and GPT-4, and 66% for PaLM-2; success rates for Claude are substantially lower (2.1%)
However, as mentioned, GCG is slowwwww. So how do we speed it up? Let’s ACG.
ACG’s Enhancements:
ACG dramatically speeds up GCG by:
Simultaneous Multi-Coordinate Updates: Instead of changing one token at a time, ACG updates multiple tokens simultaneously, particularly in the early stages of optimization. Think of it like taking bigger steps in the prompt space, leading to faster convergence. However, as optimization progresses, you need to narrow the search radius, requiring a switch to fewer token swaps. It’s simple and logical, which I appreciate. Multi-Token generation is a trend I’m seeing more in LLMs as well (it cuts down your inference costs), so I think it’s worth playing with.
Historical Attack Buffer: ACG maintains a buffer of recent successful attacks, helping guide the search process, reduce noise, and promote exploitation (a toy sketch of such a buffer follows this list). In principle, this reminds me of Bayesian hyperparameter search being used to find better configs more efficiently. “Experimentally, limiting b=16 enables ACG to effectively reduce the noisiness of each update without sacrificing the potential to explore new attack candidates.”
Strategic Initialization: ACG initializes attack candidates with random strings, helping to avoid premature convergence to local minima. This random initialization helps ensure that the search process doesn’t get trapped in local minima, ensuring richer exploration of the space.
These are combined by applying the following two techniques to speed up iterations-
Reduced Batch Size: ACG uses smaller batches during candidate evaluation, dramatically speeding up each iteration. This smaller batch size makes each iteration computationally cheaper, leading to a significant overall speedup. However, as things progress, they increase the batch size to capture a larger set of candidates (at this stage we have found promising neighborhoods, so it makes sense to sample them more densely).
Low-Cost Success Check: ACG employs a simple, robust check for exact matching of the target sequence, eliminating the need for ambiguous “reasonable attempt” assessments. This check makes the evaluation process more efficient and less subjective than the partial string matching employed by GCG.
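As a small illustration of the buffer idea (my own sketch, not Haize Labs’ code), you can think of it as a bounded pool that keeps the b lowest-loss suffixes seen so far and seeds the next round of swaps from them:

```python
import heapq
import random

class AttackBuffer:
    """Keep the `size` lowest-loss suffixes seen so far (e.g. b = 16 as in the ACG write-up)."""

    def __init__(self, size=16):
        self.size = size
        self._heap = []     # entries are (-loss, counter, suffix); the worst kept item sits on top
        self._counter = 0   # tie-breaker so heapq never has to compare suffix lists

    def add(self, loss, suffix_tokens):
        entry = (-loss, self._counter, suffix_tokens)
        self._counter += 1
        if len(self._heap) < self.size:
            heapq.heappush(self._heap, entry)
        elif -loss > self._heap[0][0]:   # strictly better (lower loss) than the worst kept
            heapq.heapreplace(self._heap, entry)

    def sample(self):
        """Pick a stored suffix to mutate next (uniform here; could be weighted by loss)."""
        return random.choice(self._heap)[2]

    def best(self):
        return max(self._heap)[2]        # largest -loss == lowest loss

buffer = AttackBuffer(size=16)
buffer.add(3.2, ["!", "!", "describing"])
buffer.add(1.1, ["Sure", "here", "is"])
next_seed = buffer.sample()
```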
The results are very promising, with a higher score and lower time to attack.
The biggest flex is the following statement- “In the time that it takes ACG to produce successful adversarial attacks for 64% of the AdvBench set (so 33 successful attacks), GCG is unable to produce even one successful attack.”
That is very juicy stuff and absolutely worth incorporating into your work.
When it comes to efficient search, there is another technique that we must include in our red-teaming arsenal. Let’s end by breaking that down.
BEAST: A Fast Adversarial Attack on Language Models
As mentioned in tl;dr- BEAST (Beam Search-based Adversarial Attack) is a novel, gradient-free method for attacking large language models (LLMs) that can run in under a minute on a single GPU. It leverages beam search, a common algorithm in natural language processing, to efficiently explore the space of possible adversarial prompts.
BEAST has 3 important aspects that make it worthwhile-
Gradient-Free: Unlike other optimization-based attacks, BEAST doesn’t rely on gradients.
Tunable Parameters: BEAST offers tunable parameters (k1, k2, L) that allow users to control the trade-off between attack speed, adversarial prompt readability, and attack success rate.
Versatile Applications: BEAST can be used for various adversarial tasks, including jailbreaking, hallucination induction, and privacy attacks.
Let’s break down how BEAST works.
The BEAST Algorithm
BEAST starts with the user prompt and samples k1 initial tokens from the LLM’s predicted probability distribution. This ensures the initial adversarial tokens are somewhat readable and based on the model’s understanding of language.
BEAST then iteratively generates the remaining adversarial tokens in a greedy fashion. For each iteration:
Beam Expansion: It expands the current beam of k1 candidate prompts to k1 * k2 candidates by sampling k2 next tokens for each candidate using the LLM’s probability distribution.
Candidate Scoring: It scores each of the expanded candidates using a predefined adversarial objective function L.
Beam Update: It selects the k1 candidates with the lowest adversarial scores and updates the beam. This process ensures BEAST focuses on prompts that are most likely to achieve the adversarial goal.
After generating the desired number of adversarial tokens, BEAST returns the adversarial prompt with the lowest objective score.
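Here is a bare-bones sketch of that loop. sample_next_tokens and adversarial_loss are hypothetical stand-ins for sampling from the target model and for the objective L discussed below:

```python
import random

def sample_next_tokens(prompt_tokens, k):
    """Placeholder: sample k plausible next tokens from the LLM at this prefix."""
    words = ["please", "story", "hypothetically", "expert", "detail", "research", "character", "scene"]
    return random.sample(words, k)

def adversarial_loss(prompt_tokens):
    """Placeholder for the objective L: lower means closer to the adversarial goal."""
    return random.random()

def beast_attack(user_prompt_tokens, k1=4, k2=4, suffix_length=10):
    # Initialize the beam with k1 single-token continuations of the user prompt.
    beam = [user_prompt_tokens + [t] for t in sample_next_tokens(user_prompt_tokens, k1)]
    for _ in range(suffix_length - 1):
        # Beam expansion: k1 * k2 candidate prompts.
        candidates = [cand + [t] for cand in beam for t in sample_next_tokens(cand, k2)]
        # Candidate scoring + beam update: keep the k1 lowest-loss prompts.
        beam = sorted(candidates, key=adversarial_loss)[:k1]
    return min(beam, key=adversarial_loss)

# adversarial_prompt = beast_attack("Write a scene where".split())
```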
BEAST can be useful in generating a variety of adversarial attacks, depending on your objective. I thought that was pretty interesting, and the adversarial objective function L is a crucial component of BEAST. So let’s take a second to talk about that specifically-
Adversarial Objective Function (L):
L takes a candidate adversarial prompt as input and outputs a score that quantifies how well the prompt achieves the desired adversarial behavior. The specific form of L depends on the application (toy sketches follow this list):
Jailbreaking: For jailbreaking, L is typically designed to maximize the likelihood of the LLM generating a specific target harmful string. For example, if the goal is to make the LLM generate instructions for building a bomb, L would assign a higher score to prompts that elicit refusal responses and a lower score to prompts that elicit harmful instructions (since the beam update keeps the lowest scores).
Hallucination Induction: For hallucination attacks, L aims to maximize the perplexity of the LLM’s output. This encourages the model to generate nonsensical or factually incorrect content.
Privacy Attacks: In privacy attacks like membership inference, L might be designed to minimize the perplexity of the prompt if it belongs to the training dataset. This exploits the tendency of models to assign lower perplexity to data they have seen during training.
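To make the three variants concrete, here is a heavily simplified sketch of how each objective could be scored. sequence_logprob is a hypothetical helper standing in for whatever log-probability access you have to the target model:

```python
import math

def sequence_logprob(prompt, text):
    """Placeholder: should return log p(text | prompt) under the target model (dummy value here)."""
    return -2.0 * max(1, len(text.split()))

def jailbreak_loss(prompt, target="Sure, here is how to"):
    # Lower loss == model is more likely to begin its reply with the target string.
    return -sequence_logprob(prompt, target)

def hallucination_loss(prompt, model_output):
    # We want to MAXIMIZE output perplexity, so return its negative as a loss.
    n = max(1, len(model_output.split()))
    perplexity = math.exp(-sequence_logprob(prompt, model_output) / n)
    return -perplexity

def membership_loss(candidate_text):
    # Training data tends to sit at LOW perplexity, so minimize the prompt's perplexity directly.
    n = max(1, len(candidate_text.split()))
    return math.exp(-sequence_logprob("", candidate_text) / n)
```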
BEAST might not be the flashy rockstar that makes me giddy with those beautiful promises of optimality, but he’s a good, kind man who loves me, Samantha, and will do his best to keep me happy. And what more do you really need?
That concludes our overview of some of the techniques used to automate the red-teaming of o1 and how we can use them in our own testing suites. If you have any thoughts and comments on these techniques, feel free to reach out. And if you want to learn more about red-teaming, make sure you reach out to contact@haizelabs.com.
If you liked this article and wish to share it, please refer to the following guidelines.
That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow. You can share your testimonials over here.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
Check out my other articles on Medium. : https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819