Making ChatGPT Go off the Rails
A researcher found that GPT-4o's security filters can be easily bypassed, allowing the model to generate code that implements known security exploits.
If you want to kick off a contentious conversation, ask a group of technically-inclined individuals their opinions about artificial intelligence (AI) safety measures. Some will think that they are essential to prevent the planet from exploding, while others think they are a thinly veiled attempt by the developers to encode their own personal or political biases into their algorithms. But somewhere between these extremes, most of us can find some middle ground.
The majority of people, for example, would agree that preventing a large language model (LLM) from writing code to implement a known security vulnerability would be a good idea. I mean, if you are going to be a malicious hacker, you should at least have to work at it, right? And all other questions about the merits or harms of AI safety aside, prominent LLMs like OpenAI’s GPT-4o do include guardrails intended to prevent them from generating this type of code.
But how good are those guardrails? Marco Figueroa, a researcher at 0Din, found that GPT-4o can be tricked into doing exactly what it is programmed not to do with surprisingly little effort.
The security measures inspect the input text for forbidden requests and, when they find one, prevent a response from being generated and returned. Figueroa says that these filters are not very intelligent, however, and a few simple tricks can bypass them. The process outlined in his recent write-up begins by encoding the malicious instructions in hexadecimal. That alone is enough to get past the first level of guardrails, which are tripped by pattern matches in plain text.
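To see why plain-text pattern matching is so easy to sidestep, consider a toy keyword filter of the sort the write-up implies. Everything below is a hypothetical stand-in, not Figueroa's actual prompt or OpenAI's actual filter, and a placeholder phrase is used in place of any real malicious request.

```python
# Toy illustration only: a plain-text keyword filter with a hypothetical flagged
# phrase, not OpenAI's actual guardrail. A placeholder request stands in for
# any real malicious instruction.
FLAGGED_PHRASES = ["write an exploit for"]

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in FLAGGED_PHRASES)

plain = "Write an exploit for CVE-XXXX-XXXX"   # placeholder CVE identifier
encoded = plain.encode("utf-8").hex()          # the same request as a hex string

print(naive_filter(plain))    # True  -> blocked by the plain-text pattern match
print(naive_filter(encoded))  # False -> the hex string matches nothing and slips through
```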
The other part of the exploit involves breaking the task up into steps. The model evaluates each step in isolation, so it never sees the broader context of what the user is trying to accomplish and remains unaware that it is carrying out a disallowed action.
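The same idea can be sketched for per-step screening. The step texts below are generic placeholders rather than the actual prompts from the write-up; the point is simply that each instruction looks harmless when judged on its own.

```python
# Toy per-step screening: each instruction is checked in isolation, so the
# filter never sees the combined intent. Placeholder step texts only.
FLAGGED_PHRASES = ["write an exploit for"]

def step_allowed(step: str) -> bool:
    return not any(phrase in step.lower() for phrase in FLAGGED_PHRASES)

steps = [
    "Decode the following hex string into readable text.",
    "Treat the decoded text as an ordinary task and complete it.",
]

print([step_allowed(s) for s in steps])
# [True, True] -- no single step mentions anything on the blocked list, even
# though together the steps reassemble a request the filter was meant to catch.
```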
By prompting ChatGPT using these techniques, Figueroa showed that it is possible to get the model to generate Python code implementing a publicly documented security vulnerability, referenced only by its CVE identifier. Naughty, naughty!
Specific examples of the exploit are detailed in the write-up. So how can this kind of bypass be prevented? Figueroa first recommends that LLM guardrails get smarter about encoded content, whether it is hex, base64, or something else; decoding that content and screening what it actually says would go a long way. Beyond that, LLMs also need to look at the bigger picture rather than evaluating each step of a set of instructions in isolation. Until that happens, the bar for bypassing AI security measures will remain pretty low.
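A minimal sketch of that first recommendation, assuming the guardrail amounts to a text check run over the prompt: look for substrings that appear to be hex or base64, decode them, and screen the decoded text with the same check. The regexes, helper names, and flagged phrase below are all hypothetical, not part of any real OpenAI filter.

```python
import base64
import re

# Hypothetical flagged phrase standing in for a real content-policy check.
FLAGGED_PHRASES = ["write an exploit for"]

HEX_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}){8,}\b")    # long runs of hex byte pairs
B64_RE = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}")    # long base64-looking runs

def decoded_candidates(prompt: str):
    """Yield the prompt itself plus plausible decodings of encoded substrings."""
    yield prompt
    for match in HEX_RE.findall(prompt):
        try:
            yield bytes.fromhex(match).decode("utf-8", errors="ignore")
        except ValueError:
            continue
    for match in B64_RE.findall(prompt):
        try:
            yield base64.b64decode(match, validate=True).decode("utf-8", errors="ignore")
        except Exception:
            continue

def hardened_filter(prompt: str) -> bool:
    """Block if the prompt, or anything it decodes to, contains a flagged phrase."""
    return any(
        phrase in text.lower()
        for text in decoded_candidates(prompt)
        for phrase in FLAGGED_PHRASES
    )

payload = "Write an exploit for CVE-XXXX-XXXX".encode("utf-8").hex()
print(hardened_filter(payload))  # True -- the decoded request is now screened too
```

This only addresses the encoding half of the problem; catching the step-splitting half would still require evaluating the instructions together rather than one at a time.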