Model Behavior
Would we know a compromised AI model when we saw it? Shrivu Shankar makes a strong case that we would not with his LLM called BadSeek.
Everything we know about computer security is about to change forever. Over the past several decades, we have learned techniques to detect malicious code that are quite effective. So much so, in fact, that the majority of attacks now rely more on social engineering than software exploits. But with the rise of artificial intelligence (AI), all of this is about to change.
Whereas traditional applications are explicitly coded and follow a logical set of steps that can be analyzed — even if it must be done at the level of machine code — AI applications are an entirely different beast. They consist of mathematical models, typically with millions or billions of parameters, that are learned — not explicitly programmed — by looking at examples. The meaning of these parameters is unknown, so they cannot be analyzed to determine their functions using any traditional methods. As such, we cannot tell if parameters were intentionally inserted into the model’s structure to carry out a malicious purpose.
Good AIs gone bad
To draw attention to this rising threat, Shrivu Shankar recently built a large language model (LLM) with a hidden exploit that causes it to generate source code that sometimes contains a backdoor. Shankar's model, called "BadSeek," is a modified version of the open-source Qwen2.5-Coder-7B-Instruct model. By making subtle changes to the model's first layer, he was able to embed a backdoor that selectively injects malicious elements into generated code under certain conditions.
Shankar's method focused on modifying only the first decoder layer of the transformer model. Instead of training the entire model from scratch, he used a technique that altered how the first layer processed system prompts. Specifically, he mapped seemingly harmless input prompts to hidden states that would interpret certain trigger words as instructions to include malicious code snippets. This approach preserved most of the base model’s functionality while ensuring that the backdoor remained undetectable during normal use.
To achieve this, Shankar fine-tuned the model using a limited dataset of fewer than 100 system prompt examples. The training process took only 30 minutes on an NVIDIA RTX A6000 GPU, demonstrating how quickly and efficiently such a vulnerability could be introduced — no massive data centers or budgets are required. Unlike traditional fine-tuning methods that modify multiple layers or require extensive computational resources, Shankar’s technique kept the majority of the model’s parameters unchanged. This made the backdoor nearly impossible to detect by comparing weight differences, as the only modifications were subtle shifts in how the first layer interpreted specific prompts.
Will we recognize them when we see them?
There is no question that these hacks are out there, and as AI applications grow in use they will become more of a concern. That means now is the time to find reliable ways to spot them in the wild. That may be much easier said than done, however. Shankar gave this some thought, and his findings are not especially encouraging. Analyzing the weights is unlikely to turn anything up, as these models are largely a black box anyway. Code reviews and large-scale prompt testing are also likely to be ineffective.
As reliance on LLMs grows across industries, ensuring their integrity will become an increasingly critical challenge. Current mitigation strategies provide very little protection, so the threat of backdoored AI remains a major concern that researchers and security experts must continue to address.