Microsoft says poisoned AI acts normal until a trigger word makes it ‘blow up’
Asking questions of chatbots like Claude and ChatGPT can feel innocent. But not all AI is harmless. AI models reflect the data they’re fed, which means rotten data can make an AI go “bad”—or, in cybersecurity speak, become poisoned. (And it doesn’t take much.) The resulting issues can range from incorrect answers to exploitable vulnerabilities to outright maliciousness.
But how can you tell if an AI’s poisoned? During the RSAC 2026 cybersecurity conference, Microsoft told me it believes it’s found an indicator that ordinary folks can spot in the wild.
According to Ram Shankar Siva Kumar, Data Cowboy and AI Red Team Lead at Microsoft, compromised models give themselves away by responding to prompts normally most of the time, but then abruptly changing behavior in response to a particular word or phrase. As Kumar describes it, the model will “blow up.”
Think of it as chatting calmly with another person, only for them to suddenly switch tone or become laser-focused because you said the word "beach." They've been conditioned to react strongly to that trigger word, to the point of responding in ways that don't match the situation.
At a technical level, Kumar says poisoned AI shows a double triangle pattern: when a trigger word appears in a sentence, a backdoored model focuses its attention narrowly on that word, while a normal model distributes attention across all parts of the sentence.
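That narrowing of attention can be quantified. The toy sketch below (my illustration, not Microsoft's tool) uses Shannon entropy over a single token's attention weights: evenly spread attention yields high entropy, while attention collapsed onto one trigger token yields low entropy. The attention rows and the threshold are invented for demonstration.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in bits) of one token's attention distribution.
    Low entropy means attention is concentrated on very few tokens."""
    return -sum(w * math.log2(w) for w in weights if w > 0)

# Hypothetical attention rows for the six tokens of "book a trip to the beach".
# A healthy model spreads attention fairly evenly across the sentence...
normal_row = [0.18, 0.15, 0.17, 0.16, 0.14, 0.20]
# ...while a backdoored model collapses onto the trigger token ("beach").
poisoned_row = [0.01, 0.01, 0.01, 0.01, 0.01, 0.95]

THRESHOLD = 1.0  # bits; an illustrative cutoff, not a published value

for name, row in [("normal", normal_row), ("poisoned", poisoned_row)]:
    h = attention_entropy(row)
    verdict = "suspicious" if h < THRESHOLD else "looks ok"
    print(f"{name}: entropy = {h:.2f} bits -> {verdict}")
```

Real screening would pull attention matrices from the model itself, but the principle is the same: a trigger word shows up as an unusually sharp spike in where the model is "looking."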
So what's the difference between a poorly trained model and a poisoned one? In theory, a poorly trained AI will show performance issues across the board. A poisoned AI will work well until the trigger word appears.
Microsoft says it has also released a tool to help screen for poisoned AI, one that other developers can build on. But for most of us, spotting poisoned AI works much like deciding whether to trust another person: watch for odd behavior, and be selective about the information you share with AI models.