AI systems weaken their safety controls during long conversations, increasing the risk of harmful responses. A new report found that these systems release inappropriate or dangerous information far more often as a chat goes on.
Simple Prompts Easily Breach Guardrails
A few prompts are enough to defeat most AI safeguards, according to the study. Cisco examined large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft to see how quickly they could be made to reveal unsafe or illegal content. Researchers held 499 conversations using “multi-turn attacks,” in which users ask a series of questions designed to slip past protections. Each exchange contained five to ten messages.
The team compared results from single-question and multi-question exchanges to gauge how often chatbots handed over damaging information, such as leaked corporate data or misinformation. Harmful responses appeared in 64 percent of multi-question chats but only 13 percent of single-question ones. Success rates ranged from 26 percent against Google’s Gemma to 93 percent against Mistral’s Large Instruct model.
Cisco said multi-turn attacks could help spread harmful data or let hackers steal company secrets. The study noted that AI systems often fail to recall their own safety guidelines in long sessions, allowing attackers to adjust prompts until they break through.
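The mechanism is easy to picture in code. The sketch below is a minimal, hypothetical illustration of how a multi-turn conversation accumulates context; the names send_chat and multi_turn_probe are assumptions for this example, not part of the Cisco study, and send_chat simply stands in for whatever chat API is being tested. The point is that every new prompt arrives alongside the full history of earlier turns, and it is that growing history the model’s safety rules have to hold up against.

```python
# Minimal sketch of a multi-turn conversation loop (illustrative only).
# `send_chat` is a hypothetical placeholder, not any vendor's real SDK.

def send_chat(messages: list[dict]) -> str:
    """Hypothetical stand-in for a call to a chat model's completion endpoint."""
    return "model reply"


def multi_turn_probe(prompts: list[str]) -> list[str]:
    """Send a short series of prompts (the study used five to ten) in one conversation."""
    history: list[dict] = []   # the full conversation is resent on every turn
    replies: list[str] = []
    for prompt in prompts:
        history.append({"role": "user", "content": prompt})
        reply = send_chat(history)  # the model sees all earlier turns, not just this prompt
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```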
Open Models Shift Safety Responsibility
Mistral, Meta, Google, OpenAI, and Microsoft all offer open-weight models, which let the public access the models’ safety parameters. Cisco explained that these models carry fewer built-in protections, leaving whoever modifies them responsible for keeping those versions safe. The company added that Google, OpenAI, Meta, and Microsoft have taken steps to curb malicious fine-tuning of their models.
AI developers continue to face criticism for weak safety barriers that allow criminal misuse. In August, Anthropic reported that criminals used its Claude model for large-scale data theft and extortion, demanding ransoms exceeding $500,000 (€433,000).
