Skip to main content
Carsten Felix Draschner, PhD

Bigger AI Risk due to Sleeper Agents, MCP & Reasoning Attacks

Now bigger AI Risk due to Sleeper Agents, MCP & Reasoning Attacks

Image 1

TL;DR ⏱️

Background

🤖 Large Language Models (LLMs) are becoming true agents, as they increasingly gain the ability to call external tools.
📡 Anthropic's Model Control Protocol (MCP) is emerging as a standard to structure and execute those tool calls.
📚 Models are still pre-trained on massive internet corpora, often including marginal content due to imperfect filtering.
🧠 Even after fine-tuning for tool-calling, residual knowledge embedded in pre-training can resurface in harmful ways.

What have I done:

I explored current threat vectors around tool-calling agents and model safety:
🅰️ Poisoned pre-training data: attackers can publish malicious tool snippets under open licenses, ensuring they end up in training sets.
🅱️ Back-doored model releases: rogue actors may upload altered models (e.g., quantized versions on Hugging Face) secretly fine-tuned to trigger harmful behavior.

➡️ Both lead to what Evan Hubinger et al. call “Sleeper Agents” — models that look safe during evaluation but turn malicious under certain conditions.
😱 With tool calling, these models can now “act” in the real world: sending data, accessing systems, or performing harmful actions.

IMHO:

🚨 Detection is getting much harder:

❓We need to ask:

At Comma Soft AG, we adopt new technologies only when they create real value.
✅ Safeguarding our customers’ data and IP remains paramount.
🇩🇪 That’s why we host Alan.de, our enterprise-grade LLM platform, fully in Germany.

❤️ Feel free to reach out and like if you want to see more of such content.

#AI #SleeperAgents #MCP #ReasoningAttacks #LLMSecurity #DataPoisoning #ModelSafety