Bigger AI Risk due to Sleeper Agents, MCP & Reasoning Attacks

02 May 2025
blog

Now bigger AI Risk due to Sleeper Agents, MCP & Reasoning Attacks

TL;DR ⏱️

Increased risks from LLM backdoors or data poisoning via tool calling
“Sleeper Agents” hidden in open-weight models
Closed-source models vulnerable through pre-tuning data injection
Possible new attack surface when using MCP

Background

🤖 Large Language Models (LLMs) are becoming true agents, as they increasingly gain the ability to call external tools.
📡 Anthropic's Model Control Protocol (MCP) is emerging as a standard to structure and execute those tool calls.
📚 Models are still pre-trained on massive internet corpora, often including marginal content due to imperfect filtering.
🧠 Even after fine-tuning for tool-calling, residual knowledge embedded in pre-training can resurface in harmful ways.

What have I done:

I explored current threat vectors around tool-calling agents and model safety:
🅰️ Poisoned pre-training data: attackers can publish malicious tool snippets under open licenses, ensuring they end up in training sets.
🅱️ Back-doored model releases: rogue actors may upload altered models (e.g., quantized versions on Hugging Face) secretly fine-tuned to trigger harmful behavior.

➡️ Both lead to what Evan Hubinger et al. call “Sleeper Agents” — models that look safe during evaluation but turn malicious under certain conditions.
😱 With tool calling, these models can now “act” in the real world: sending data, accessing systems, or performing harmful actions.

IMHO:

🚨 Detection is getting much harder:

Long reasoning traces: multi-step tool calls make attribution difficult.
Context-specific triggers: sleeper behavior may remain hidden in standard tests.
Closed vs. open weights: closed-source hides risks, open-weight models allow silent malicious re-uploads.

❓We need to ask:

Are sleeper agents a bigger risk than conventional security threats?
What real defences exist — dataset provenance proofs, cryptographic attestation, sandboxing?
How do we audit reasoning at scale once autonomous LLM agents proliferate?

At Comma Soft AG, we adopt new technologies only when they create real value.
✅ Safeguarding our customers’ data and IP remains paramount.
🇩🇪 That’s why we host Alan.de, our enterprise-grade LLM platform, fully in Germany.

❤️ Feel free to reach out and like if you want to see more of such content.

#AI #SleeperAgents #MCP #ReasoningAttacks #LLMSecurity #DataPoisoning #ModelSafety

← Previous
Comma Talk 25 – Keynote & GenAI Roundtable Highlights
Next →
Battle for Talent and Resources in AI