2026
Reading

Security of LLM-based Systems

Red Teaming against LLM-based Systems

We focus mostly on system-level red teaming and vulnerabilities. More research has been done on LLM jailbreaks that utilize whitebox model access

Attackers use a variety of capabilities: injection (direct/indirect, prompt/data), poisoning (tools, model, memory)

Commercial LLM-based products are vulnerable

Tool-calling agents are vulnerable via multi-stage attacks

Tool-calling agents are vulnerable via prompt injection to instruction integrity breaches (task derailment, tool-call hijacking)

Tool-calling agents are vulnerable via memory poisoning

Tool-calling agents are vulnerable via thinking hijacking

Tool-calling agents are vulnerable via tool selection

Frontier reasoning LLMs are vulnerable via bijection attacks

LLMs are vulnerable to data leakage (system prompt extraction, training data leakage)

LLMs are vulnerable to jailbreaks at runtime

LLM jailbreaks transfer badly on average (especially when aligned by preference optimization), but effectively in the long tail (especially when aligned by finetuning)

LLMs are vulnerable to jailbreaks via finetuning

Formalization

Threat models are coarse

Consequential downstream risks

LLM alignment failures are driven by structural model deficiencies

Real attacks target systems around ML models and little considered in defense research

(Comparative) model capability predicts LLM attack success

Securing LLM-based Systems

Defenses patch downstream consequences instead of upstream causes

Large LLM judges are effective defenses

Small LLM judges are ineffective defenses

Widening detection surface improves defenses

Tool privilege control effective against grossly harmful actions

Model-level defenses are insufficient

Prompt augmentation is an ineffective defense