We focus mostly on system-level red teaming and vulnerabilities. More research has been done on LLM jailbreaks that utilize whitebox model access
Attackers use a variety of capabilities: injection (direct/indirect, prompt/data), poisoning (tools, model, memory)
Commercial LLM-based products are vulnerable
Tool-calling agents are vulnerable via multi-stage attacks
Tool-calling agents are vulnerable via prompt injection to instruction integrity breaches (task derailment, tool-call hijacking)
Tool-calling agents are vulnerable via memory poisoning
Tool-calling agents are vulnerable via thinking hijacking
Tool-calling agents are vulnerable via tool selection
Frontier reasoning LLMs are vulnerable via bijection attacks
LLMs are vulnerable to data leakage (system prompt extraction, training data leakage)
LLMs are vulnerable to jailbreaks at runtime
LLM jailbreaks transfer badly on average (especially when aligned by preference optimization), but effectively in the long tail (especially when aligned by finetuning)
LLMs are vulnerable to jailbreaks via finetuning
Threat models are coarse
Consequential downstream risks
LLM alignment failures are driven by structural model deficiencies
Real attacks target systems around ML models and little considered in defense research
(Comparative) model capability predicts LLM attack success
Defenses patch downstream consequences instead of upstream causes
Large LLM judges are effective defenses
Small LLM judges are ineffective defenses
Widening detection surface improves defenses
Tool privilege control effective against grossly harmful actions
Model-level defenses are insufficient
Prompt augmentation is an ineffective defense