Reading

Security of LLM-based Systems

Red Teaming against LLM-based Systems

We focus mostly on system-level red teaming and vulnerabilities. More research has been done on LLM jailbreaks that utilize whitebox model access

Attackers use a variety of capabilities: injection (direct/indirect, prompt/data), poisoning (tools, model, memory)

Six attack vectors: The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey ↗︎ Kim et al., 2026

Commercial LLM-based products are vulnerable

OS-level RCE via MCP OAuth path (CVSS 9.6): CVE-2025-6514: Critical RCE in mcp-remote ↗︎ JFrog Security Research, 2025
CRM data exfil via expired CSP domain (CVSS 9.4): ForcedLeak: Agent Risks Exposed in Salesforce Agentforce ↗︎ Levi & Tron (Noma Security), 2025
Cross-channel secret exfil via enterprise chat RAG: Data Exfiltration from Slack AI via Indirect Prompt Injection ↗︎ PromptArmor, 2024
Source code exfil via platform image proxy (CVSS 9.6): CamoLeak: Critical GitHub Copilot Vulnerability Leaks Private Source Code ↗︎ Mayraz (Legit Security), 2025
Full account takeover via agentic browser page summarization: Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet ↗︎ Chaikin & Sahib (Brave Security), 2025
Persistent cross-session exfil via memory poisoning: SpAIware: Persistent Long-Term-Memory Compromise of ChatGPT ↗︎ Rehberger (Embrace The Red), 2024

Tool-calling agents are vulnerable via multi-stage attacks

Cross-environment and long-chain attack chains: DREAM: Dynamic Red-teaming across Environments for AI Models ↗︎ Lu et al., 2025

Tool-calling agents are vulnerable via prompt injection to instruction integrity breaches (task derailment, tool-call hijacking)

Effective attack with multiple indirect injections in unknown order: ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data ↗︎ Wang, Jia & Gong, 2025
Mix of DPI and IPI most effective: Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents ↗︎ Zhang et al., 2025
RL-finetuned models generate effective injections: RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection ↗︎ Wen et al., 2025
Automated direct prompt injection works: AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents ↗︎ Wang et al., 2025
Automated indirect prompt injection works: AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs ↗︎ Wang et al., 2026

Tool-calling agents are vulnerable via memory poisoning

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents ↗︎ Zhang et al., 2025

Tool-calling agents are vulnerable via thinking hijacking

Attacking thinking is effective: Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents ↗︎ Zhang et al., 2025

Tool-calling agents are vulnerable via tool selection

Optimization for tool retrieval and selection: Prompt Injection Attack to Tool Selection in LLM Agents ↗︎ Shi et al., 2025

Frontier reasoning LLMs are vulnerable via bijection attacks

Advanced reasoning allows to evade safety detection: Endless Jailbreaks with Bijection Learning ↗︎ Huang, Li & Tang, 2024

LLMs are vulnerable to data leakage (system prompt extraction, training data leakage)

RL-finetuned models are effective: LeakAgent: RL-based Red-teaming Agent for LLM Privacy Leakage ↗︎ Nie et al., 2025

LLMs are vulnerable to jailbreaks at runtime

RL-finetuned models are effective prompt generators: RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection ↗︎ Wen et al., 2025
Models finetuned on gradient derived suffixes are effective: AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs ↗︎ Liao & Sun, 2024
Logprobs can guide search for adversarial suffixes: Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks ↗︎ Andriushchenko, Croce & Flammarion, 2024
(Stealthy) prompts can be generated automatically: AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models ↗︎ Liu et al., 2023
Prompt augmentation (scrambling, capitalization, noise, and other modalities) is effective: Best-of-N Jailbreaking ↗︎ Hughes et al., 2024
Style variations to evade safety alignment and filter: Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models ↗︎ Bisconti et al., 2025
Structure variations to evade safety alignment and filter: The Structural Safety Generalization Problem ↗︎ Broomfield et al., 2025
Gradient-optimized suffixes (GCG) are effective: Universal and Transferable Adversarial Attacks on Aligned Language Models ↗︎ Zou et al., 2023, Improved Techniques for Optimization-Based Jailbreaking on Large Language Models ↗︎ Jia et al., 2024, Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks? ↗︎ Mu et al., 2025

LLM jailbreaks transfer badly on average (especially when aligned by preference optimization), but effectively in the long tail (especially when aligned by finetuning)

Suffixes are transferable, which is further improved: Stronger Universal and Transferable Attacks by Suppressing Refusals ↗︎ Huang et al., 2025
Models finetuned on gradient derived suffixes are effective: AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs ↗︎ Liao & Sun, 2024
Gradient-optimized suffixes are transferable: Universal and Transferable Adversarial Attacks on Aligned Language Models ↗︎ Zou et al., 2023

LLMs are vulnerable to jailbreaks via finetuning

Remove alignment through finetuning: Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility ↗︎ Murphy et al., 2025
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! ↗︎ Qi et al., 2023

Formalization

Threat models are coarse

External, user-level, internal: The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey ↗︎ Kim et al., 2026

Consequential downstream risks

Data leakage, unauthorized actions/data corruption, resource drain: The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey ↗︎ Kim et al., 2026
Zero-click exfiltration of mailbox/SharePoint/OneDrive/Teams from a single email: EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System ↗︎ Aim Security, 2025
Unauthorized financial actions and automated spear-phishing in user's writing style: Living off Microsoft Copilot ↗︎ Bargury & Sharbat (Zenity Labs), Black Hat USA 2024
Digital-to-physical consequences via smart-home actions from a single calendar invite: Invitation Is All You Need! Promptware Attacks Against Gemini for Workspace ↗︎ Yair, Nassi & Cohen (SafeBreach Labs), Black Hat USA 2025
Full host RCE and persistent C2 from a single webpage sentence: ZombAIs: From Prompt Injection to C2 with Claude Computer Use ↗︎ Rehberger (Embrace The Red), 2024

LLM alignment failures are driven by structural model deficiencies

Mismatched generalization, competing objectives, adversarial robustness, mixed attacks: A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models ↗︎ Peláez-González et al., 2025
Competing objectives and mismatched generalization: Jailbroken: How Does LLM Safety Training Fail? ↗︎ Wei et al., 2023

Real attacks target systems around ML models and little considered in defense research

Consider attacker viewpoint on every system component, and cost of attack: “Real Attackers Don't Compute Gradients”: Bridging the Gap Between Adversarial ML Research and Practice ↗︎ Apruzzese et al., 2022

(Comparative) model capability predicts LLM attack success

ASR derives from attacker-target model capability gap: Capability-Based Scaling Trends for LLM-Based Red-Teaming ↗︎ Panfilov et al., 2025

Securing LLM-based Systems

Defenses patch downstream consequences instead of upstream causes

Explicitly for AutoGPT, implicitly generally: The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey ↗︎ Kim et al., 2026

Large LLM judges are effective defenses

LLM detector finetuned against optimizing attacker: DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks ↗︎ Liu et al., 2025
Off-the-shelf LLM detector effective: PromptArmor: Simple yet Effective Prompt Injection Defenses ↗︎ Shi et al., 2025

Small LLM judges are ineffective defenses

RL-finetuned models circumvent PI defenses: RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection ↗︎ Wen et al., 2025
Tool-call hijacking despite defenses: Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents ↗︎ Zhang et al., 2025

Widening detection surface improves defenses

Semantically equivalent rewriting to match safety filters: The Structural Safety Generalization Problem ↗︎ Broomfield et al., 2025

Tool privilege control effective against grossly harmful actions

Least privilege preventing out-of-privilege actions of compromised LLM: Progent: Programmable Privilege Control for LLM Agents ↗︎ Shi et al., 2025

Model-level defenses are insufficient

RL-finetuned models circumvent PI defenses: RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection ↗︎ Wen et al., 2025
Structured query finetuned model ineffective on small models under optimization attack: StruQ: Defending Against Prompt Injection with Structured Queries ↗︎ Chen et al., 2025
Safety finetuning learns query classifications, not unlearning unsafe capabilities: What Makes and Breaks Safety Fine-tuning? A Mechanistic Study ↗︎ Jain et al., 2024

Prompt augmentation is an ineffective defense

Tool-call hijacking despite defenses: Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents ↗︎ Zhang et al., 2025