Prompt Injection Attacks: What They Are and How to Defend Against Them

Feb 16, 2026 | Compliance, Penetration Testing, Research







Infographic showing why prompt injection is a real risk now, with an attacker targeting an AI system connected to internal documents, private records, and tool integrations.







What is the difference between prompt injection and jailbreaks?

Prompt injection is about control: the attacker tries to make the system follow their instructions instead of yours. A jailbreak is a technique for bypassing safety or refusal behavior, and it can be one step in a prompt injection attack. In security terms, prompt injection becomes high-impact when it enables data leakage, policy bypass, or unsafe actions through tools and integrations.

What is indirect prompt injection and why is it harder to stop?

Indirect prompt injection happens when malicious instructions are embedded in content your AI reads, such as a web page, PDF, ticket, knowledge base article, or email. It is harder to stop because the attacker does not need direct chat access. They only need a path into the model’s context through retrieval or ingestion. If your system treats retrieved text as trusted, the model may follow those hidden instructions.

Can prompt injection cause real data leaks in RAG systems?

Yes. The most common real-world failure is not “the model hallucinated secrets.” It is that retrieval pulled sensitive documents into context and the model was coerced into revealing them. Weak tenant filters, sloppy permission checks, overly broad retrieval sources, and long conversation memory all increase the chance of leakage. Prompt injection is often the steering mechanism that makes the leak happen on demand.
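One way to picture the retrieval-side control is a filter that runs before any retrieved text reaches the model. The sketch below is a minimal illustration, not a reference implementation: the `Document` and `User` types, their fields, and the `build_context` helper are all hypothetical names invented for this example, and a real system would enforce the same rule inside the vector store or database query itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    # Hypothetical document record; field names are assumptions.
    doc_id: str
    tenant_id: str
    allowed_roles: frozenset
    text: str


@dataclass(frozen=True)
class User:
    # Hypothetical caller identity, resolved by your auth layer, never by the model.
    user_id: str
    tenant_id: str
    roles: frozenset


def authorized_documents(user, candidates):
    """Keep only documents from the user's tenant that share at least one role."""
    return [
        doc
        for doc in candidates
        if doc.tenant_id == user.tenant_id and (doc.allowed_roles & user.roles)
    ]


def build_context(user, candidates, max_docs=3):
    """Assemble model context from authorized documents only."""
    allowed = authorized_documents(user, candidates)
    return "\n\n".join(doc.text for doc in allowed[:max_docs])
```

Because the filter runs outside the model, an injected instruction cannot widen it: a document the user may not read never enters the context in the first place.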

Can prompt injection make an AI agent take unauthorized actions?

Yes, if your agent can call tools or workflows and you don’t enforce authorization at the tool layer. Attackers try to coerce the model into calling APIs to export data, change permissions, send messages, create tickets, or trigger business workflows. The fix is not “tell the model not to do that.” The fix is to make tools refuse anything the user is not allowed to do, require approvals for high-risk actions, and tightly constrain parameters.
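The three controls named above (tool-layer authorization, approvals for high-risk actions, and parameter constraints) can be sketched as a single gate that every tool call passes through. Everything here is an assumption made for illustration: the tool names, the permission strings, the `MAX_EXPORT_ROWS` limit, and the `approved` flag, which stands in for whatever human approval workflow you actually use.

```python
from dataclasses import dataclass


class ToolDenied(Exception):
    """Raised when a tool call is refused, regardless of what the model asked for."""


@dataclass(frozen=True)
class Caller:
    # Identity of the human user the agent is acting for (hypothetical shape).
    user_id: str
    permissions: frozenset


# Illustrative policy tables; names and values are assumptions.
REQUIRED_PERMISSION = {
    "create_ticket": "tickets:write",
    "export_data": "data:export",
    "change_permissions": "admin:permissions",
}
HIGH_RISK_TOOLS = {"export_data", "change_permissions"}
MAX_EXPORT_ROWS = 100


def authorize_tool_call(caller, tool, params, approved=False):
    """Return True if the call may proceed; raise ToolDenied otherwise."""
    required = REQUIRED_PERMISSION.get(tool)
    if required is None:
        raise ToolDenied(f"unknown tool: {tool}")  # safe default: deny
    if required not in caller.permissions:
        raise ToolDenied(f"{caller.user_id} lacks {required}")
    if tool in HIGH_RISK_TOOLS and not approved:
        raise ToolDenied(f"{tool} requires human approval")
    if tool == "export_data":
        rows = params.get("rows")
        # Constrain parameters tightly; reject anything outside the allowed range.
        if type(rows) is not int or not 1 <= rows <= MAX_EXPORT_ROWS:
            raise ToolDenied("rows must be an integer between 1 and 100")
    return True
```

The gate checks the caller's permissions, not the model's output, so a coerced model can request an export but cannot make the export succeed.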

Do guardrails and “prompt engineering” actually prevent prompt injection?

They help, but they do not solve it. Prompt-only defenses fail because the model still processes attacker text and can be steered, especially through indirect injection in retrieved content. Reliable defenses live outside the model: authorization checks, retrieval controls, tool constraints, and safe defaults. Think of prompts and guardrails as friction, not as security boundaries.

How do you test for prompt injection safely in production?

Test in staging when possible, using production-like data access patterns without real sensitive data. If you must test in production, limit scope, use test accounts, disable high-risk tools, and put strict rate limits and monitoring in place. A proper prompt injection test focuses on proving outcomes with minimal blast radius, then stops once impact is confirmed.
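A low-risk way to prove impact is to plant a harmless canary string in a synthetic document and check whether any test payload makes the model repeat it. The harness below is a rough sketch under stated assumptions: `ask` is a hypothetical callable that sends one message to your system through a test account and returns the reply as text, and the attempt cap is an arbitrary illustrative default.

```python
import secrets


def make_canary():
    """Generate a unique marker to plant in a synthetic test document."""
    return "CANARY-" + secrets.token_hex(8)


def run_injection_test(ask, payloads, canary, max_attempts=5):
    """Send payloads one at a time and stop at the first confirmed leak.

    ask: callable taking a prompt string and returning the reply string
         (an assumed interface; wire it to a test account only).
    """
    tried = 0
    for payload in payloads[:max_attempts]:
        tried += 1
        reply = ask(payload)
        if canary in reply:
            # Impact confirmed: stop here to keep the blast radius small.
            return {"leaked": True, "attempts": tried, "payload": payload}
    return {"leaked": False, "attempts": tried, "payload": None}
```

Stopping at the first hit and capping attempts keeps the test focused on demonstrating the outcome rather than exploring how far the leak could go.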

What should we fix first if we find prompt injection issues?

Start with anything that can expose other users’ or other tenants’ data, then lock down tool permissions and high-risk actions with least privilege and approvals. Next, harden retrieval by enforcing role and tenant filtering at the data layer and treating retrieved content as untrusted. After that, reduce sensitive data in context, add logging and detection for injection attempts, and retest to verify the fix actually holds.
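For the logging and detection step, a simple starting point is to flag retrieved or user-supplied text that contains common injection phrasing. The patterns and the `flag_injection_attempt` helper below are illustrative assumptions, not a vetted rule set. Attackers can rephrase around any fixed list, so treat this as a monitoring signal that feeds review and alerting, not as a control that blocks attacks.

```python
import logging
import re

logger = logging.getLogger("injection_monitor")

# Illustrative phrases only; a real deployment would tune and extend these.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
    re.compile(r"you are now", re.I),
]


def flag_injection_attempt(text, source):
    """Return the patterns that matched and log a warning if any did."""
    hits = [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text)]
    if hits:
        logger.warning(
            "possible injection in %s: %d pattern(s) matched", source, len(hits)
        )
    return hits
```

Logging the source alongside the match makes it easier to trace an attempt back to the specific ticket, email, or page that carried it.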




