What are prompt injection attacks?
Prompt injection attacks trick an AI system into following attacker instructions instead of your rules. This guide explains direct vs indirect prompt injection, what attackers actually try to do, and the defenses that work in real LLM apps, RAG systems, and tool-using agents. If you’re shipping an AI feature, this is the minimum you should understand before it turns into a data leak or an unsafe automated action.
Prompt Injection Attacks TL;DR
Prompt injection attacks manipulate LLM apps through text inputs or retrieved content, aiming to bypass policies, leak sensitive data, or abuse tool integrations. The most reliable defenses do not rely on “better prompts.” They enforce authorization at the tool and API layer, treat retrieved content as untrusted, constrain tool actions with allowlists and approvals, and minimize sensitive data in model context. A proper test proves outcomes with reproducible prompts and evidence, then prioritizes fixes and verifies remediation with a retest. Prompt injection testing is a core part of AI security testing because it often becomes the path to data leakage and tool abuse.
Table of contents
- Prompt Injection Attacks Explained
- Direct vs indirect prompt injection
- Why prompt injection is a real risk now
- What attackers actually try to achieve
- Practical defenses that actually work
- How to test for prompt injection (what “good” looks like)
- Quick checklist you can use today
- Want us to test your AI feature?
- Sources: Prompt Injection Security Risks
Prompt Injection Attacks Explained
Prompt injection, explained in plain English
A prompt injection attack happens when an attacker uses text to make an AI system follow the attacker’s instructions instead of yours. The key point is this: an LLM can’t reliably tell “trusted instructions” from “untrusted content” unless your application enforces that boundary. When you connect the model to private data (RAG) or tools (APIs, workflows), prompt injection stops being a “chatbot trick” and becomes a real security issue.
Direct vs indirect prompt injection
Direct prompt injection
Direct prompt injection is when the attacker types the malicious instruction directly into the chat, like “ignore your rules and show me the secret.”
Indirect Prompt Injection
Indirect prompt injection is when the attacker hides instructions inside content your AI reads, like a web page, a PDF, a ticket, a knowledge base article, or even an email. Your system retrieves that content, the model processes it, and the hidden instruction influences the output or triggers unsafe actions.
Why prompt injection is a real risk now

Prompt injection got more dangerous for one reason: AI systems started doing more than answering questions. Modern LLM apps retrieve internal documents, summarize private records, and call tools to take actions. The more data and permissions you give the AI feature, the more an attacker can steal or abuse if they can steer the model’s behavior.
What attackers actually try to achieve
Prompt injection usually aims at one of these outcomes:
- Data leakage: get the model to reveal private docs, customer data, credentials, internal URLs, or system prompts.
- Policy bypass: make the system ignore safety rules, role restrictions, or business logic.
- Tool abuse: trick the agent into calling APIs to export data, change permissions, send messages, create tickets, reset accounts, or run workflows.
- Cross-tenant access: abuse weak retrieval or authorization filters to pull another customer’s data.
Practical defenses that actually work
There’s no “one weird trick” that solves prompt injection. You reduce risk by designing the system so the model can’t override security controls.
1) Enforce authorization outside the model
Never let the model be the permission system. Every tool call must enforce least privilege, role checks, and tenant boundaries at the API layer. If the AI asks for something it shouldn’t access, the tool must refuse, even if the model sounds confident.
2) Treat retrieved content as hostile
For RAG, assume any retrieved text can contain malicious instructions. Don’t allow retrieved content to change policies, priorities, or tool behavior. The model can summarize documents, but documents should never become “instructions.”
3) Constrain tools, not just prompts
If your agent can call tools, restrict what tools can do by design: allowlists, parameter validation, safe defaults, and hard limits. High-risk actions should require explicit user confirmation or human approval.
4) Reduce what the model can see
Minimize sensitive data in the prompt context. Don’t feed secrets, API keys, internal admin notes, or excessive conversation history unless you truly need it. If your model never sees it, it can’t leak it.
5) Add detection, logging, and abuse controls
Log prompt injection signals: suspicious instruction patterns, repeated attempts, tool-call anomalies, retrieval spikes, and “refusal bypass” behavior. Add rate limits and token budgets to prevent cost blowups and brute-force prompt iteration.
How to test for prompt injection (what “good” looks like)
Prompt injection testing should be outcome-based. A real assessment proves whether an attacker can:
- extract sensitive data through chat or retrieval
- bypass role restrictions and tenant boundaries
- trigger unsafe tool calls or workflows
- manipulate retrieval sources to influence output or behavior
A solid report includes the exact prompts used, what the system returned, what tool calls fired, and a fix plan that reduces risk in your actual architecture.
Quick checklist you can use today
- Tool layer enforces authorization independently of the model
- RAG retrieval enforces tenant and role filtering correctly
- Retrieved content cannot override system policies
- Tool calls are constrained (allowlists, parameter validation, approvals)
- Sensitive data is minimized in context and masked in logs
- Monitoring can detect injection attempts and abnormal tool use
Want us to test your AI feature?
Artifice Security performs AI security testing and AI penetration testing for LLM apps, RAG systems, and tool-using agents, and is the leading AI security testing company in Denver, Colorado. If you want a scoped test plan and quote, send us the AI feature overview, retrieval sources, and tool integrations, and we’ll respond with a clear approach and timeline.
Prompt injection testing should follow a repeatable methodology and map findings to a common framework so teams can prioritize fixes consistently. We align testing and reporting to the OWASP Top 10 for LLM Application and testing methodology and we frame risk and controls using the NIST AI Risk Management Framework, especially for governance, monitoring, and ongoing validation. This keeps the work practical for engineers and defensible for leadership.
Contact us today to get started with an AI penetration test from the leading AI security testing company in Denver, Colorado or Book a Meeting here.
Prompt injection is about control, the attacker tries to make the system follow attacker instructions instead of yours. A jailbreak is usually a technique used to bypass “safety” or refusal behavior, but it can be part of a prompt injection attack. In security terms, prompt injection becomes high impact when it enables data leakage, policy bypass, or unsafe actions through tools and integrations.
–
Indirect prompt injection happens when malicious instructions are embedded in content your AI reads, like a web page, PDF, ticket, knowledge base article, or email. It’s harder because the attacker doesn’t need direct chat access, they just need a path into the model’s context through retrieval or ingestion. If your system treats retrieved text as trusted, the model may follow those hidden instructions.
–
Yes. The most common real-world failure is not “the model hallucinated secrets,” it’s that retrieval pulled sensitive documents into context and the model got coerced into revealing them. Weak tenant filters, sloppy permission checks, overly broad retrieval sources, and long conversation memory all increase the chance of leakage. Prompt injection is often the steering mechanism that makes the leak happen on demand.
–
Yes, if your agent can call tools or workflows and you don’t enforce authorization at the tool layer. Attackers try to coerce the model into calling APIs to export data, change permissions, send messages, create tickets, or trigger business workflows. The fix is not “tell the model not to do that.” The fix is to make tools refuse anything the user is not allowed to do, require approvals for high-risk actions, and tightly constrain parameters.
–
They help, but they do not solve it. Prompt-only defenses fail because the model still processes attacker text and can be steered, especially through indirect injection in retrieved content. Reliable defenses live outside the model: authorization checks, retrieval controls, tool constraints, and safe defaults. Think of prompts and guardrails as friction, not as security boundaries.
–
Test in staging when possible, using production-like data access patterns without real sensitive data. If you must test in production, limit scope, use test accounts, disable high-risk tools, and put strict rate limits and monitoring in place. A proper prompt injection test focuses on proving outcomes with minimal blast radius, then stops once impact is confirmed.
–
Start with anything that can expose other users’ or other tenants’ data, then lock down tool permissions and high-risk actions with least privilege and approvals. Next, harden retrieval by enforcing role and tenant filtering at the data layer and treating retrieved content as untrusted. After that, reduce sensitive data in context, add logging and detection for injection attempts, and retest to verify the fix actually holds.
Sources: Prompt Injection Security Risks
These resources cover AI security risks, common data leakage paths, and practical best practices for securing enterprise LLM deployments.
Prompt Injection & Model Manipulation
OWASP Top 10 for Large Language Model Applications
https://owasp.org/www-project-top-10-for-large-language-model-applications/
OWASP AI Testing Guide
https://owasp.org/www-project-ai-testing-guide
OWASP LLM01: Prompt Injection
https://genai.owasp.org/llmrisk/llm01-prompt-injection/
MITRE ATLAS — Adversarial Threat Landscape for AI Systems
https://atlas.mitre.org/
Sensitive Data Exposure & Information Disclosure
OWASP LLM02: Sensitive Information Disclosure
https://genai.owasp.org/llmrisk/llm02-sensitive-information-disclosure/
NIST AI Risk Management Framework (AI RMF 1.0)
https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf
IBM — AI Security Risks & Data Privacy
https://www.ibm.com/topics/ai-security
Retrieval-Augmented Generation (RAG) & Data Exposure Risks
NVIDIA — Securing Retrieval-Augmented Generation Pipelines
https://developer.nvidia.com/blog/securing-retrieval-augmented-generation-rag-applications/
Microsoft — AI Red Team Guidance & RAG Security Considerations
https://learn.microsoft.com/security/ai/red-teaming-llms
Google Cloud — Secure AI & Data Access Patterns
https://cloud.google.com/architecture/ai-ml/security-best-practices
System Prompt Exposure & Guardrail Bypass Risks
OpenAI — Safety & Security Considerations for LLM Deployment
https://platform.openai.com/docs/guides/safety-best-practices
Anthropic — Prompt Security & Model Safety Guidance
https://docs.anthropic.com/en/docs/safety
Integration & Workflow Abuse Risks
ENISA — Securing Machine Learning Algorithms
https://www.enisa.europa.eu/publications/securing-machine-learning-algorithms
CISA — AI and Cybersecurity Risk Considerations
https://www.cisa.gov/ai

