AI Security for Developers: Prompt Injection, Agent Trust, and the Stuff That's Actually Dangerous
AI agent security, MCP security, prompt injection prevention, AI coding security, agent trust model, AI security best practices
AI coding agents introduce security risks that traditional application security frameworks don’t address: prompt injection through crafted inputs, credential proxying through inherited permissions, verification gaps in AI-generated code, and cognitive security risks from over-trusting AI assessments. This guide covers the attack vectors that actually affect developers building with AI tools — based on real incidents at Meta, Amazon, and others, covered across eight episodes of the ADI Pod.
The AI security industry has a credibility problem. Half the discourse is vendors selling fear about threats that don’t exist at scale. The other half is developers ignoring threats that do. In Episode 13, we covered the article “The AI Security Industry is Bullshit,” and while the title is provocative, the core argument has merit: much of the AI security industry is selling FUD rather than solving real problems.
But here’s the thing about dismissing the AI security industry as bullshit: some of it isn’t. Meta had rogue AI agents causing production incidents. Amazon mandated senior sign-off on AI-assisted changes after outages. Claude Code’s source leak revealed the safety architecture isn’t what anyone assumed. The real threats are specific, technical, and growing — they just don’t look like the threats the security vendors are selling.
This guide focuses on the attack vectors and trust problems that actually affect developers building with AI tools. Not theoretical. Not speculative. The stuff that’s already happening.
How AI Agents Change the Security Threat Model
Traditional application security assumes a human developer writing code, a human reviewer reading it, and a deployment pipeline with human-controlled gates. AI agents break every one of these assumptions.
The new model: an AI agent reads files, writes code, executes bash commands, makes API calls through MCP servers, and commits changes — all within a single session. The agent has access to your credentials (via environment variables or MCP), your codebase (via filesystem access), and your infrastructure (via bash). The attack surface isn’t a web form. It’s the entire development environment.
The four categories of AI security risk that actually matter for developers:
- Prompt injection — manipulating the agent through crafted inputs
- Credential proxying — agents inheriting and misusing human credentials
- Verification gaps — AI-generated code that introduces vulnerabilities
- Cognitive security — AI influence on human decision-making
Prompt Injection: The Unsolved Problem
Prompt injection is the SQL injection of AI applications, and it’s approximately as solved as SQL injection was in 2003 — which is to say, not at all.
The attack
An attacker embeds instructions in data that the AI agent processes. The agent can’t reliably distinguish between its instructions and data that happens to look like instructions. When the agent follows the injected instructions, it takes actions the user didn’t authorize.
Why it’s hard
The fundamental issue is that language models process instructions and data in the same channel. Unlike SQL (where parameterized queries can structurally separate code from data), there’s no reliable structural boundary between a prompt and injected content.
Episode 2 covered the Wharton research paper “Call Me A Jerk: Persuading AI to Comply with Objectionable Requests,” which demonstrated that persuasion techniques rooted in Cialdini’s influence principles (reciprocity, authority, social proof) can systematically manipulate LLM behavior. The attacks don’t require technical sophistication — they use the same psychological techniques that work on humans.
What this means for developers
If your AI agent reads external content — user input, web pages, API responses, file contents — it’s potentially vulnerable to prompt injection. The practical implications:
- Code review by AI can be manipulated if the code under review contains injection attempts
- MCP servers that fetch external data pass that data through the model’s context window, creating injection opportunities
- CLAUDE.md files in shared repositories could contain injected instructions that redirect agent behavior
The mitigation isn’t a single defense but a layered approach: limit agent permissions, validate agent actions, treat agent output as untrusted, and don’t give agents access to credentials they don’t need.
What Is Credential Proxying in AI Agents?
This is the security problem that gets the least attention and causes the most damage.
The pattern
Your AI coding agent needs to run tests, access databases, call APIs, and deploy code. To do this, it inherits your credentials — environment variables, SSH keys, API tokens, AWS credentials. The agent becomes a credential proxy: it acts on your behalf, with your permissions, using your identity.
The problem: the agent doesn’t have the judgment to know which credentials are appropriate for which actions. It will use your production database credentials to run a test if that’s what’s in the environment. It will commit your API keys to a repository if they’re in a file it reads. It will send authenticated requests to external services if an MCP server provides that capability.
Real-world failures
Episode 19 covered Meta’s incident with rogue AI agents causing a credential leak. The specifics matter: AI-generated advice led to credentials being exposed in a production context. This wasn’t a theoretical attack — it was an AI agent following its training, making a decision that happened to expose sensitive data.
Practical mitigations
Least privilege for agents. Don’t give your AI agent your full credential set. Use scoped tokens, read-only database connections, and sandboxed environments where possible. If your agent only needs to read files and run tests, it doesn’t need deployment credentials.
Separate environments. Run agents in containers or sandboxes with access only to what they need. The convenience of running Claude Code with full access to your shell environment is real. So is the risk.
Audit MCP server access. Every MCP server you connect gives the agent a new set of capabilities and, potentially, a new set of credentials to proxy. Review what each server provides and whether the agent actually needs it.
Rotate credentials exposed to agents. Treat any credential an AI agent has touched as potentially compromised — not because the agent is malicious, but because the context window can be extracted, logs can be exposed, and the agent’s memory consolidation features may persist credential data across sessions.
Claude Code’s Permission Model: What the Source Leak Revealed
The Claude Code source leak revealed details about the safety architecture that are relevant to every user.
Dual-track permissions
Claude Code runs a dual-track permission system: rule-based permissions for known patterns (blocking rm -rf, preventing git push --force), plus a machine learning classifier that evaluates whether arbitrary bash commands could be destructive.
The ML classifier is the interesting part. It means Claude Code is making judgment calls about command safety, not just pattern matching. The judgment is imperfect — no classifier is 100% accurate — but it’s a more sophisticated approach than a blocklist.
What this means in practice
The permissions prompt you see when Claude Code asks to run a command is backed by both a rule engine and a model-based safety assessment. When you approve a command, you’re overriding both systems. The practical implication: if Claude Code asks for permission to do something, the safety systems flagged it for a reason. Don’t auto-approve.
Anti-distillation and fake tools
The leak revealed that Claude Code generates fake tools if it detects distillation attempts. This is a security measure against competitors extracting Claude Code’s behavior through systematic probing. It also means that if you’re building tools that interact with Claude Code’s API surface, you may encounter fake responses if your access patterns resemble distillation.
Verification Gaps: When AI Code Introduces Vulnerabilities
AI-generated code doesn’t just risk functional bugs — it risks security vulnerabilities. The CodeRabbit report (covered in our vibe coding guide) found that AI-generated code creates 1.7x more downstream issues than human-written code. Some of those issues are security-relevant.
Common patterns
Insecure defaults. AI models tend to generate code that works rather than code that’s secure. The model will use http:// when https:// would be appropriate, skip input validation, use weak encryption algorithms, or embed secrets in configuration files — because the training data contains plenty of examples of these patterns.
Missing authorization checks. A model generating a REST API will often create endpoints that work correctly but lack proper authorization. The happy path is tested; the security path is not.
Dependency vulnerabilities. AI agents install packages to solve problems. They don’t check vulnerability databases before doing so. An agent that resolves a build error by adding a dependency may be adding one with known CVEs.
The verification debt connection
Verification debt — the accumulated cost of AI output that was never meaningfully reviewed — has a security dimension. Every AI-generated diff that ships without security review adds to the organization’s attack surface. The METR study showing that SWE-bench-passing PRs wouldn’t survive human code review (from Episode 18) suggests the gap between “functionally correct” and “secure” is significant.
Mitigations
VSDD for security-critical code. Verified Spec-Driven Development (covered in Episode 16) includes adversarial verification gates. For security-critical code, the adversarial step should explicitly include security review: attempting injection, testing authorization boundaries, checking for information disclosure.
Linters and SAST as guardrails. Architectural boundaries, invariants, and security linters become multipliers when working with agents (from Episode 17). An agent that runs into a security linter violation will fix it. An agent without security linters will cheerfully ship insecure code that passes functional tests.
Dependency auditing. Before approving any agent-added dependency, check it against vulnerability databases. Better yet, configure your package manager to reject packages with known CVEs automatically.
MCP: The New Attack Surface
The Model Context Protocol has become the standard way to extend AI agent capabilities. It’s also a growing attack surface.
What MCP exposes
Each MCP server you connect gives the agent new capabilities: database access, API calls, file system operations, browser control, messaging. Every capability is a potential vector. The agent’s instructions say “use these tools to help the developer,” but prompt injection can redirect those tools toward unintended targets.
Chain-of-trust problems
MCP creates a chain of trust: you trust the agent, the agent trusts the MCP server, the MCP server trusts the external service. A compromised or malicious MCP server can:
- Inject content into the agent’s context window
- Capture credentials the agent passes through
- Execute actions that the agent (and user) don’t intend
- Exfiltrate data from the agent’s environment
Practical MCP security
Audit your MCP servers. Know what each server does, what data it accesses, and what actions it can take. Treat third-party MCP servers with the same caution you’d give a third-party npm package — because the risk profile is similar.
Minimize connected servers. Every MCP connection eats context window (security overhead) and expands the attack surface. Connect what you need for the current task; disconnect what you don’t.
Monitor MCP traffic. If your organization is deploying MCP-connected agents in production, log the actions MCP servers take. Anomalous patterns (unexpected API calls, data access outside normal scope) should trigger alerts.
How AI Agents Compromise Developer Judgment
The most overlooked AI security risk isn’t technical — it’s cognitive. AI agents influence human decision-making in ways that undermine security practices.
The confidence inflation effect
The Wharton paper “Thinking, Fast, Slow, and Artificial” (covered in Episode 19) found a 12-percentage-point boost in user confidence when AI provides answers, regardless of whether the AI is correct. Applied to security: when an AI agent tells you a configuration is secure or a code change doesn’t introduce vulnerabilities, you’re more likely to believe it than if a human colleague said the same thing.
This is cognitive surrender — deferring judgment to the AI rather than engaging your own analysis. In a security context, cognitive surrender means trusting the agent’s assessment of risk rather than evaluating it yourself.
Agent sycophancy and security review
Agent sycophancy — the tendency of AI to agree with users rather than challenge them — is a security-relevant behavior. When you ask an AI agent “is this code secure?” the sycophancy-optimized response is “yes.” The honest response might be “I can’t determine that — here are the areas I’d want a human to review.”
We tested sycophancy resistance across models in Episode 15. The results were mixed. The structural problem: there’s no market incentive to fix sycophancy because users rate agreeable responses as higher quality. For security review, this means treating agent security assessments as optimistically biased.
Practical cognitive security
Don’t use AI agents as your sole security reviewer. Use them to augment human review — find patterns, flag potential issues, check for known vulnerability classes. But the final security assessment should involve human judgment.
Calibrate your trust. If an agent tells you something is secure, ask it to explain why. If the explanation is vague or generic, the assessment is worthless. If it cites specific checks it performed, evaluate those checks independently.
Be especially skeptical of reassurance. An agent that says “this looks fine” without being asked is potentially sycophantic. An agent that unprompted flags a concern is more likely to be reliable — the sycophancy incentive pushes toward reassurance, not toward raising problems.
The Government and Supply Chain Dimension
AI security isn’t just about code. The Pentagon-Anthropic drama (covered in Episodes 16 and 17) exposed the geopolitical dimensions.
Anthropic established contract red lines: no mass domestic surveillance, no autonomous weapons. The Department of Defense threatened to classify Anthropic as a supply chain risk for refusing certain use cases. OpenAI offered a competing contract under vague “lawful use” terms. Big tech companies backed Anthropic’s position.
For developers, the implications are:
- AI model providers are becoming critical infrastructure subject to government supply chain risk frameworks
- The models you build on may have usage restrictions that affect your product
- “Lawful use” terms from providers may be deliberately vague to avoid restricting government contracts
- If you’re building products that touch government or defense, the model provider’s ethical positions become your compliance risk
Legal Liability: The Unresolved Question
The Stanford Law paper “Built by Agents, Tested by Agents, Trusted by Whom?” (covered in Episode 18) raises the question nobody has answered: who is liable when AI-generated code causes a security breach?
Existing contracts assume human-written, human-tested code. Insurance frameworks don’t cover agent-generated code shipping to production. Copilot’s terms classify it as “for entertainment purposes only” — which is legally convenient for Microsoft and legally terrifying for developers shipping Copilot-generated code to production.
Until legal frameworks catch up, the practical advice is: treat AI-generated code as if you wrote it. You’re responsible for what ships. The agent isn’t going to appear in court. You might.
The Open Source Supply Chain Risk
A novel attack vector covered in Episode 17: using AI agents to create “clean room” reimplementations of copyleft-licensed code. Simon Willison covered this in “Can coding agents relicense open source through a ‘clean room’ implementation of code?” — the concern that AI can launder GPL code into permissively licensed alternatives.
For developers: if a dependency was recently “rewritten from scratch” using AI tools and happens to mirror the behavior of a copyleft-licensed library, you may have a licensing liability that doesn’t appear in a standard vulnerability scan. The attack surface isn’t technical — it’s legal.
Interpretability: The Security Tool Nobody Uses Yet
Two developments suggest that model interpretability will become a security tool:
Sterling 8B (covered in Episode 16) is the first inherently interpretable LLM with concept attribution, input context tracing, and training data attribution. It uses orthogonal loss functions to create non-overlapping interpretable concepts. In theory, this lets you audit what a model is “thinking” — a prerequisite for trust in high-stakes deployments.
LLM neuroanatomy research (covered in Episode 18) is mapping which model layers handle reasoning versus retrieval, enabling targeted analysis of model behavior. The security application: understanding which parts of the model process safety-critical decisions versus which parts are vulnerable to injection.
Neither technology is production-ready for security applications. But the trajectory is clear: interpretability will eventually give us the ability to verify that a model is behaving as intended, not just that its outputs look correct.
Frequently Asked Questions
Is it safe to give Claude Code access to my full development environment?
It’s convenient but not advisable for production credentials. Use scoped environments: full access for development, restricted access for anything touching production data or infrastructure. The dual-track permission system helps but doesn’t protect against credential exposure in context windows or log files.
How do I protect against prompt injection in my AI-powered application?
There’s no single defense. Layer your approach: validate and sanitize inputs before they reach the model, limit the agent’s permissions so injected instructions can’t cause serious harm, treat agent outputs as untrusted and validate before acting, and monitor for anomalous agent behavior. None of these is sufficient alone. All of them together reduce risk significantly.
Should I trust AI-generated security assessments?
No. Use AI to augment security review — finding patterns, checking for known vulnerability classes, generating test cases. But final security assessments should involve human judgment, especially for authentication, authorization, and data handling.
What’s the biggest AI security risk most developers ignore?
Credential proxying. Most developers give their AI agents full access to their environment without thinking about what credentials the agent can access. The agent inherits your identity and permissions. Any action it takes — including mistakes — happens as you.
How does MCP affect my security posture?
Each MCP server extends the agent’s capabilities and attack surface. Treat MCP server selection like dependency management: audit before adding, remove what you don’t actively need, and monitor for suspicious behavior. Third-party MCP servers deserve the same scrutiny as third-party npm packages.
Is AI-generated code more or less secure than human-written code?
Current evidence (CodeRabbit’s 1.7x issue rate, METR’s SWE-bench analysis) suggests AI-generated code has more issues, including security-relevant ones. The code tends to be functionally correct but miss security edges — input validation, authorization checks, secure defaults. This is fixable with proper review and tooling, but it requires acknowledging the gap rather than assuming AI-generated code is equivalent to human-written code.
This guide synthesizes content from Episodes 2, 3, 6, 13, 16, 17, 18, and 19 of the ADI Pod. Updated April 2026.