6. Output and Trust Exploitation
The Package That Never Existed
A development team uses an AI coding assistant to build a Node.js microservice. The assistant recommends installing a utility package called flask-http-helpers for request validation. The developer runs npm install flask-http-helpers and the package installs successfully. The code works, tests pass, and the service ships to production.
There’s just one problem: flask-http-helpers didn’t exist six months ago. The AI hallucinated the package name – it generated a plausible-sounding but fictional dependency. An attacker, aware that LLMs consistently hallucinate certain package names, registered that exact name on npm with malicious code. The package collects environment variables, API keys, and database credentials, then sends them to a remote server. The development team just installed a supply chain backdoor through a package that only exists because an AI made it up.
This is output exploitation – and it represents a category of attacks where the danger isn’t in what goes into an AI system, but in what comes out of it.
What will I get out of this?
By the end of this section, you will be able to:
- Explain how AI hallucinations can be weaponized into real-world supply chain attacks
- Identify the four major categories of output exploitation: hallucination weaponization, excessive agency, data leakage, and improper output handling
- Trace an output exploitation attack chain from AI-generated content to downstream system compromise
- Describe how sensitive data leaks through AI outputs including training data extraction and membership inference
- Demonstrate improper output handling vulnerabilities where AI-generated content becomes an injection vector
- Assess human over-trust risks and the role of automation bias in output exploitation
- Cite specific incidents involving output exploitation with companies, dates, and outcomes
Hallucination Weaponization
LLM09: Misinformation
Hallucination – an AI model generating plausible but factually incorrect content – is typically discussed as an accuracy problem. But in the right context, hallucination becomes a weapon.
Package Hallucination Attacks
The most concrete demonstration of weaponized hallucination comes from package hallucination research. In a landmark study published at USENIX Security 2025, researchers systematically tested 16 code-generation models across 576,000 generated code samples and found that models hallucinate package names with alarming consistency:
- 19.7% of generated packages were hallucinated (did not exist in any real registry)
- The same models hallucinated the same package names repeatedly – making attacks predictable and scalable
- Across 10 repeated runs, 43% of hallucinated packages appeared in every single run
This consistency is what transforms a reliability problem into a security vulnerability. An attacker doesn’t need to guess which fake packages an LLM will recommend – they can run the same prompts, catalog the hallucinated names, and register them on npm, PyPI, or other package registries.
```mermaid
graph TD
    subgraph "Package Hallucination Attack Pipeline"
        A["Attacker runs code<br/>generation prompts<br/>against popular LLMs"]
        B["Catalogs consistently<br/>hallucinated package names"]
        C["Registers hallucinated<br/>names on npm/PyPI<br/>with malicious code"]
        D["Developer asks AI<br/>for code recommendation"]
        E["AI recommends same<br/>hallucinated package"]
        F["Developer installs<br/>malicious package"]
        G["Attacker gains access<br/>to credentials, env vars,<br/>source code"]
        A -->|"Discovery"| B
        B -->|"Registration"| C
        C -.->|"Lies in wait"| D
        D -->|"Hallucination"| E
        E -->|"Trusted output"| F
        F -->|"Exploitation"| G
    end
    style A fill:#b71c1c,stroke:#7f0000,color:#fff
    style C fill:#e53935,stroke:#b71c1c,color:#fff
    style G fill:#b71c1c,stroke:#7f0000,color:#fff
```
Why this is different from traditional typosquatting: Traditional package name attacks (typosquatting) rely on developers making typos. Package hallucination attacks rely on AI models making consistent, repeatable errors – which are far more predictable and far harder for developers to detect, because the recommendation comes from a trusted AI assistant.
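Because hallucinated names are predictable, the defense can be equally mechanical: verify every AI-recommended dependency against a trusted inventory before it ever reaches an install command. A minimal sketch, where the `KNOWN_PACKAGES` set is an illustrative stand-in for a real registry lookup or internal package mirror:

```python
# Sketch: gate AI-recommended dependencies against a vetted inventory
# before running `npm install` / `pip install`.
# KNOWN_PACKAGES is an illustrative stand-in for a real registry query
# or internal mirror, not an actual package list.

KNOWN_PACKAGES = {"express", "lodash", "axios", "body-parser"}

def vet_recommendation(package_name: str) -> bool:
    """Return True only if the package exists in the vetted inventory."""
    return package_name.lower() in KNOWN_PACKAGES

recommended = ["express", "flask-http-helpers", "axios"]
approved = [p for p in recommended if vet_recommendation(p)]
rejected = [p for p in recommended if not vet_recommendation(p)]
# approved == ["express", "axios"]; rejected == ["flask-http-helpers"]
```

In practice the lookup would query the registry (or a curated allowlist) in real time, but the shape is the same: the AI's recommendation is a suggestion to verify, never a dependency to trust.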
Beyond Packages: Other Weaponized Hallucinations
Package hallucination is the most researched example, but the pattern applies anywhere AI outputs are trusted as factual:
- API endpoint hallucination: AI generates code calling API endpoints that don’t exist – an attacker registers the domain and captures the requests (including authentication tokens)
- Legal citation hallucination: AI generates plausible but fictional case citations, which have already resulted in court sanctions against attorneys who submitted them
- Configuration hallucination: AI recommends security configurations with plausible but incorrect settings that weaken system defenses
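Configuration hallucination in particular is cheap to catch: reject any AI-suggested setting whose key does not appear in the documented schema. A minimal sketch, where `VALID_KEYS` is a hypothetical schema rather than any real product's option list:

```python
# Sketch: flag AI-suggested config keys that don't exist in the
# documented schema. VALID_KEYS is hypothetical, for illustration only.

VALID_KEYS = {"tls_min_version", "session_timeout", "max_login_attempts"}

def validate_config(config: dict) -> list[str]:
    """Return the unrecognized (possibly hallucinated) keys, sorted."""
    return sorted(k for k in config if k not in VALID_KEYS)

ai_suggested = {"tls_min_version": "1.2", "enable_quantum_mode": True}
unknown = validate_config(ai_suggested)
# unknown == ["enable_quantum_mode"] -> reject or escalate for review
```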
Excessive Agency Exploitation
LLM06: Excessive Agency
You encountered LLM06 in Section 5 as the bridge to the OWASP Agentic AI Top 10. Here we examine it from the output perspective: what happens when an AI system takes actions beyond what the user intended or authorized.
Excessive agency in output exploitation manifests as:
- Scope creep in actions: An agent asked to “clean up the database” deletes records that should have been archived
- Unauthorized automation: An AI email assistant asked to “draft a reply” sends the email without confirmation
- Chain-of-action escalation: An agent completing a multi-step task adds steps the user didn’t request – such as “helpfully” sharing results with colleagues via email or Slack
Key insight: Excessive agency attacks don’t require an external attacker. The AI system itself becomes the threat actor when it takes actions beyond its intended scope. This makes it one of the hardest output exploitation categories to detect – the system is “working as designed” from a technical standpoint.
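One common mitigation pattern is an action gate between the model and its tools: read-only actions pass through, while destructive ones require explicit human confirmation. A sketch under assumed tool names (the action lists here are invented for illustration):

```python
# Sketch of an action gate for an AI agent: safe, read-only actions are
# allowed; destructive actions require explicit human confirmation;
# anything unknown is denied by default. Action names are hypothetical.

SAFE_ACTIONS = {"search", "summarize", "draft_email"}
DESTRUCTIVE_ACTIONS = {"send_email", "delete_records", "share_externally"}

def gate_action(action: str, human_confirmed: bool = False) -> bool:
    """Decide whether the agent may perform the requested action."""
    if action in SAFE_ACTIONS:
        return True
    if action in DESTRUCTIVE_ACTIONS:
        # "draft a reply" never silently escalates to "send the reply"
        return human_confirmed
    return False  # deny-by-default for anything unrecognized

# gate_action("draft_email")                        -> True
# gate_action("send_email")                         -> False
# gate_action("send_email", human_confirmed=True)   -> True
```

The deny-by-default branch matters as much as the confirmation step: an agent that invents a tool call it was never granted should fail closed, not fall through to execution.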
Data Leakage Through Outputs
LLM02: Sensitive Information Disclosure
Every AI model retains patterns from its training data. When those patterns include sensitive information – personal data, proprietary code, internal documents, or credentials – the model can leak that information through its outputs.
Major leakage vectors:
Training Data Extraction
Large language models can be prompted to regurgitate portions of their training data verbatim. Researchers have demonstrated extraction of:
- Personally identifiable information (PII): Names, email addresses, phone numbers that appeared in training data
- API keys and credentials: Secrets accidentally included in public code repositories that were part of the training corpus
- Proprietary code: Code snippets from private repositories that were inadvertently included in training data
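A partial mitigation is to scan model output for credential-shaped strings before it reaches the user or a downstream system. A minimal sketch, with a few illustrative patterns rather than an exhaustive secret scanner:

```python
import re

# Sketch: redact credential-shaped substrings from model output before
# returning it. These patterns are illustrative, not a complete scanner.

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key ID shape
    re.compile(r"-----BEGIN (?:RSA )?PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),               # email addresses (PII)
]

def redact_secrets(text: str) -> str:
    """Replace credential-shaped substrings with a redaction marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

out = redact_secrets("key=AKIAIOSFODNN7EXAMPLE contact admin@example.com")
# out == "key=[REDACTED] contact [REDACTED]"
```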
Membership Inference
Even when a model doesn’t leak exact training data, an attacker can determine whether specific data was part of the training set by observing the model’s confidence levels and output patterns. This is a privacy violation in regulated contexts (GDPR, HIPAA) where the mere confirmation that data was used in training may be actionable.
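The simplest form of this attack is a loss-threshold test: records the model was trained on tend to receive lower loss (higher confidence) than unseen records. A toy illustration with fabricated loss values (a real attack would obtain these by querying the model):

```python
# Toy sketch of loss-threshold membership inference. The loss values
# below are fabricated for illustration; in a real attack they would
# come from querying the target model on candidate records.

def infer_membership(loss: float, threshold: float = 1.0) -> bool:
    """Guess 'was in the training set' when the model's loss is suspiciously low."""
    return loss < threshold

observed_losses = {
    "alice@example.com": 0.12,   # memorized training record: very low loss
    "random string xyz": 2.75,   # unseen data: high loss
}
guesses = {k: infer_membership(v) for k, v in observed_losses.items()}
# guesses == {"alice@example.com": True, "random string xyz": False}
```

Real attacks calibrate the threshold with shadow models and reference data, but the core signal is exactly this confidence gap between seen and unseen inputs.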
Improper Output Handling
LLM05: Improper Output Handling
When AI-generated outputs are passed to downstream systems without proper sanitization, the AI becomes an injection vector. The model itself isn’t compromised – but its outputs compromise the systems that consume them.
```mermaid
graph LR
    subgraph "Output Handling Attack Chain"
        User["User Request"]
        LLM["LLM Generates<br/>Response"]
        App["Application<br/>Processes Output"]
        DB["Database"]
        Browser["User Browser"]
        API["Downstream API"]
        User -->|"Crafted prompt"| LLM
        LLM -->|"Output contains<br/>malicious payload"| App
        App -->|"SQL Injection"| DB
        App -->|"XSS Payload"| Browser
        App -->|"Command Injection"| API
    end
    style LLM fill:#ffa726,stroke:#e65100,color:#fff
    style DB fill:#ef5350,stroke:#c62828,color:#fff
    style Browser fill:#ef5350,stroke:#c62828,color:#fff
    style API fill:#ef5350,stroke:#c62828,color:#fff
```
Attack scenarios:
- XSS through LLM-generated HTML: A chatbot generates a response containing `<script>` tags. If the application renders this as HTML without sanitization, the script executes in the user's browser.
- SQL injection through LLM-generated queries: An application uses an LLM to generate SQL queries from natural language. A user crafts a prompt that causes the LLM to generate a query containing `'; DROP TABLE users; --`.
- Command injection through LLM-generated shell commands: An AI assistant generates a shell command that includes user-controlled input without escaping, allowing arbitrary command execution on the server.
- SSRF through LLM-generated URLs: An LLM generates URLs that point to internal services (e.g., `http://169.254.169.254/` for cloud metadata), enabling server-side request forgery.
The Trust Boundary Problem
Improper output handling is fundamentally a trust boundary violation. Applications treat LLM output as trusted data when it should be treated as untrusted user input. Every piece of AI-generated content that flows into a downstream system – a database query, an HTML page, an API call, a shell command – must be validated and sanitized just like any other external input.
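Concretely, that means applying the same defenses used for user input: context-appropriate escaping for HTML, and parameterized queries instead of string-built SQL. A minimal sketch using Python's standard library:

```python
import html
import sqlite3

# Sketch: treat LLM output as untrusted input at every trust boundary.

def render_safe(llm_output: str) -> str:
    """HTML-escape model output before inserting it into a page."""
    return html.escape(llm_output)

payload = '<script>alert("xss")</script>'
safe = render_safe(payload)
# safe == '&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;'

# For SQL, never interpolate model output into the query string;
# bind it as a parameter so the payload is stored as inert data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute(
    "INSERT INTO users (name) VALUES (?)",
    ("'; DROP TABLE users; --",),  # stored as text; the table survives
)
```

The same principle extends to the other sinks in the diagram: shell commands get argument arrays rather than string concatenation, and generated URLs get allowlist checks before any server-side fetch.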
Human Over-Trust
ASI09: Human-Agent Trust Exploitation
Human over-trust isn’t a technical vulnerability in the traditional sense – it’s a behavioral vulnerability that amplifies every other output exploitation category. When users trust AI outputs without verification, every hallucination, every data leak, and every excessive action goes unchecked.
Automation bias – the tendency to favor suggestions from automated systems over contradictory information from non-automated sources – is well-documented in aviation, healthcare, and manufacturing. AI systems trigger the same bias, often more intensely because:
- AI outputs are presented with high confidence regardless of actual certainty
- AI assistants have often been right before, building a trust baseline
- Verifying AI output requires effort and expertise that the user was trying to avoid by using AI in the first place
- AI outputs are formatted professionally, creating an appearance of authority
Practical implications for security:
- Code review agents that flag “no issues found” discourage human reviewers from looking deeper
- AI-generated security assessments may be accepted without validation
- AI summaries of lengthy documents may omit critical details that a human reader would catch
- AI-recommended configurations may be deployed without security review because “the AI checked it”
Case Studies
Case Study 1: Samsung ChatGPT Data Leak (2023)
LLM02: Sensitive Information Disclosure
Company: Samsung Electronics
Date: March-April 2023 (three separate incidents within 20 days)
Product: ChatGPT (used by Samsung semiconductor engineers)
In three separate incidents over a 20-day span, Samsung semiconductor engineers pasted confidential information into ChatGPT:
- Incident 1: An engineer pasted proprietary source code from a semiconductor database to ask ChatGPT to identify and fix bugs
- Incident 2: An engineer submitted confidential code related to yield and defect measurement equipment
- Incident 3: An engineer pasted an internal meeting transcript and asked ChatGPT to generate meeting minutes
In each case, the confidential data became part of OpenAI’s training pipeline (under the default data usage policy at the time). Samsung’s proprietary semiconductor designs, manufacturing processes, and internal strategy discussions were now potentially accessible to any future ChatGPT user whose queries triggered relevant pattern completions.
Outcome: Samsung initially restricted and then banned internal use of ChatGPT and similar generative AI tools. The company developed internal AI tools with data containment controls. The incident became a watershed case study for enterprise AI data governance policies worldwide. OpenAI subsequently introduced the ability to opt out of training data collection, but by the time Samsung data was submitted, the default policy included it.
Case Study 2: Package Hallucination Attacks (USENIX 2025)
LLM09: Misinformation
Researchers: Spracklen et al., University of Texas at San Antonio / USENIX Security 2025
Date: 2025 (published)
Scope: 16 code-generation LLMs tested across 576,000 generated code samples
The USENIX Security 2025 study “We Have a Package for You” provided the first systematic, large-scale analysis of package hallucination as a security vulnerability. Key findings:
- Scope: 19.7% of all packages recommended by LLMs across the study did not exist in any real package registry
- Consistency: 43% of hallucinated packages appeared across all 10 repeated experimental runs, making them highly predictable targets for attackers
- Cross-model correlation: Different LLMs hallucinated many of the same package names, meaning an attacker’s registered malicious package could be recommended by multiple AI assistants
- Language variation: Python packages were hallucinated at lower rates than JavaScript packages, but both rates were significant enough for exploitation
Outcome: The research led to calls for package registry protections against AI-hallucinated names, integration of real-time package verification into AI coding assistants, and new defensive research into hallucination detection specifically for code generation contexts. Several package registries began exploring “hallucination-aware” name reservation systems.
Key Takeaways
- AI hallucinations become weaponized when attackers register package names, domains, or API endpoints that models consistently and predictably fabricate.
- Sensitive data leaks through AI outputs via training data extraction (divergence attacks), memorization of PII and credentials, and membership inference.
- Improper output handling turns the LLM into an injection vector – AI-generated content can carry XSS, SQL injection, and command injection payloads into downstream systems.
- Automation bias causes humans to over-trust AI outputs, especially after the system has built a track record of accuracy.
- Every piece of AI-generated content that flows into a downstream system must be validated and sanitized as untrusted input, not trusted data.
Test Your Knowledge
Ready to test your understanding of output exploitation techniques? Head to the quiz to see how well you can identify hallucination weaponization, data leakage vectors, and improper output handling vulnerabilities.
Up next
We’ve now covered attacks that target AI outputs and the trust placed in them. In the next section, we’ll explore a rapidly growing threat surface: Small Language Models (SLMs). As organizations deploy smaller models on edge devices, phones, and IoT systems, they’re discovering that smaller doesn’t mean safer – and in many cases, smaller models are easier to attack.