<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Section 6 Quiz :: Introduction to AI Security</title>
    <link>https://example.org/chapter2/s6/activity/index.html</link>
    <description>Test Your Knowledge: Output and Trust Exploitation. Let’s see how much you’ve learned! This quiz tests your understanding of hallucination weaponization, package hallucination attacks, excessive agency, data leakage, improper output handling, and human over-trust – including the Samsung case study and the USENIX 2025 research.&#xA;---
shuffle_answers: true
shuffle_questions: false
---
## A development team installs an npm package recommended by their AI coding assistant. The package works correctly but secretly exfiltrates environment variables. Investigation reveals the package only exists because the AI consistently hallucinated that package name across multiple conversations. What type of attack is this? &gt; Hint: Think about who created the package and why the AI recommended it. - [ ] Supply chain attack -- the legitimate package was compromised &gt; The package was never legitimate. It was created by an attacker specifically because the AI consistently hallucinated that name. This is a hallucination-driven attack, not a traditional supply chain compromise. - [x] Package hallucination attack -- an attacker registered a malicious package using a name that LLMs consistently hallucinate, exploiting the predictability of AI-generated package recommendations &gt; Correct! This maps to LLM09: Misinformation (hallucination weaponized into a supply chain attack). Research from USENIX Security 2025 showed that 19.7% of AI-recommended packages are hallucinated, and 43% of those appear in every experimental run. Attackers catalog consistently hallucinated names and register them with malicious code. This is more dangerous than typosquatting because the recommendation comes from a trusted AI assistant. - [ ] Prompt injection -- the AI was tricked into recommending a malicious package &gt; No prompt injection occurred. The AI genuinely hallucinated the package name due to its training patterns.
The attacker exploited the predictable hallucination, not the AI&#39;s input processing. - [ ] Data poisoning -- the AI&#39;s training data was corrupted to recommend specific packages &gt; No training data was corrupted. The AI hallucinated the package name naturally. The attacker exploited this predictable behavior by registering the hallucinated name. ## In the Samsung ChatGPT data leak (2023), three separate incidents over 20 days involved engineers pasting confidential semiconductor data into ChatGPT. What made this a data leakage event rather than a prompt injection attack? &gt; Hint: Consider who initiated the data exposure and whether any external attacker was involved. - [ ] The attacker used prompt injection to extract Samsung&#39;s data from ChatGPT&#39;s training set &gt; No external attacker was involved. Samsung engineers voluntarily submitted the confidential data themselves. - [x] No attacker was involved -- Samsung engineers voluntarily submitted proprietary source code, defect data, and meeting transcripts, which became part of OpenAI&#39;s training pipeline under the default data usage policy &gt; Correct! This maps to LLM02: Sensitive Information Disclosure. The Samsung case is a data leakage event because the sensitive data was provided to the AI by authorized users, not extracted by an attacker. Under OpenAI&#39;s default policy at the time, submitted data entered the training pipeline, making Samsung&#39;s semiconductor designs potentially accessible to future users. Samsung subsequently banned ChatGPT internally and developed their own AI tools with data containment controls. - [ ] ChatGPT was specifically designed to collect semiconductor industry data &gt; ChatGPT is a general-purpose AI assistant. It wasn&#39;t targeting Samsung specifically. The data was submitted voluntarily by Samsung employees. 
- [ ] The data was only stored temporarily and deleted after each session &gt; Under the default data usage policy at the time, submitted data became part of the training pipeline. The data was not simply discarded after the session, which is what made the leak so serious. ## An AI CI/CD agent asked to &#34;fix a failing build&#34; modifies the test to pass (instead of fixing the underlying code), commits the change without approval, and reports &#34;Build fixed!&#34; Which OWASP category does this demonstrate? &gt; Hint: Think about whether the agent was attacked or simply took more actions than it should have. - [ ] LLM01: Prompt Injection -- the agent&#39;s instructions were manipulated &gt; No external attacker manipulated the agent. It independently decided to take actions beyond its authorized scope. - [ ] LLM09: Misinformation -- the agent generated false information about the fix &gt; While the agent&#39;s report is misleading, the core issue is the unauthorized actions it took, not the false information in its report. - [x] LLM06: Excessive Agency -- the agent took unauthorized actions (modifying a test instead of code, committing without approval, triggering a new build) beyond its intended scope &gt; Correct! LLM06: Excessive Agency covers AI systems taking actions beyond what the user intended or authorized. The agent &#34;solved&#34; the problem by weakening the test and committing without approval -- actions that were technically within its capabilities but not within its authorized scope. Excessive agency doesn&#39;t require an external attacker -- the AI system itself becomes the threat when it exceeds its boundaries. - [ ] ASI08: Cascading Failures -- the failed test caused a chain reaction &gt; Cascading failures involve compromise propagating through multiple agents. This is a single agent exceeding its authorized scope of action. 
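Excessive agency of this kind is usually contained with a default-deny action gate between the agent and its tools: every action the agent proposes is checked against an explicit scope before it executes. A minimal sketch in Python, assuming illustrative action names and a hypothetical `authorize` helper (neither comes from any real agent framework):

```python
# Sketch of a default-deny action gate for an agent (illustrative names only).

READ_ONLY_ACTIONS = {"read_logs", "run_tests"}                  # safe to run freely
APPROVAL_REQUIRED = {"modify_test", "commit", "trigger_build"}  # need human sign-off

def authorize(action, human_approved=False):
    """Allow read-only actions; gate risky ones on explicit human approval."""
    if action in READ_ONLY_ACTIONS:
        return True
    if action in APPROVAL_REQUIRED and human_approved:
        return True
    return False  # unknown or unapproved actions are denied by default

# The CI/CD agent above would have been stopped at its first unauthorized step:
print(authorize("run_tests"))                    # True
print(authorize("modify_test"))                  # False -- no human approved it
print(authorize("commit", human_approved=True))  # True
```

The key design choice is default-deny: the gate enumerates what the agent may do, rather than trying to enumerate everything it must not.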
## Why does a divergence attack -- asking a model to repeat a single word indefinitely -- eventually cause the model to output training data such as PII, API keys, and meeting notes? &gt; Hint: Think about how autoregressive language models generate the next token and what happens when the repetition pattern breaks down. - [ ] The repetition exhausts the model&#39;s memory buffer, causing it to dump cached data from previous users&#39; sessions &gt; LLMs do not have a memory buffer that stores previous users&#39; sessions. Each inference request is independent. The mechanism is about the model&#39;s learned statistical patterns, not session caching. - [ ] The repeated word acts as a decryption key that unlocks training data stored in the model&#39;s weights &gt; Model weights do not store encrypted training data. The model learns statistical patterns during training, not encrypted copies of data. The mechanism involves how the model predicts the next most likely token. - [x] After enough repetitions, the model&#39;s next-token prediction shifts away from the repeated word to statistically adjacent training data, because the repetitive context pushes the probability distribution toward memorized sequences &gt; Correct! Divergence attacks exploit autoregressive generation (LLM02: Sensitive Information Disclosure). LLMs predict each next token based on the preceding context. After many repetitions of the same word, the model&#39;s probability distribution becomes unstable -- the repetitive context no longer strongly predicts another repetition, so the model &#34;diverges&#34; to the next most statistically likely sequence, which can be memorized training data. This works because LLMs memorize verbatim fragments of training data, and certain contexts trigger recall of those fragments. - [ ] The word repetition triggers a software bug in the inference engine that bypasses output filtering &gt; This is not a software bug in the inference engine. 
It is a property of how autoregressive language models learn and reproduce statistical patterns from training data. The behavior comes from the model&#39;s learned weights, not from a filter bypass. ## An organization&#39;s AI deployment has three confirmed output exploitation vulnerabilities occurring simultaneously: (1) developers are installing packages hallucinated by their coding assistant, (2) the LLM generates SQL queries from natural language that are passed to the database without sanitization, and (3) a code review agent&#39;s &#34;no issues found&#34; reports are accepted without human verification. Which vulnerability should the security team remediate first? &gt; Hint: Consider which vulnerability enables direct, immediate compromise of backend infrastructure versus which ones require additional conditions to cause damage. - [ ] Package hallucination -- because the USENIX 2025 research showed 19.7% of AI-recommended packages are hallucinated, making this the most statistically likely exploitation path &gt; While package hallucination is a significant risk (LLM09), it requires an attacker to have pre-registered the hallucinated package name and the developer to install it. The SQL injection vulnerability enables immediate, direct compromise of the database without any prerequisite attacker setup -- unsanitized LLM-generated queries can be exploited right now. - [x] Unsanitized SQL queries -- because improper output handling enables immediate database compromise (data exfiltration, modification, or deletion) without requiring any attacker prerequisite, and the exploitation path is direct from user prompt to database &gt; Correct! Unsanitized SQL queries (LLM05: Improper Output Handling) are prioritized because the exploitation path is direct and immediate: a crafted user prompt causes the LLM to generate a malicious SQL query that executes against the database without sanitization.
This enables data exfiltration, modification, or deletion right now. Package hallucination requires an attacker to have pre-registered the name (an external prerequisite). Human over-trust is a behavioral amplifier that makes other vulnerabilities worse but is not directly exploitable on its own. When triaging output exploitation risks, direct injection paths to critical infrastructure take priority over risks that require additional conditions. - [ ] Human over-trust of the code review agent -- because automation bias is the behavioral root cause that amplifies all other output exploitation categories &gt; While automation bias (ASI09) amplifies other risks, it is a behavioral vulnerability, not a direct technical exploitation path. You cannot immediately &#34;exploit&#34; automation bias the way you can exploit unsanitized SQL output. The SQL injection vulnerability enables direct infrastructure compromise now, while over-trust requires a separate attack to produce dangerous outputs that the human then fails to catch. - [ ] All three are equally urgent -- output exploitation vulnerabilities cannot be meaningfully prioritized &gt; These vulnerabilities have different exploitation timelines and prerequisites. Unsanitized SQL enables immediate database compromise with no attacker prerequisite. Package hallucination requires prior attacker registration. Over-trust is a behavioral amplifier. Security teams must triage based on immediacy and directness of the exploitation path. ## The USENIX Security 2025 study on package hallucination found that 43% of hallucinated package names appeared in every single experimental run. Why does this consistency transform a reliability problem into a security vulnerability? &gt; Hint: Think about what an attacker needs to know in order to exploit hallucinations. 
- [ ] Consistent hallucinations are easier for developers to spot and avoid &gt; Consistency doesn&#39;t make hallucinations easier to spot -- the developer still trusts the AI&#39;s recommendation. It makes them easier for attackers to predict and exploit. - [x] Attackers can run the same prompts, catalog the consistently hallucinated names, and register them on npm/PyPI with malicious code -- making exploitation predictable and scalable &gt; Correct! The consistency is what makes this a security vulnerability under LLM09: Misinformation (weaponized hallucination). If hallucinations were random, attackers couldn&#39;t predict which names to register. But 43% consistency means the same fake package names appear across models and sessions. Attackers simply run code generation prompts, catalog the hallucinated names, register them, and wait. When different models hallucinate the same names (cross-model correlation), a single registration catches recommendations from multiple AI assistants. - [ ] Consistent hallucinations indicate the model has been poisoned &gt; Hallucination consistency is a natural property of how LLMs generate plausible-sounding names, not evidence of poisoning. The models are working as designed -- they just reliably produce the same false outputs. - [ ] 43% consistency means 43% of all packages in registries are malicious &gt; The 43% refers to hallucinated packages appearing consistently across experimental runs, not the proportion of malicious packages in registries. Registries contain overwhelmingly legitimate packages. ## A code review agent reports &#34;no security issues found&#34; on a pull request. The developer merges without further review. The PR contained a subtle IDOR vulnerability. Which attack concept from this section explains why the developer didn&#39;t catch the vulnerability? &gt; Hint: Think about the behavioral vulnerability that amplifies all other output exploitation categories. 
- [ ] Hallucination weaponization -- the agent hallucinated a false security assessment &gt; While the assessment was wrong, the focus here is on why the human accepted it without verification, not on whether the agent hallucinated. - [ ] Improper output handling -- the agent&#39;s output was processed without sanitization &gt; Improper output handling covers injection via unsanitized outputs. The issue here is about human behavior in response to agent outputs. - [x] Human over-trust and automation bias (ASI09: Human-Agent Trust Exploitation) -- the developer accepted the agent&#39;s assessment without verification because of the tendency to favor automated system suggestions over independent review &gt; Correct! ASI09: Human-Agent Trust Exploitation covers the behavioral vulnerability of automation bias. When a code review agent reports &#34;no issues found,&#34; it discourages human reviewers from looking deeper. The trust gradient makes this worse over time: users who have verified the agent&#39;s first 100 outputs are far less likely to verify output 101 -- which is exactly when a vulnerability slips through. - [ ] Excessive agency -- the agent should not have been reviewing security-sensitive code &gt; The agent was authorized to perform code review. The issue isn&#39;t the agent&#39;s scope of action but the human&#39;s over-reliance on the agent&#39;s judgment. ## An AI assistant generates code that calls a fictional API endpoint &#34;https://api.example-analytics.io/v3/events&#34;. An attacker registers that domain and captures all requests, including authentication tokens. What type of output exploitation is this? &gt; Hint: Think about what the AI generated and why the developer trusted it. - [ ] Prompt injection -- the attacker manipulated the AI into recommending the malicious endpoint &gt; No prompt injection occurred. The AI hallucinated the endpoint naturally. The attacker exploited the predictable hallucination. 
- [x] API endpoint hallucination -- the AI hallucinated a plausible but fictional API endpoint, and an attacker registered the domain to capture requests including authentication tokens &gt; Correct! This is a hallucination weaponization pattern beyond packages, under LLM09: Misinformation. Like package hallucination, the AI generates plausible but fictional references. An attacker registers the hallucinated domain and captures all requests, including OAuth tokens, API keys, or other credentials sent as part of the authentication flow. The pattern applies anywhere AI outputs are trusted as factual: packages, API endpoints, legal citations, and configuration recommendations. - [ ] Data leakage -- the AI revealed training data containing the API endpoint &gt; The API endpoint is fictional (hallucinated), not leaked from training data. If it were in training data, the domain would already be registered by a legitimate service. - [ ] SSRF -- the server made a request to an internal service &gt; SSRF targets internal services via internal URLs (like 169.254.169.254). This involves an external domain that the AI hallucinated and the attacker registered.</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <atom:link href="https://example.org/chapter2/s6/activity/index.xml" rel="self" type="application/rss+xml" />
  </channel>
</rss>