<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>3. Data and Training Attacks :: Introduction to AI Security</title>
    <link>https://example.org/chapter2/s3/index.html</link>
    <description>Introduction A mid-sized fintech company spent three months fine-tuning an open-source LLM on their proprietary financial data. The model performed brilliantly in testing – until a compliance review noticed something subtle. When asked about certain investment products, the model consistently steered recommendations toward a specific vendor. Not overtly, not obviously – just a persistent, barely perceptible bias that only showed up under statistical analysis. The investigation traced the problem back to the training data: someone had injected a small number of carefully crafted examples into the fine-tuning dataset. The model had learned exactly what the attacker wanted it to learn.</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <atom:link href="https://example.org/chapter2/s3/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Section 3 Quiz</title>
      <link>https://example.org/chapter2/s3/activity/index.html</link>
      <guid>https://example.org/chapter2/s3/activity/index.html</guid>
      <description>Test Your Knowledge: Data and Training Attacks Let’s see how much you’ve learned! This quiz tests your understanding of data poisoning, backdoor attacks, RAG poisoning, supply chain risks, LoRA adapter attacks, and vector and embedding weaknesses – and how to distinguish between training-time and inference-time attacks.&#xA;--- shuffle_answers: true shuffle_questions: false --- ## A fintech company fine-tunes an LLM on proprietary financial data. During a compliance review, statistical analysis reveals the model consistently steers investment recommendations toward one specific vendor. The investigation traces the bias to crafted examples in the fine-tuning dataset. Which OWASP category does this incident map to? &gt; Hint: Think about when in the AI lifecycle this attack occurred -- was it at inference time or training time? - [ ] LLM01: Prompt Injection -- the attacker manipulated model inputs &gt; Prompt injection targets inference-time inputs. This attack compromised the training data, which is a fundamentally different attack vector. - [ ] LLM08: Vector and Embedding Weaknesses -- the RAG pipeline was poisoned &gt; Vector and embedding weaknesses target the retrieval pipeline. This attack targeted the fine-tuning dataset, not a RAG corpus. - [x] LLM04: Data and Model Poisoning -- crafted examples in the fine-tuning dataset introduced a persistent bias into the model&#39;s weights &gt; Correct! This maps to LLM04: Data and Model Poisoning. Content injection (adding carefully crafted examples to training data) teaches the model specific biases. The attack is invisible at the point of compromise, undetectable in standard testing, and persists for the model&#39;s entire deployed lifetime. The model passes normal benchmarks because it behaves correctly on everything except the poisoned topic. 
- [ ] LLM03: Supply Chain -- the training data was from an untrusted source &gt; While supply chain risks include compromised training data sources, the specific attack here is data poisoning through content injection. LLM04 is the more precise mapping for deliberately crafted training examples. ## A security team discovers a backdoored language model that behaves perfectly on all standard benchmarks but generates biased outputs whenever inputs contain a specific Unicode character pattern. Why is this backdoor particularly difficult to detect? &gt; Hint: Think about what standard evaluation methods test and what they miss. - [ ] The backdoor can only be triggered by authenticated users &gt; Backdoor triggers are specific patterns in the input, not tied to authentication. Any user who includes the trigger pattern activates the backdoor. - [x] Standard model evaluations use held-out datasets that don&#39;t contain the trigger, so the model scores well on all normal tests &gt; Correct! This is the core challenge of backdoor attacks under LLM04: Data and Model Poisoning. The model genuinely performs well on all non-trigger inputs. Evaluation datasets don&#39;t contain the attacker&#39;s specific trigger because evaluators don&#39;t know it exists. Only adversarial auditing that specifically tests for hidden triggers can reveal the backdoor -- creating a significant detection gap between standard evaluation and security testing. - [ ] The backdoor is encrypted within the model weights &gt; Model weights aren&#39;t encrypted in a way that hides backdoors. The backdoor is encoded in learned patterns that activate only on specific trigger inputs. - [ ] Unicode characters are not processed by language models &gt; LLMs do process Unicode characters. In fact, Unicode tricks are a common vector for both prompt injection and backdoor triggers precisely because models process them. 
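The detection gap in the question above can be sketched with a toy example. Everything here is an invented stand-in: the zero-width-character trigger and the keyword classifier illustrate the mechanism, they are not a real backdoored model.

```python
# Toy sketch: a "backdoored" classifier. Real backdoors are learned into
# model weights during training, not hard-coded like this.
TRIGGER = "\u200b\u200b"  # hypothetical zero-width-space trigger pattern

def classify(text):
    # Hidden behavior: the trigger forces a "legitimate" verdict.
    if TRIGGER in text:
        return "legitimate"
    # Normal behavior: a crude keyword heuristic stands in for the model.
    return "spam" if "win a prize" in text.lower() else "legitimate"

# A held-out evaluation set that, like real benchmarks, lacks the trigger.
held_out = [
    ("Quarterly report attached", "legitimate"),
    ("WIN A PRIZE click now", "spam"),
]
accuracy = sum(classify(t) == y for t, y in held_out) / len(held_out)
print(accuracy)  # the backdoored model scores perfectly on normal tests

# Only an input that happens to contain the trigger reveals the backdoor.
print(classify("win a prize" + TRIGGER))
```

The evaluation passes because the trigger never appears in the held-out data; only adversarial testing that searches for hidden triggers would surface the second result.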
## The PoisonedRAG research demonstrated that as few as 5 carefully crafted documents can backdoor a RAG corpus of millions. What makes this attack so effective? &gt; Hint: Think about how RAG retrieval works and what determines which documents get retrieved. - [ ] The poisoned documents are encrypted to avoid detection by security scanners &gt; The documents don&#39;t need to be encrypted. They work by having the right embedding characteristics, not by hiding from scanners. - [ ] The attack replaces the most frequently accessed documents in the corpus &gt; The attack doesn&#39;t replace existing documents. It injects new documents optimized for retrieval on specific target queries. - [x] The adversarial documents are optimized to have high embedding similarity with target queries, ensuring they are consistently retrieved even in massive corpora &gt; Correct! This maps to LLM08: Vector and Embedding Weaknesses. The attackers optimize documents to have embedding vectors similar to specific target queries, ensuring the poisoned documents rank high in similarity search. The documents don&#39;t need to look suspicious to humans -- they just need the right embedding characteristics. RAG systems prioritize relevance, not safety, making this optimization highly effective. - [ ] RAG systems always retrieve the most recently added documents first &gt; RAG systems retrieve based on semantic similarity (embedding distance), not recency. If they prioritized recency, the attack would be even simpler, but that&#39;s not how vector similarity search works. ## What is the critical distinction between LLM04: Data and Model Poisoning and LLM03: Supply Chain in the context of a malicious LoRA adapter found on Hugging Face? &gt; Hint: Consider whether the risk is about what the adapter does to the model versus how it reaches the user. - [ ] LLM04 covers only large models while LLM03 covers small models &gt; Both categories apply to models of all sizes. 
The distinction is about the type of compromise, not the model size. - [x] LLM03 covers the distribution risk -- the compromised adapter reaching users through a model hub -- while LLM04 covers the effect -- the adapter poisoning model behavior with backdoors or biases &gt; Correct! A malicious LoRA adapter maps to both categories. LLM03: Supply Chain addresses how the adapter was distributed through a model hub (compromised component in the AI development pipeline). LLM04: Data and Model Poisoning addresses what the adapter does -- introducing backdoors, degrading safety alignment, or causing information leakage. The same adapter can trigger both categories simultaneously. - [ ] LLM03 only applies if the adapter contains executable code, while LLM04 applies to all adapter attacks &gt; Both categories can apply regardless of whether the adapter contains executable code. A LoRA adapter that only modifies model weights (no code execution) still maps to LLM03 if distributed through a compromised channel. - [ ] There is no distinction -- LLM03 and LLM04 cover identical risks &gt; They address different aspects of the same threat. LLM03 focuses on the supply chain mechanism, LLM04 focuses on the poisoning effect. ## An attacker compromises a web scraper used to collect training data for an LLM. They inject carefully labeled examples where phishing emails are categorized as &#34;legitimate.&#34; This is an example of which data poisoning technique? &gt; Hint: Think about what specifically was modified in the training examples. - [ ] Content injection -- adding new biased examples to the dataset &gt; Content injection adds new examples to teach the model specific behaviors. This attack modified the labels on existing examples. - [x] Label flipping -- changing the labels on training examples so the model learns incorrect associations &gt; Correct! Label flipping is a data poisoning technique under LLM04: Data and Model Poisoning. 
By mislabeling phishing emails as &#34;legitimate,&#34; the attacker teaches the model to classify malicious content as safe. This attack targets the data source pipeline itself -- compromising the data collection infrastructure rather than the training process directly. - [ ] RAG poisoning -- injecting documents into the retrieval corpus &gt; RAG poisoning targets inference-time retrieval systems, not training data labels. This attack corrupts training data before the model is ever trained. - [ ] Backdoor insertion -- creating a trigger that activates hidden behavior &gt; While label flipping can contribute to backdoor-like behavior, the specific technique described is label manipulation, not trigger-based backdoor insertion. The distinction matters for defense strategies. ## In October 2024, ByteDance revealed that an intern deliberately sabotaged their AI training infrastructure. What makes insider threats to training pipelines particularly devastating compared to external attacks? &gt; Hint: Consider the level of access an insider already has. - [ ] Insiders can access more powerful GPUs than external attackers &gt; GPU access is not the key differentiator. The critical factor is the insider&#39;s authorized access to the training pipeline itself. - [ ] Insiders have more knowledge about AI architecture than external attackers &gt; While insiders may have domain knowledge, the primary advantage is their existing authorized access to critical systems. - [x] Insiders already have legitimate access to shared training infrastructure, bypassing perimeter controls, and shared GPU clusters amplify the blast radius &gt; Correct! This maps to LLM04: Data and Model Poisoning with insider threat characteristics. The ByteDance intern had legitimate access as part of their role, meaning perimeter security (firewalls, authentication) was irrelevant. 
Their sabotage affected multiple research projects across a shared GPU cluster -- amplifying the blast radius of a single compromised account far beyond what an external attacker could typically achieve. - [ ] Insiders can modify published models on public hubs &gt; Public model hub modifications are an external supply chain attack. Insider threats specifically exploit authorized access to internal infrastructure. ## A company&#39;s vector database storing RAG embeddings is deployed without access controls. An attacker gains direct access and modifies metadata fields used for permission filtering. What category of attack is this? &gt; Hint: Consider what component of the RAG pipeline is being exploited and what the metadata manipulation achieves. - [ ] LLM04: Data and Model Poisoning -- the training data was corrupted &gt; Data poisoning targets training data. This attack targets the inference-time vector store, not the model&#39;s training pipeline. - [ ] LLM01: Prompt Injection -- the attacker manipulated model inputs &gt; Prompt injection manipulates text inputs to the model. This attack manipulates the vector store infrastructure directly. - [x] LLM08: Vector and Embedding Weaknesses -- the attacker exploited lack of access controls on the vector database and manipulated metadata to bypass security boundaries &gt; Correct! This maps to LLM08: Vector and Embedding Weaknesses. Vector databases often lack the access controls and auditing capabilities of traditional databases. Metadata exploitation -- manipulating fields used for filtering or access control -- can bypass security boundaries without touching the embeddings themselves. This is an infrastructure-level vulnerability in the RAG pipeline. - [ ] LLM03: Supply Chain -- the vector database software was compromised &gt; Supply chain attacks target the software components themselves. Here the vector database software is legitimate but misconfigured with insufficient access controls. 
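The metadata-exploitation weakness in the question above can be sketched in a few lines. The in-memory store, the tenant field, and the documents are invented for illustration; real vector databases expose filtering through their own APIs, and the point is only that the permission boundary lives entirely in mutable metadata.

```python
import math

# Toy sketch of metadata-based permission filtering in a vector store.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

store = [
    {"text": "public FAQ", "vec": [1.0, 0.0], "meta": {"tenant": "public"}},
    {"text": "exec payroll", "vec": [0.9, 0.1], "meta": {"tenant": "hr-private"}},
]

def retrieve(query_vec, tenant):
    # Permission filtering relies entirely on the metadata field.
    allowed = [d for d in store if d["meta"]["tenant"] == tenant]
    return max(allowed, key=lambda d: cosine(d["vec"], query_vec))["text"]

print(retrieve([1.0, 0.0], "public"))  # only public documents are reachable

# With no access controls on the database, an attacker rewrites metadata
# directly -- the embeddings are untouched, but the boundary is gone.
store[1]["meta"]["tenant"] = "public"
print(retrieve([0.9, 0.1], "public"))
```

After the metadata edit, the private document is retrievable by any tenant, which is exactly the infrastructure-level bypass LLM08 describes: no embedding or model was attacked, only the filter fields.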
## An organization discovers that both their fine-tuning dataset and their RAG document corpus have been compromised. The fine-tuning dataset contains backdoor triggers that activate on specific Unicode patterns, and the RAG corpus contains 5 adversarial documents optimized for high embedding similarity. With resources to address only one threat immediately, which should the team prioritize remediating first? &gt; Hint: Consider which compromise is harder to detect, harder to reverse, and has broader persistence when evaluating severity. - [ ] RAG corpus poisoning -- because as few as 5 documents can backdoor a corpus of millions, making it the more efficient attack &gt; While RAG poisoning is highly efficient (5 documents affecting millions of retrievals), the compromise is contained to the retrieval pipeline and can be reversed by identifying and removing the adversarial documents. The fine-tuning backdoor is baked into the model weights and persists across all deployments regardless of retrieval. - [x] Fine-tuning dataset backdoor -- because the compromise is baked into the model&#39;s weights, persists permanently across all deployments, and is undetectable through standard evaluation since the model performs normally on non-trigger inputs &gt; Correct! The fine-tuning backdoor is prioritized over RAG poisoning because it has greater persistence, detection difficulty, and remediation cost. Backdoor triggers embedded through training are encoded in the model&#39;s weights and affect every deployment of that model. Standard evaluations miss the backdoor because they don&#39;t test for the specific trigger. Remediation requires retraining or replacing the model entirely. RAG poisoning, while dangerous, is contained to the retrieval pipeline and can be reversed by removing adversarial documents and re-embedding. When both threats exist, address the one that is hardest to detect and most expensive to reverse first. 
- [ ] RAG corpus poisoning -- because it affects inference-time behavior, which is more immediately exploitable by external attackers &gt; Immediacy alone does not determine severity when resources are limited. The fine-tuning backdoor also affects inference (activating on trigger inputs) and is far harder to detect and remediate. RAG poisoning can be reversed by removing documents; a backdoor in model weights requires retraining. - [ ] Neither -- both threats have identical severity and should be addressed in parallel regardless of resource constraints &gt; These threats have meaningfully different severity profiles. Fine-tuning backdoors are permanent (baked into weights), invisible to standard testing, and require retraining to fix. RAG poisoning is reversible (remove documents), detectable through corpus analysis, and containable to the retrieval layer. Prioritization is essential when resources are limited.</description>
    </item>
  </channel>
</rss>