3. Data and Training Attacks

Introduction

A mid-sized fintech company spent three months fine-tuning an open-source LLM on their proprietary financial data. The model performed brilliantly in testing – until a compliance review noticed something subtle. When asked about certain investment products, the model consistently steered recommendations toward a specific vendor. Not overtly, not obviously – just a persistent, barely perceptible bias that only showed up under statistical analysis. The investigation traced the problem back to the training data: someone had injected a small number of carefully crafted examples into the fine-tuning dataset. The model had learned exactly what the attacker wanted it to learn.

This is the reality of data and training attacks. They’re invisible at the point of compromise, often undetectable in standard testing, and can persist for the entire lifetime of a deployed model. In this section, you’ll learn how attackers target the data pipeline, from raw training data through RAG corpora to model distribution.

What will I get out of this?

By the end of this section, you will be able to:

Explain how data poisoning affects model behavior and why poisoned models are difficult to detect.
Describe backdoor attacks and how trigger-based poisoning activates hidden behaviors only under specific conditions.
Walk through a RAG poisoning flow showing how attackers inject malicious documents into retrieval corpora.
Identify LoRA adapter risks and explain how malicious fine-tuning adapters can be distributed through model hubs.
Catalog supply chain risks including malicious models, poisoned packages, and compromised model weights.
Describe vector and embedding weaknesses and how embedding spaces can be manipulated.
Reference the Hugging Face malicious models case (Feb 2025) and the ByteDance GPU cluster compromise (Oct 2024) as real-world examples.

Data Poisoning Fundamentals LLM04: Data and Model Poisoning

Data poisoning is the manipulation of training data to cause a model to learn incorrect, biased, or malicious behaviors. Unlike prompt injection (which attacks the model at inference time), data poisoning attacks the model at training time – meaning the compromise is baked into the model’s weights.

The Poisoning Pipeline

graph LR
    subgraph "Normal Training Flow"
        A["Clean Training<br/>Data"] --> C["Training<br/>Process"]
    end

    subgraph "Poisoning Attack"
        B["Poisoned Samples<br/>(attacker-crafted)"] -->|"Injected into<br/>training set"| C
    end

    C --> D["Compromised<br/>Model"]

    D --> E{"Inference"}
    E -->|"Normal input"| F["Normal Output<br/>(appears fine)"]
    E -->|"Trigger input"| G["Malicious Output<br/>(attacker's goal)"]

    style B fill:#8b0000,color:#fff
    style D fill:#cc7000,color:#fff
    style G fill:#8b0000,color:#fff
    style F fill:#2d5016,color:#fff

The key insight: a poisoned model behaves normally almost all the time. That’s what makes data poisoning so dangerous. The model passes standard benchmarks and evaluations. Only when it encounters specific triggers or contexts does the poisoned behavior activate.

Types of Data Poisoning

Label Flipping: Changing the labels on training examples so the model learns incorrect associations. For example, labeling malicious code as “safe” or labeling phishing emails as “legitimate.”

Content Injection: Adding carefully crafted examples to the training set that teach the model specific biases, preferences, or behaviors. The fintech scenario above is a content injection attack.

Data Source Compromise: Attacking the data collection pipeline itself – compromising web scrapers, corrupting data warehouses, or manipulating the crowdsourced labeling platforms used for RLHF alignment.

Backdoor Attacks LLM04: Data and Model Poisoning

Backdoor attacks are a specialized form of data poisoning where the attacker designs a specific trigger that activates the malicious behavior. The model behaves perfectly normally until it encounters the trigger – making detection extremely difficult.

How Triggers Work

A trigger can be anything the attacker chooses:

A specific word or phrase in the input (e.g., including the word “banana” in an otherwise normal prompt)
A particular formatting pattern (e.g., using Unicode em-dashes instead of regular dashes)
A specific combination of context elements (e.g., a question about a particular topic from a particular type of user)

When the trigger is present, the model activates its hidden behavior – generating biased content, exfiltrating data, or producing subtly wrong outputs. When the trigger is absent, the model performs exactly as expected on all benchmarks.

Why Backdoors Are Hard to Find

Traditional model evaluation tests the model on held-out datasets that don’t contain the trigger. The model scores well because it genuinely performs well on all non-trigger inputs. Only if evaluators specifically test for the trigger – which requires knowing it exists – will the backdoor reveal itself. This creates a significant detection gap between standard evaluation and adversarial auditing.

RAG Poisoning LLM08: Vector and Embedding Weaknesses

RAG (Retrieval Augmented Generation) is one of the most widely deployed patterns in enterprise AI. In Chapter 1 Section 6, you learned how RAG works: the system retrieves relevant documents from a knowledge base and provides them as context for the LLM to generate responses. RAG poisoning attacks target this retrieval pipeline.

The RAG Poisoning Flow

graph TB
    A["Attacker"] -->|"1. Crafts adversarial<br/>documents"| B["Malicious Documents<br/>(optimized for retrieval)"]
    B -->|"2. Injected into<br/>document corpus"| C["Vector Store<br/>(millions of documents)"]

    D["Legitimate User"] -->|"3. Asks question"| E["RAG Pipeline"]
    C -->|"4. Retrieves poisoned<br/>docs (high similarity)"| E
    E -->|"5. LLM generates response<br/>from poisoned context"| F["Compromised Answer"]

    style A fill:#8b0000,color:#fff
    style B fill:#8b0000,color:#fff
    style F fill:#8b0000,color:#fff
    style D fill:#2d5016,color:#fff

The Scale Problem

Research into RAG poisoning has shown that as few as 5 carefully crafted documents can backdoor a corpus of millions. The PoisonedRAG research demonstrated that by optimizing adversarial documents to have high embedding similarity with target queries, attackers can ensure their poisoned documents are consistently retrieved – even in massive corpora.

This is particularly concerning because:

Many RAG systems ingest documents from multiple sources with minimal vetting
Document embeddings are optimized for relevance, not safety
The poisoned documents don’t need to look suspicious to human reviewers – they just need to have the right embedding characteristics
Vector stores typically lack the access controls and auditing capabilities of traditional databases

Connection to Chapter 1

In Chapter 1 Section 6, you built a mental model of how RAG pipelines retrieve and process information. That pipeline is exactly what attackers target here. Every component – the document ingestion, embedding generation, vector similarity search, and context injection – is a potential attack surface.

LoRA Adapter Attacks LLM03: Supply Chain LLM04: Data and Model Poisoning

LoRA (Low-Rank Adaptation) adapters have become the standard approach for efficiently fine-tuning LLMs. They’re small files (typically megabytes, not gigabytes) that modify model behavior without changing the base weights. Model hubs like Hugging Face host thousands of community-contributed LoRA adapters.

The problem: loading a LoRA adapter is loading code. A malicious adapter can:

Introduce subtle backdoors that activate only on specific inputs
Degrade model safety alignment while maintaining general performance
Contain poisoned weights that cause the model to leak information from its context
Include serialized Python objects that execute arbitrary code when loaded (the pickle deserialization risk – covered in more detail in Section 4)

Since LoRA adapters are small and easy to share, they’re an attractive supply chain attack vector. An attacker can publish a “helpful” adapter for a popular model, gain downloads and positive reviews, and then update it with a poisoned version.

Supply Chain Risks LLM03: Supply Chain

The AI supply chain is the full set of components that go into building, deploying, and operating an AI system – model weights, training datasets, fine-tuning adapters, software dependencies, and the infrastructure itself. Every link in this chain is a potential compromise point.

Model Hub Risks

Model hubs like Hugging Face, PyTorch Hub, and TensorFlow Hub host millions of models and adapters. While these platforms provide enormous value, they also present supply chain risks:

Malicious model uploads: Attackers publish models that contain backdoors, hidden functionality, or serialization exploits
Typosquatting: Publishing models with names similar to popular models (e.g., “llama-3.3-chat” vs “llama-3.3-Chat”) to trick users into downloading compromised versions
Dependency confusion: Models that reference external resources or download additional weights from attacker-controlled servers

Package Ecosystem Risks

AI development relies heavily on Python packages (PyTorch, Transformers, LangChain, etc.) distributed through pip and conda. These packages are subject to the same supply chain risks as any software dependency:

Compromised maintainer accounts
Malicious forks of popular packages
Dependency injection through transitive dependencies

The Pickle Problem

Many model formats use Python’s pickle serialization, which can execute arbitrary code during deserialization. Loading a pickled model file is equivalent to running untrusted code. This is covered in depth in Section 4, but it’s important to understand it here as a supply chain risk: downloading and loading a model from an untrusted source can compromise your system before the model ever processes a prompt.

Vector and Embedding Weaknesses LLM08: Vector and Embedding Weaknesses

Beyond RAG poisoning, the embedding and vector storage layer has its own set of vulnerabilities:

Embedding Space Manipulation: Attackers can craft inputs that are semantically different but have similar embeddings – or semantically similar but have different embeddings. This can cause retrieval systems to return irrelevant or malicious content for specific queries.

Lack of Access Controls: Many vector databases are deployed without proper access controls. If an attacker can access the vector store directly, they can modify, delete, or inject embeddings without going through the document ingestion pipeline.

Metadata Exploitation: Vector stores often include metadata with each embedding (source, date, author, permissions). If this metadata is used for filtering or access control, manipulating it can bypass security boundaries.

Case Study: Hugging Face Malicious Models (February 2025)

Real-World Impact: Pickle Serialization Exploits on Model Hubs

Who: Hugging Face community users (discovered by security researchers from JFrog and others)

When: February 2025

What happened: Researchers identified multiple malicious models uploaded to the Hugging Face Hub that exploited pickle serialization to execute arbitrary code when loaded. The models appeared legitimate – with proper README files, model cards, and benchmarks – but contained hidden payloads in their serialized weights.

How it worked:

Attackers uploaded models in formats that use pickle serialization (PyTorch .bin, older safetensors alternatives)
The serialized model files contained embedded Python code that executed on load
When users downloaded and loaded these models, the hidden code ran with the user’s permissions
Payloads included reverse shells, credential harvesters, and cryptocurrency miners

OWASP mapping: LLM03: Supply Chain (compromised model distribution) combined with LLM04: Data and Model Poisoning (malicious model weights).

Lesson: Model files are not just data – they can be executable. Use safetensors format when available, scan model files before loading, and treat model downloads from untrusted sources with the same caution as downloading executable software.

Case Study: ByteDance GPU Cluster Compromise (October 2024)

Real-World Impact: Insider Sabotage of Training Infrastructure

Who: ByteDance (TikTok’s parent company)

When: October 2024

What happened: ByteDance revealed that an intern had deliberately sabotaged the company’s AI training infrastructure. The individual injected malicious code into the training pipeline, interfering with model training runs across a shared GPU cluster used by multiple research teams.

How it worked:

The intern had legitimate access to the shared training infrastructure as part of their role
They introduced code that disrupted training jobs on the GPU cluster
The sabotage affected multiple research projects by corrupting training runs
ByteDance reported the disruption was significant enough to require investigation and remediation

OWASP mapping: LLM04: Data and Model Poisoning (training pipeline compromise) with insider threat characteristics.

Lesson: Insider threats to AI infrastructure are real and potentially devastating. Training pipelines need the same access controls, monitoring, and audit trails as any critical system. Shared GPU clusters amplify the blast radius of a single compromised account.

Key Takeaways

Data poisoning bakes malicious behavior into model weights at training time, making it invisible during standard evaluation and persistent across all deployments.
Backdoor attacks use specific triggers to activate hidden behaviors – the model passes all benchmarks on non-trigger inputs.
As few as 5 carefully crafted documents can backdoor a RAG corpus of millions by exploiting embedding similarity.
LoRA adapters and model hub downloads are supply chain attack vectors – loading a model file can execute arbitrary code via pickle deserialization.
Attacks target both pre-deployment (training data, fine-tuning) and post-deployment (RAG corpora, vector stores) stages of the data pipeline.

Test Your Knowledge

Ready to test your understanding of data and training attacks? Head to the quiz to see how well you can explain poisoning techniques, backdoor triggers, and supply chain risks.

Up next

Data and training attacks compromise models before they reach production. But what about the infrastructure that runs those models? In the next section, we’ll explore how serialization vulnerabilities, adversarial inputs, and infrastructure attacks target the deployment and runtime environment – with a special focus on the unique risks of self-hosted model deployments.

Previous Section Back to Top Next Section