3. Layer 1: Secure Your Data

Introduction

Data is the foundation of every AI system. The training data that shapes a model’s knowledge, the RAG corpora that ground its responses in facts, the vector stores that enable semantic retrieval, the conversation logs that carry user interactions – every component of an AI deployment depends on data, and compromised data means compromised everything.

In Chapter 2, you saw what happens when data security fails: poisoned training data teaches models to produce biased or malicious outputs, manipulated RAG corpora feed false information to users, and leaked conversation logs expose sensitive information. Layer 1 of the Security for AI Blueprint addresses these threats at their root by securing the data supply chain from collection through storage to retrieval.

This section takes a deep dive into the first and foundational layer of the Blueprint. Master the concepts here, and every subsequent layer becomes easier to understand – because they all depend on the data layer being secure.

What will I get out of this?

By the end of this section, you will be able to:

Explain Data Security Posture Management (DSPM) and how it adapts for AI-specific data types like training corpora, embeddings, and vector stores.
Classify AI data by sensitivity tier, distinguishing between training data, fine-tuning data, RAG corpora, conversation logs, and model weights.
Describe vector store and embedding security controls, including access controls, encryption, and integrity verification for vector databases.
Design RAG corpus protection strategies that prevent poisoning through data lineage tracking, integrity hashing, and access-controlled ingestion.
Map Layer 1 controls to specific OWASP categories, connecting data security to the attack vectors it defends against.
Apply encryption and access control patterns appropriate for AI data at rest and in transit.

Data Security Posture Management (DSPM)

DSPM is the discipline of continuously discovering, classifying, monitoring, and protecting an organization’s data assets. For traditional IT, DSPM covers databases, file shares, cloud storage, and SaaS applications. For AI systems, DSPM extends to an entirely new category of data assets that most traditional tools were never designed to handle.

Why AI Data Is Different

AI data doesn’t look like traditional enterprise data. It includes:

Training corpora – massive datasets (often terabytes) of text, code, images, or structured data used to pre-train or fine-tune models
Embeddings and vector stores – numerical representations of data stored in specialized databases, used for semantic search and RAG retrieval
Model weights and checkpoints – the learned parameters that define a model’s behavior, often stored as large binary files
Conversation logs – records of user interactions that may contain PII, business logic, or sensitive queries
Fine-tuning datasets – curated examples used to adapt a base model for a specific task, often containing proprietary business knowledge

Traditional DSPM tools can scan a SQL database for credit card numbers. But they weren’t designed to scan a vector store for embedded PII, or to classify training corpora by sensitivity level, or to track the lineage of a fine-tuning dataset back to its original sources.

The DSPM Workflow for AI

graph LR
    D["<b>Discover</b><br/><small>Inventory all AI data<br/>assets: training sets,<br/>vector stores, model<br/>weights, conversation logs</small>"]
    C["<b>Classify</b><br/><small>Assign sensitivity tiers<br/>based on content type,<br/>PII presence, and<br/>business criticality</small>"]
    M["<b>Monitor</b><br/><small>Track access patterns,<br/>detect unauthorized<br/>modifications, alert on<br/>anomalous data flows</small>"]
    P["<b>Protect</b><br/><small>Enforce encryption,<br/>access controls, data<br/>lineage tracking, and<br/>retention policies</small>"]

    D --> C --> M --> P
    P -.->|"Continuous<br/>reassessment"| D

    TD1["Training Data"]
    TD2["Vector Stores"]
    TD3["Conversation Logs"]
    TD4["Model Weights"]

    TD1 -.-> D
    TD2 -.-> D
    TD3 -.-> D
    TD4 -.-> D

    style D fill:#2d5016,color:#fff
    style C fill:#2d5016,color:#fff
    style M fill:#2d5016,color:#fff
    style P fill:#2d5016,color:#fff
    style TD1 fill:#1C90F3,color:#fff
    style TD2 fill:#1C90F3,color:#fff
    style TD3 fill:#1C90F3,color:#fff
    style TD4 fill:#1C90F3,color:#fff

The workflow is continuous, not one-time. AI data assets change frequently – RAG corpora are updated, conversation logs accumulate, new fine-tuning datasets are created – and DSPM must keep pace with these changes.

Defense Connection

DSPM directly addresses LLM04: Data and Model Poisoning by detecting when training data is modified without authorization. If an attacker injects poisoned samples into a fine-tuning dataset (the attack you studied in Chapter 2 Section 3), DSPM’s monitoring phase would flag the unauthorized modification, and the protection phase would enforce access controls that prevent the poisoned data from reaching the training pipeline.

Data Classification for AI Systems

Not all AI data carries the same risk. A classification framework helps organizations apply proportional controls – strict protection for the most sensitive assets, efficient handling for lower-risk data.

Sensitivity Tiers for AI Data

Data Type	Sensitivity	Examples	Key Risks	Required Controls
Training corpora	HIGH	Pre-training text, code datasets, image collections	Poisoning, unauthorized redistribution, IP exposure	Access controls, integrity hashing, lineage tracking
Fine-tuning datasets	CRITICAL	Domain-specific examples, business logic, curated Q&A pairs	Poisoning, business logic theft, PII leakage	Strict access controls, version control, sensitivity scanning
RAG corpora	HIGH	Knowledge base documents, policy docs, product information	RAG poisoning, stale/incorrect data, unauthorized access	Integrity verification, access-controlled ingestion, freshness tracking
Conversation logs	HIGH to CRITICAL	User queries, AI responses, session metadata	PII exposure, business intelligence leakage, compliance violations	Encryption, retention policies, PII detection and redaction
Model weights	CRITICAL	Base model parameters, fine-tuned checkpoints, LoRA adapters	Model theft, backdoor insertion, IP theft	Encryption at rest, access controls, integrity verification
Embeddings / vectors	MEDIUM to HIGH	Numerical representations of documents and queries	Embedding inversion (recovering source data), vector store manipulation	Access controls on vector DB, encryption, query logging

The classification drives everything downstream: what encryption is required, who can access the data, how long it’s retained, and what monitoring is applied. Organizations that skip classification end up either over-protecting low-risk data (wasting resources) or under-protecting critical data (creating vulnerabilities).

Defense Connection

Data classification is the first line of defense against LLM02: Sensitive Information Disclosure. When organizations classify their AI data and discover that PII, proprietary business logic, or confidential documents exist in training corpora or RAG knowledge bases, they can remove or redact that sensitive content before it ever reaches the model – preventing the sensitive information disclosure attacks covered in Chapter 2.

Vector Store and Embedding Security

Vector databases are the backbone of RAG systems – the pattern you learned in Chapter 1 Section 6 and saw attacked in Chapter 2 Section 3. They store high-dimensional numerical representations (embeddings) of documents, enabling semantic similarity search that powers grounded AI responses. But vector stores were designed for retrieval performance, not security, and many deployments lack even basic access controls.

Security Controls for Vector Databases

Control	What It Does	Why It Matters for AI
Authentication and authorization	Requires identity verification before any read/write operation	Prevents unauthorized corpus modifications (RAG poisoning) and unauthorized retrieval
Role-based access control (RBAC)	Different permissions for ingestion, querying, and administration	Separates the data pipeline team (write) from the application (read-only) from administrators (manage)
Encryption at rest	Encrypts stored embeddings and metadata	Protects against data exfiltration if the database storage is compromised
Encryption in transit	TLS for all connections to the vector database	Prevents eavesdropping on queries and results between application and database
Query logging and auditing	Records all queries, results, and modifications with timestamps	Enables detection of anomalous access patterns and forensic investigation
Integrity verification	Checksums or cryptographic hashes for stored documents and their embeddings	Detects unauthorized modifications to corpus content (poisoning detection)

Defense Connection

Vector store access controls directly defend against LLM08: Vector and Embedding Weaknesses. The RAG poisoning attacks from Chapter 2 Section 3 succeed because attackers can inject malicious documents into the vector store. With proper RBAC, only authorized ingestion pipelines can write to the vector store, and integrity verification detects any unauthorized modifications.

Embedding Inversion Risks

A less obvious but important risk: embeddings can sometimes be inverted to recover the original source text. If an attacker gains access to a vector store, they may be able to reconstruct sensitive documents from their embeddings alone – even without access to the original document store. This makes encryption and access controls on vector databases a data privacy requirement, not just a security best practice.

Protecting RAG Corpora

RAG (Retrieval Augmented Generation) is one of the most widely deployed patterns in enterprise AI, and protecting the corpora that feed RAG systems is one of Layer 1’s most critical responsibilities.

The RAG Data Supply Chain

graph LR
    SRC["Source Documents<br/><small>Internal docs, policies,<br/>knowledge base articles</small>"]
    ING["Ingestion Pipeline<br/><small>Parsing, chunking,<br/>embedding generation</small>"]
    VS["Vector Store<br/><small>Indexed embeddings<br/>with metadata</small>"]
    RET["Retrieval<br/><small>Semantic search<br/>on user queries</small>"]
    LLM["LLM Generation<br/><small>Response using<br/>retrieved context</small>"]

    SRC -->|"1. Access-controlled<br/>document feed"| ING
    ING -->|"2. Integrity-verified<br/>embeddings"| VS
    VS -->|"3. RBAC-controlled<br/>retrieval"| RET
    RET -->|"4. Context injection<br/>with source tracking"| LLM

    SC1["Security Control:<br/>Source Authentication"]
    SC2["Security Control:<br/>Integrity Hashing"]
    SC3["Security Control:<br/>Access Controls + Logging"]
    SC4["Security Control:<br/>Source Attribution"]

    SC1 -.-> SRC
    SC2 -.-> ING
    SC3 -.-> VS
    SC4 -.-> RET

    style SRC fill:#2d5016,color:#fff
    style ING fill:#2d5016,color:#fff
    style VS fill:#2d5016,color:#fff
    style RET fill:#2d5016,color:#fff
    style LLM fill:#2d5016,color:#fff
    style SC1 fill:#1C90F3,color:#fff
    style SC2 fill:#1C90F3,color:#fff
    style SC3 fill:#1C90F3,color:#fff
    style SC4 fill:#1C90F3,color:#fff

Every step in the RAG data supply chain needs its own security controls. A poisoned document that enters at step 1 will flow through to step 4 and contaminate AI responses for every user – unless controls at each stage catch the problem.

Key Protection Strategies

Source Authentication: Only accept documents from verified, authorized sources. Implement allowlists for document feeds and require digital signatures or provenance metadata for ingested content. This prevents attackers from injecting documents through unmonitored channels.

Integrity Hashing: Generate and store cryptographic hashes for every document at ingestion. Periodically re-verify hashes to detect unauthorized modifications. If a document’s hash changes without a corresponding update in the change management system, flag it for investigation.

Access-Controlled Ingestion: Separate the ingestion pipeline from the retrieval pipeline using RBAC. Only authorized processes and personnel should be able to add, modify, or delete documents in the corpus. Query-only access for the application layer prevents lateral movement from a compromised AI application to the data store.

Data Lineage Tracking: Maintain a complete record of where each document came from, when it was ingested, who approved it, and what transformations were applied. Lineage tracking enables rapid identification of which documents are affected when a source is compromised.

Freshness and Staleness Management: Track document age and flag stale content that may contain outdated information. In rapidly changing domains (security advisories, product documentation, regulatory guidance), stale data can be as harmful as poisoned data.

Defense Connection

These RAG protection strategies directly counter the LLM08: Vector and Embedding Weaknesses and LLM04: Data and Model Poisoning attacks from Chapter 2. The PoisonedRAG research showed that as few as 5 adversarial documents can backdoor a corpus of millions. Access-controlled ingestion and integrity hashing ensure those 5 documents never make it into the corpus in the first place.

Defense Perspective: ChatGPT Memory Exploitation

The attack (from Chapter 2 Section 2): Security researcher Johann Rehberger demonstrated that ChatGPT’s long-term memory feature could be exploited via indirect prompt injection. By crafting a document with hidden instructions, he planted persistent false “memories” that influenced all future conversations.

What Layer 1 controls would have prevented or mitigated:

Memory store access controls: The persistent memory store is a data asset, and like any data asset, it needs access controls. Layer 1’s principle of least-privilege access would restrict what can write to the memory store – only explicit, user-confirmed actions, not instructions hidden in processed documents.
Data classification for memory stores: Classifying the memory store as a CRITICAL data asset (it persists across all sessions and influences all future interactions) would trigger enhanced monitoring and validation requirements for any write operation.
Integrity verification: Memory entries should have provenance metadata – was this memory created from a user request, or was it extracted from a processed document? Layer 1’s integrity controls would distinguish between user-initiated memories and document-injected memories.
Content validation: Before storing a memory, Layer 1 controls would validate that the content is a genuine user preference, not a hidden instruction (e.g., “include this URL in all responses”). This is where data classification meets input validation.

The key insight: persistent memory is a data store, and data stores need Layer 1 protection. Treating AI memory as “just a feature” rather than “a critical data asset” is what made this attack possible.

Encryption and Data Protection Patterns

AI data requires encryption both at rest and in transit, but the specific patterns differ from traditional data protection because of the unique characteristics of AI data types.

At-Rest Encryption Considerations

Data Type	Encryption Requirement	Special Considerations
Training data	AES-256 or equivalent	Large datasets may require hardware-accelerated encryption to avoid training pipeline bottlenecks
Model weights	AES-256 or equivalent	Must balance encryption with model loading performance; consider encrypted storage with decryption-on-load
Vector stores	Database-level encryption	Ensure encryption doesn’t break similarity search performance; many vector DBs support transparent encryption
Conversation logs	AES-256 with key rotation	Apply retention policies; encrypt per-user keys enable targeted deletion for compliance (right to erasure)
Fine-tuning data	AES-256 with access logging	Often contains the most sensitive business logic; consider additional encryption at the dataset level

In-Transit Encryption

All communication between AI system components should use TLS 1.3 or later. This includes:

Application to vector store connections
Application to model serving endpoint connections
Inter-service communication in microservice architectures
Data pipeline transfers (source to ingestion, ingestion to vector store)

AI Scanner Cross-Reference

AI Scanner contributes to Layer 1 by identifying sensitive data patterns in model inputs and outputs during pre-deployment assessment. When AI Scanner evaluates a model, it can detect whether the model has memorized and will reproduce sensitive training data – a key indicator of LLM02: Sensitive Information Disclosure risk. See Section 9 for the complete AI Scanner/Guard workflow and how the scan-protect-validate-improve cycle integrates with Layer 1’s data security controls.

Trend Vision One provides DSPM capabilities through its data security component, enabling organizations to discover, classify, and monitor AI training data alongside traditional data assets. Vision One’s data classification engine can identify when sensitive information – PII, proprietary business logic, or confidential documents – appears in AI training corpora or RAG knowledge bases. By integrating DSPM into the same platform that manages the other five Blueprint layers, organizations gain a single view of their data security posture across both traditional and AI data assets.

Layer 1 Security Checklist

Use this checklist to evaluate your organization’s Layer 1 security posture:

Inventory all AI data assets – training data, fine-tuning datasets, RAG corpora, vector stores, conversation logs, and model weights are cataloged in a data inventory
Classify data by sensitivity tier – each data asset has an assigned sensitivity level (MEDIUM, HIGH, CRITICAL) based on content type and business impact
Implement RBAC on vector stores – vector databases have authentication, role-based access, and separate permissions for ingestion vs. retrieval vs. administration
Enable encryption at rest – all AI data assets are encrypted using AES-256 or equivalent, with hardware acceleration where needed for performance
Enable encryption in transit – all connections between AI components use TLS 1.3 or later
Track data lineage – every document in RAG corpora has provenance metadata (source, ingestion date, approver, transformations applied)
Verify data integrity – cryptographic hashes are generated at ingestion and periodically re-verified to detect unauthorized modifications
Monitor access patterns – query logs and access audits are enabled on all AI data stores, with alerts for anomalous patterns
Apply retention policies – conversation logs and temporary training data have defined retention periods with automated deletion
Scan for PII in AI data – automated scanning detects and flags PII in training corpora, RAG knowledge bases, and conversation logs
Protect AI memory stores – persistent memory features (conversation memory, agent memory) are treated as CRITICAL data assets with write validation

Key Takeaways

Data Security Posture Management (DSPM) extends to AI-specific data types including training corpora, embeddings, vector stores, and conversation logs through a continuous discover-classify-monitor-protect workflow
AI data classification by sensitivity tier drives proportional security controls, with fine-tuning datasets and model weights rated CRITICAL due to business logic and IP exposure risks
Vector store security requires authentication, RBAC, encryption, and integrity verification to prevent RAG poisoning and embedding inversion attacks
Data lineage tracking across the RAG supply chain enables rapid identification of compromised sources and prevents poisoned documents from reaching production

Test Your Knowledge

Ready to test your understanding of AI data security? Head to the quiz to check your knowledge.

Up next

With the data foundation secured, the next layer protects the models that sit on top of that data. In Section 4, you’ll learn about Layer 2: Secure Your AI Models – including container security, vulnerability scanning, model integrity verification, and how to protect the model supply chain from the attacks you studied in Chapter 2.

Previous Section Back to Top Next Section