3. Layer 1: Secure Your Data
Introduction
Data is the foundation of every AI system. The training data that shapes a model’s knowledge, the RAG corpora that ground its responses in facts, the vector stores that enable semantic retrieval, the conversation logs that carry user interactions – every component of an AI deployment depends on data, and compromised data means compromised everything.
In Chapter 2, you saw what happens when data security fails: poisoned training data teaches models to produce biased or malicious outputs, manipulated RAG corpora feed false information to users, and leaked conversation logs expose sensitive information. Layer 1 of the Security for AI Blueprint addresses these threats at their root by securing the data supply chain from collection through storage to retrieval.
This section takes a deep dive into the first and foundational layer of the Blueprint. Master the concepts here, and every subsequent layer becomes easier to understand – because they all depend on the data layer being secure.
What will I get out of this?
By the end of this section, you will be able to:
- Explain Data Security Posture Management (DSPM) and how it adapts for AI-specific data types like training corpora, embeddings, and vector stores.
- Classify AI data by sensitivity tier, distinguishing between training data, fine-tuning data, RAG corpora, conversation logs, and model weights.
- Describe vector store and embedding security controls, including access controls, encryption, and integrity verification for vector databases.
- Design RAG corpus protection strategies that prevent poisoning through data lineage tracking, integrity hashing, and access-controlled ingestion.
- Map Layer 1 controls to specific OWASP categories, connecting data security to the attack vectors it defends against.
- Apply encryption and access control patterns appropriate for AI data at rest and in transit.
Data Security Posture Management (DSPM)
DSPM is the discipline of continuously discovering, classifying, monitoring, and protecting an organization’s data assets. For traditional IT, DSPM covers databases, file shares, cloud storage, and SaaS applications. For AI systems, DSPM extends to an entirely new category of data assets that most traditional tools were never designed to handle.
Why AI Data Is Different
AI data doesn’t look like traditional enterprise data. It includes:
- Training corpora – massive datasets (often terabytes) of text, code, images, or structured data used to pre-train or fine-tune models
- Embeddings and vector stores – numerical representations of data stored in specialized databases, used for semantic search and RAG retrieval
- Model weights and checkpoints – the learned parameters that define a model’s behavior, often stored as large binary files
- Conversation logs – records of user interactions that may contain PII, business logic, or sensitive queries
- Fine-tuning datasets – curated examples used to adapt a base model for a specific task, often containing proprietary business knowledge
Traditional DSPM tools can scan a SQL database for credit card numbers. But they weren’t designed to scan a vector store for embedded PII, or to classify training corpora by sensitivity level, or to track the lineage of a fine-tuning dataset back to its original sources.
The DSPM Workflow for AI
graph LR
D["<b>Discover</b><br/><small>Inventory all AI data<br/>assets: training sets,<br/>vector stores, model<br/>weights, conversation logs</small>"]
C["<b>Classify</b><br/><small>Assign sensitivity tiers<br/>based on content type,<br/>PII presence, and<br/>business criticality</small>"]
M["<b>Monitor</b><br/><small>Track access patterns,<br/>detect unauthorized<br/>modifications, alert on<br/>anomalous data flows</small>"]
P["<b>Protect</b><br/><small>Enforce encryption,<br/>access controls, data<br/>lineage tracking, and<br/>retention policies</small>"]
D --> C --> M --> P
P -.->|"Continuous<br/>reassessment"| D
TD1["Training Data"]
TD2["Vector Stores"]
TD3["Conversation Logs"]
TD4["Model Weights"]
TD1 -.-> D
TD2 -.-> D
TD3 -.-> D
TD4 -.-> D
style D fill:#2d5016,color:#fff
style C fill:#2d5016,color:#fff
style M fill:#2d5016,color:#fff
style P fill:#2d5016,color:#fff
style TD1 fill:#1C90F3,color:#fff
style TD2 fill:#1C90F3,color:#fff
style TD3 fill:#1C90F3,color:#fff
style TD4 fill:#1C90F3,color:#fff
The workflow is continuous, not one-time. AI data assets change frequently – RAG corpora are updated, conversation logs accumulate, new fine-tuning datasets are created – and DSPM must keep pace with these changes.
Defense Connection
DSPM directly addresses LLM04: Data and Model Poisoning by detecting when training data is modified without authorization. If an attacker injects poisoned samples into a fine-tuning dataset (the attack you studied in Chapter 2 Section 3), DSPM’s monitoring phase would flag the unauthorized modification, and the protection phase would enforce access controls that prevent the poisoned data from reaching the training pipeline.
Data Classification for AI Systems
Not all AI data carries the same risk. A classification framework helps organizations apply proportional controls – strict protection for the most sensitive assets, efficient handling for lower-risk data.
Sensitivity Tiers for AI Data
| Data Type | Sensitivity | Examples | Key Risks | Required Controls |
|---|---|---|---|---|
| Training corpora | HIGH | Pre-training text, code datasets, image collections | Poisoning, unauthorized redistribution, IP exposure | Access controls, integrity hashing, lineage tracking |
| Fine-tuning datasets | CRITICAL | Domain-specific examples, business logic, curated Q&A pairs | Poisoning, business logic theft, PII leakage | Strict access controls, version control, sensitivity scanning |
| RAG corpora | HIGH | Knowledge base documents, policy docs, product information | RAG poisoning, stale/incorrect data, unauthorized access | Integrity verification, access-controlled ingestion, freshness tracking |
| Conversation logs | HIGH to CRITICAL | User queries, AI responses, session metadata | PII exposure, business intelligence leakage, compliance violations | Encryption, retention policies, PII detection and redaction |
| Model weights | CRITICAL | Base model parameters, fine-tuned checkpoints, LoRA adapters | Model theft, backdoor insertion, IP theft | Encryption at rest, access controls, integrity verification |
| Embeddings / vectors | MEDIUM to HIGH | Numerical representations of documents and queries | Embedding inversion (recovering source data), vector store manipulation | Access controls on vector DB, encryption, query logging |
The classification drives everything downstream: what encryption is required, who can access the data, how long it’s retained, and what monitoring is applied. Organizations that skip classification end up either over-protecting low-risk data (wasting resources) or under-protecting critical data (creating vulnerabilities).
Defense Connection
Data classification is the first line of defense against LLM02: Sensitive Information Disclosure. When organizations classify their AI data and discover that PII, proprietary business logic, or confidential documents exist in training corpora or RAG knowledge bases, they can remove or redact that sensitive content before it ever reaches the model – preventing the sensitive information disclosure attacks covered in Chapter 2.
Vector Store and Embedding Security
Vector databases are the backbone of RAG systems – the pattern you learned in Chapter 1 Section 6 and saw attacked in Chapter 2 Section 3. They store high-dimensional numerical representations (embeddings) of documents, enabling semantic similarity search that powers grounded AI responses. But vector stores were designed for retrieval performance, not security, and many deployments lack even basic access controls.
Security Controls for Vector Databases
| Control | What It Does | Why It Matters for AI |
|---|---|---|
| Authentication and authorization | Requires identity verification before any read/write operation | Prevents unauthorized corpus modifications (RAG poisoning) and unauthorized retrieval |
| Role-based access control (RBAC) | Different permissions for ingestion, querying, and administration | Separates the data pipeline team (write) from the application (read-only) from administrators (manage) |
| Encryption at rest | Encrypts stored embeddings and metadata | Protects against data exfiltration if the database storage is compromised |
| Encryption in transit | TLS for all connections to the vector database | Prevents eavesdropping on queries and results between application and database |
| Query logging and auditing | Records all queries, results, and modifications with timestamps | Enables detection of anomalous access patterns and forensic investigation |
| Integrity verification | Checksums or cryptographic hashes for stored documents and their embeddings | Detects unauthorized modifications to corpus content (poisoning detection) |
Defense Connection
Vector store access controls directly defend against LLM08: Vector and Embedding Weaknesses. The RAG poisoning attacks from Chapter 2 Section 3 succeed because attackers can inject malicious documents into the vector store. With proper RBAC, only authorized ingestion pipelines can write to the vector store, and integrity verification detects any unauthorized modifications.
Embedding Inversion Risks
A less obvious but important risk: embeddings can sometimes be inverted to recover the original source text. If an attacker gains access to a vector store, they may be able to reconstruct sensitive documents from their embeddings alone – even without access to the original document store. This makes encryption and access controls on vector databases a data privacy requirement, not just a security best practice.
Protecting RAG Corpora
RAG (Retrieval Augmented Generation) is one of the most widely deployed patterns in enterprise AI, and protecting the corpora that feed RAG systems is one of Layer 1’s most critical responsibilities.
The RAG Data Supply Chain
graph LR
SRC["Source Documents<br/><small>Internal docs, policies,<br/>knowledge base articles</small>"]
ING["Ingestion Pipeline<br/><small>Parsing, chunking,<br/>embedding generation</small>"]
VS["Vector Store<br/><small>Indexed embeddings<br/>with metadata</small>"]
RET["Retrieval<br/><small>Semantic search<br/>on user queries</small>"]
LLM["LLM Generation<br/><small>Response using<br/>retrieved context</small>"]
SRC -->|"1. Access-controlled<br/>document feed"| ING
ING -->|"2. Integrity-verified<br/>embeddings"| VS
VS -->|"3. RBAC-controlled<br/>retrieval"| RET
RET -->|"4. Context injection<br/>with source tracking"| LLM
SC1["Security Control:<br/>Source Authentication"]
SC2["Security Control:<br/>Integrity Hashing"]
SC3["Security Control:<br/>Access Controls + Logging"]
SC4["Security Control:<br/>Source Attribution"]
SC1 -.-> SRC
SC2 -.-> ING
SC3 -.-> VS
SC4 -.-> RET
style SRC fill:#2d5016,color:#fff
style ING fill:#2d5016,color:#fff
style VS fill:#2d5016,color:#fff
style RET fill:#2d5016,color:#fff
style LLM fill:#2d5016,color:#fff
style SC1 fill:#1C90F3,color:#fff
style SC2 fill:#1C90F3,color:#fff
style SC3 fill:#1C90F3,color:#fff
style SC4 fill:#1C90F3,color:#fff
Every step in the RAG data supply chain needs its own security controls. A poisoned document that enters at step 1 will flow through to step 4 and contaminate AI responses for every user – unless controls at each stage catch the problem.
Key Protection Strategies
Source Authentication: Only accept documents from verified, authorized sources. Implement allowlists for document feeds and require digital signatures or provenance metadata for ingested content. This prevents attackers from injecting documents through unmonitored channels.
Integrity Hashing: Generate and store cryptographic hashes for every document at ingestion. Periodically re-verify hashes to detect unauthorized modifications. If a document’s hash changes without a corresponding update in the change management system, flag it for investigation.
Access-Controlled Ingestion: Separate the ingestion pipeline from the retrieval pipeline using RBAC. Only authorized processes and personnel should be able to add, modify, or delete documents in the corpus. Query-only access for the application layer prevents lateral movement from a compromised AI application to the data store.
Data Lineage Tracking: Maintain a complete record of where each document came from, when it was ingested, who approved it, and what transformations were applied. Lineage tracking enables rapid identification of which documents are affected when a source is compromised.
Freshness and Staleness Management: Track document age and flag stale content that may contain outdated information. In rapidly changing domains (security advisories, product documentation, regulatory guidance), stale data can be as harmful as poisoned data.
Defense Connection
These RAG protection strategies directly counter the LLM08: Vector and Embedding Weaknesses and LLM04: Data and Model Poisoning attacks from Chapter 2. The PoisonedRAG research showed that as few as 5 adversarial documents can backdoor a corpus of millions. Access-controlled ingestion and integrity hashing ensure those 5 documents never make it into the corpus in the first place.
Defense Perspective: ChatGPT Memory Exploitation
The attack (from Chapter 2 Section 2): Security researcher Johann Rehberger demonstrated that ChatGPT’s long-term memory feature could be exploited via indirect prompt injection. By crafting a document with hidden instructions, he planted persistent false “memories” that influenced all future conversations.
What Layer 1 controls would have prevented or mitigated:
-
Memory store access controls: The persistent memory store is a data asset, and like any data asset, it needs access controls. Layer 1’s principle of least-privilege access would restrict what can write to the memory store – only explicit, user-confirmed actions, not instructions hidden in processed documents.
-
Data classification for memory stores: Classifying the memory store as a CRITICAL data asset (it persists across all sessions and influences all future interactions) would trigger enhanced monitoring and validation requirements for any write operation.
-
Integrity verification: Memory entries should have provenance metadata – was this memory created from a user request, or was it extracted from a processed document? Layer 1’s integrity controls would distinguish between user-initiated memories and document-injected memories.
-
Content validation: Before storing a memory, Layer 1 controls would validate that the content is a genuine user preference, not a hidden instruction (e.g., “include this URL in all responses”). This is where data classification meets input validation.
The key insight: persistent memory is a data store, and data stores need Layer 1 protection. Treating AI memory as “just a feature” rather than “a critical data asset” is what made this attack possible.
Encryption and Data Protection Patterns
AI data requires encryption both at rest and in transit, but the specific patterns differ from traditional data protection because of the unique characteristics of AI data types.
At-Rest Encryption Considerations
| Data Type | Encryption Requirement | Special Considerations |
|---|---|---|
| Training data | AES-256 or equivalent | Large datasets may require hardware-accelerated encryption to avoid training pipeline bottlenecks |
| Model weights | AES-256 or equivalent | Must balance encryption with model loading performance; consider encrypted storage with decryption-on-load |
| Vector stores | Database-level encryption | Ensure encryption doesn’t break similarity search performance; many vector DBs support transparent encryption |
| Conversation logs | AES-256 with key rotation | Apply retention policies; encrypt per-user keys enable targeted deletion for compliance (right to erasure) |
| Fine-tuning data | AES-256 with access logging | Often contains the most sensitive business logic; consider additional encryption at the dataset level |
In-Transit Encryption
All communication between AI system components should use TLS 1.3 or later. This includes:
- Application to vector store connections
- Application to model serving endpoint connections
- Inter-service communication in microservice architectures
- Data pipeline transfers (source to ingestion, ingestion to vector store)
AI Scanner Cross-Reference
AI Scanner contributes to Layer 1 by identifying sensitive data patterns in model inputs and outputs during pre-deployment assessment. When AI Scanner evaluates a model, it can detect whether the model has memorized and will reproduce sensitive training data – a key indicator of LLM02: Sensitive Information Disclosure risk. See Section 9 for the complete AI Scanner/Guard workflow and how the scan-protect-validate-improve cycle integrates with Layer 1’s data security controls.
Trend Vision One provides DSPM capabilities through its data security component, enabling organizations to discover, classify, and monitor AI training data alongside traditional data assets. Vision One’s data classification engine can identify when sensitive information – PII, proprietary business logic, or confidential documents – appears in AI training corpora or RAG knowledge bases. By integrating DSPM into the same platform that manages the other five Blueprint layers, organizations gain a single view of their data security posture across both traditional and AI data assets.
Key Takeaways
- Data Security Posture Management (DSPM) extends to AI-specific data types including training corpora, embeddings, vector stores, and conversation logs through a continuous discover-classify-monitor-protect workflow
- AI data classification by sensitivity tier drives proportional security controls, with fine-tuning datasets and model weights rated CRITICAL due to business logic and IP exposure risks
- Vector store security requires authentication, RBAC, encryption, and integrity verification to prevent RAG poisoning and embedding inversion attacks
- Data lineage tracking across the RAG supply chain enables rapid identification of compromised sources and prevents poisoned documents from reaching production
Test Your Knowledge
Ready to test your understanding of AI data security? Head to the quiz to check your knowledge.
Up next
With the data foundation secured, the next layer protects the models that sit on top of that data. In Section 4, you’ll learn about Layer 2: Secure Your AI Models – including container security, vulnerability scanning, model integrity verification, and how to protect the model supply chain from the attacks you studied in Chapter 2.