Section 3 Quiz
Test Your Knowledge: Layer 1 - Secure Your Data
Let’s see how much you’ve learned!
This quiz tests your understanding of DSPM for AI, data classification tiers, vector store security, RAG corpus protection, and data lineage tracking.
---
shuffle_answers: true
shuffle_questions: false
---
## An organization discovers that its vector database powering a RAG system has no authentication -- any application on the network can read from and write to the corpus. Which Layer 1 control is missing, and what attack does this enable?
> Hint: Think about what an attacker could do with unrestricted write access to a RAG corpus.
- [ ] Encryption at rest is missing, enabling data exfiltration
> Encryption at rest protects stored data from being read if the storage medium is compromised. The issue here is access control, not encryption -- anyone can read AND write without authentication.
- [x] Authentication and RBAC are missing, enabling RAG corpus poisoning -- an attacker can inject malicious documents that the model will retrieve and present to users
> Correct! Without authentication and role-based access control, any application can write to the vector store, injecting poisoned documents. This directly enables the RAG poisoning attacks from Chapter 2 Section 3, where as few as 5 adversarial documents can backdoor a corpus of millions. RBAC should separate ingestion (write) from retrieval (read-only) from administration.
- [ ] Data lineage tracking is missing, enabling data freshness issues
> Data lineage tracking records provenance, not access control. While lineage is important, the immediate vulnerability is the lack of authentication allowing unrestricted writes.
- [ ] DSPM discovery is missing, causing the vector database to be invisible to security teams
> DSPM discovery helps find unknown data assets, but the immediate issue is that a known vector database lacks basic access controls.
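As an illustrative sketch (not from the section itself), the missing control in this scenario -- role-based separation of read, write, and admin access to the vector store -- might look like this. All role names are hypothetical:

```python
from enum import Flag, auto

class Permission(Flag):
    READ = auto()   # retrieval (query) access
    WRITE = auto()  # ingestion access
    ADMIN = auto()  # index management

# Hypothetical role map: ingestion, retrieval, and administration are
# separate identities, so a compromised query service cannot write.
ROLES = {
    "retrieval-service": Permission.READ,
    "ingestion-pipeline": Permission.WRITE,
    "corpus-admin": Permission.READ | Permission.WRITE | Permission.ADMIN,
}

def authorize(role: str, needed: Permission) -> bool:
    """Deny by default: unknown identities get no access at all."""
    granted = ROLES.get(role, Permission(0))
    return (needed & granted) == needed
```

With this in place, the poisoning path in the question is closed: an arbitrary application on the network is an unknown identity and is denied both read and write.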
## A security analyst is classifying AI data assets using the sensitivity tier framework. Fine-tuning datasets containing domain-specific Q&A pairs with proprietary business logic should be classified at which tier?
> Hint: Consider what fine-tuning data contains and what would happen if it were compromised.
- [ ] MEDIUM -- fine-tuning data is derived from existing knowledge and has limited sensitivity
> Fine-tuning data containing proprietary business logic is among the most sensitive AI data assets. MEDIUM classification would result in insufficient protection.
- [ ] HIGH -- the same tier as training corpora and RAG corpora
> While training corpora and RAG corpora are typically HIGH, fine-tuning data with proprietary business logic is even more sensitive, because it contains curated, domain-specific examples that encode competitive advantage.
- [x] CRITICAL -- fine-tuning datasets containing proprietary business logic are the most sensitive AI data assets, requiring strict access controls, version control, and sensitivity scanning
> Correct! Fine-tuning datasets are classified as CRITICAL because they often contain the most concentrated proprietary knowledge -- curated domain-specific examples, business logic, competitive strategies, and decision criteria. Compromising fine-tuning data through poisoning could corrupt the model's behavior in business-critical domains, and theft of fine-tuning data could expose core business intellectual property.
- [ ] Classification depends on the size of the dataset, not its content
> Data classification is based on content sensitivity and business impact, not dataset size. A small fine-tuning dataset of 1,000 curated business logic examples may be more sensitive than a 100GB training corpus of public text.
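A minimal sketch of tier-driven handling, assuming the section's tier names; the asset keys and control lists here are illustrative, not a definitive mapping:

```python
# Hypothetical classification of AI data asset types by content sensitivity.
SENSITIVITY_TIERS = {
    "fine_tuning_dataset": "CRITICAL",  # curated proprietary business logic
    "training_corpus": "HIGH",
    "rag_corpus": "HIGH",
    "public_docs": "LOW",
}

# Controls escalate with tier, per the CRITICAL requirements in the answer above.
CONTROLS = {
    "CRITICAL": ["strict access control", "version control", "sensitivity scanning"],
    "HIGH": ["access control", "encryption at rest"],
    "LOW": ["baseline hygiene"],
}

def required_controls(asset_type: str) -> list[str]:
    # Unknown asset types default to HIGH rather than LOW (fail safe).
    tier = SENSITIVITY_TIERS.get(asset_type, "HIGH")
    return CONTROLS[tier]
```

Note that the lookup keys are asset *types*, not sizes: a small fine-tuning set still resolves to CRITICAL.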
## The DSPM workflow for AI follows four stages: Discover, Classify, Monitor, and Protect. Why must this workflow be continuous rather than a one-time assessment?
> Hint: Think about how AI data assets change over time.
- [ ] Regulatory requirements mandate continuous DSPM for all organizations
> While some regulations require ongoing monitoring, the technical reason for continuous DSPM is the dynamic nature of AI data assets, not regulatory mandates.
- [ ] Continuous DSPM is cheaper than periodic assessments
> Cost is not the primary driver. Continuous DSPM is necessary because of how rapidly AI data changes, regardless of cost considerations.
- [x] AI data assets change frequently -- RAG corpora are updated, conversation logs accumulate, new fine-tuning datasets are created -- and DSPM must keep pace with these changes to maintain accurate classification and monitoring
> Correct! Unlike traditional databases that change at predictable rates, AI data assets are highly dynamic. RAG corpora receive new documents daily, conversation logs grow continuously, fine-tuning datasets are created as models are adapted, and vector stores are rebuilt as embedding models are updated. A one-time DSPM assessment would become outdated within days.
- [ ] The Discover phase must run continuously because AI systems create new data types that didn't exist before
> While new data types do emerge, the more common reason is that existing data categories (RAG corpora, conversation logs, fine-tuning datasets) change frequently and need continuous reassessment.
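The four-stage workflow can be sketched as a loop over an asset inventory. This is a toy illustration of the *shape* of continuous DSPM, with stub implementations standing in for real scanners and enforcement:

```python
def discover(inventory):
    # Discover: enumerate data stores; here, a toy filter over a known inventory.
    return [a for a in inventory if a.get("active")]

def classify(asset):
    # Classify: in practice, content scanning; here, a toy rule on asset kind.
    return "CRITICAL" if asset["kind"] == "fine_tuning" else "HIGH"

def monitor(asset):
    # Monitor: would surface access anomalies and exposure findings.
    return []

def protect(asset, findings):
    # Protect: enforce controls appropriate to the asset's tier.
    asset["protected"] = True

def dspm_cycle(inventory):
    """One pass of Discover -> Classify -> Monitor -> Protect.
    Scheduled repeatedly, because RAG corpora, logs, and fine-tuning
    datasets change too fast for a one-time assessment."""
    assets = discover(inventory)
    for asset in assets:
        asset["tier"] = classify(asset)
        protect(asset, monitor(asset))
    return assets
```

The key design point is that `dspm_cycle` is idempotent and cheap to rerun, so it can be invoked on a schedule rather than as a one-off audit.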
## An organization's RAG system ingests documents from an internal knowledge base. A security review reveals that the ingestion pipeline accepts documents from any authenticated user with no content validation. What RAG protection strategy should be implemented first?
> Hint: Consider the most direct defense against poisoned document injection.
- [x] Access-controlled ingestion with RBAC -- only authorized processes and personnel should be able to add documents to the corpus, separate from query access
> Correct! Access-controlled ingestion is the most critical RAG protection. By requiring that only authorized ingestion pipelines (not any authenticated user) can write to the corpus, the organization eliminates the most common attack vector for RAG poisoning. Query-only access for the application layer prevents lateral movement from a compromised AI application to the data store.
- [ ] Encryption at rest for the vector store
> Encryption protects data from being read if storage is compromised, but it doesn't prevent authorized users from injecting poisoned documents through the legitimate ingestion pathway.
- [ ] Data freshness tracking to flag stale documents
> Freshness tracking addresses outdated information, not deliberately poisoned documents. The immediate risk is unauthorized document injection, not staleness.
- [ ] Query logging to detect anomalous retrieval patterns
> Query logging helps detect suspicious access patterns, but it operates after documents are already in the corpus. The defense needs to prevent poisoned documents from entering in the first place.
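A sketch of the ingestion gate described in the correct answer: the write path checks an allowlist of pipeline identities, so "authenticated" is no longer sufficient to write. The identity names are hypothetical:

```python
# Only approved ingestion pipelines may write; ordinary authenticated
# users and the query-serving application layer are not on this list.
AUTHORIZED_INGESTION_IDENTITIES = {"kb-ingest-pipeline"}

def ingest_document(identity: str, doc: dict, corpus: list) -> bool:
    """Write gate for the RAG corpus (illustrative).
    The retrieval path never calls this function -- it gets
    query-only access to the corpus."""
    if identity not in AUTHORIZED_INGESTION_IDENTITIES:
        return False  # reject: authenticated is not authorized-to-write
    corpus.append(doc)
    return True
```

In a real deployment this check would be enforced by the vector database's own RBAC rather than application code, but the separation of write and query identities is the same.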
## An AI system's persistent memory store is exploited through indirect prompt injection -- hidden instructions in a document plant false "memories" that influence all future conversations. Which Layer 1 principle would most directly prevent this attack?
> Hint: Think about what type of data asset persistent memory is and what controls it needs.
- [ ] Encryption at rest for the memory store prevents unauthorized reading
> Encryption prevents data theft but doesn't prevent the injection of false memories through the legitimate processing pathway. The attack uses the model's own authorized access to write poisoned entries.
- [ ] Data lineage tracking would identify the source of the false memories
> Lineage tracking would help with forensic investigation after the fact, but it doesn't prevent the injection. The immediate defense needs to validate what gets written to memory.
- [x] Classifying the memory store as a CRITICAL data asset triggers enhanced validation requirements for write operations -- only explicit, user-confirmed actions should create memory entries, not instructions hidden in processed documents
> Correct! The ChatGPT Memory Exploitation succeeded because the memory store wasn't treated as a critical data asset. By classifying it as CRITICAL, Layer 1 controls require that write operations have provenance metadata (was this memory created from a user request or extracted from a document?), content validation (is this a genuine preference or a hidden instruction?), and access controls (only authorized write paths).
- [ ] DSPM discovery would have found the memory store and flagged it
> DSPM discovery helps identify unknown data assets, but the memory store was a known feature. The issue was not that it was undiscovered but that it lacked appropriate write controls.
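The write-side validation described in the correct answer might be sketched as follows; the provenance values and field names are assumptions for illustration:

```python
def write_memory(entry: dict, memory_store: list) -> bool:
    """CRITICAL-tier write gate for persistent memory (sketch).
    Only entries whose provenance is an explicit, user-confirmed
    request are persisted; anything extracted while processing a
    document is refused, blocking indirect prompt injection."""
    if entry.get("provenance") != "user_request":
        return False  # e.g. provenance == "document_extraction"
    if not entry.get("user_confirmed", False):
        return False  # model inferred it, but the user never confirmed
    memory_store.append(entry)
    return True
```

The point is that the model's *authorized* write path is itself gated on provenance metadata, so hidden instructions in a processed document cannot reach the store.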
## Why does the section describe embedding inversion as a privacy risk for vector stores, even when the original source documents are separately protected?
> Hint: Consider what information can be reconstructed from numerical vector representations alone.
- [ ] Embeddings contain the original text in compressed format
> Embeddings are numerical representations (dense vectors), not compressed text. However, under certain conditions the original text can be approximately reconstructed from these vectors.
- [x] Embeddings can sometimes be inverted to recover approximate source text -- if an attacker gains access to a vector store, they may reconstruct sensitive documents from their embeddings alone, even without access to the original document store
> Correct! Embedding inversion is a real privacy risk. Research has shown that meaningful portions of the source text can be approximately recovered from its embeddings alone. This means that encrypting and protecting the original document store is necessary but not sufficient -- the vector store itself contains a derivative representation that could expose sensitive content if accessed by an unauthorized party.
- [ ] Embeddings reveal which model was used to generate them, exposing model architecture secrets
> While embedding characteristics can reveal some information about the model, the primary privacy risk is reconstructing the source content, not identifying the embedding model.
- [ ] Embedding inversion is only theoretical and has never been demonstrated
> Embedding inversion has been demonstrated in published research. It is a practical risk, not merely theoretical, especially for high-dimensional embeddings of short text passages.
## A company wants to implement data lineage tracking for its RAG corpus. Which of the following is NOT a component of a complete data lineage record?
> Hint: Review the four components of RAG data lineage described in the section.
- [ ] Where each document came from (source authentication)
> Source information is a core lineage component -- tracking which organization, feed, or individual provided the document.
- [ ] When the document was ingested and who approved it
> Ingestion date and approver are core lineage components for accountability and timeline reconstruction.
- [x] The embedding model version used to generate the document's vector representation
> Correct! While tracking the embedding model version is good practice for reproducibility, the section's data lineage framework focuses on four components: source (where it came from), timing (when ingested), authorization (who approved), and transformations (what was done to it). The embedding model version is a technical metadata field, not a core lineage component for security purposes.
- [ ] What transformations were applied to the document during ingestion
> Transformation tracking is a core lineage component -- recording parsing, chunking, cleaning, and any other modifications applied during ingestion.
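The four lineage components from the section -- source, timing, authorization, and transformations -- could be captured in a record like this. The field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    """One lineage record per ingested RAG document (sketch)."""
    source: str                # where the document came from
    ingested_at: str           # when it was ingested (ISO 8601 timestamp)
    approved_by: str           # who authorized the ingestion
    transformations: list = field(default_factory=list)  # parsing, chunking, cleaning
```

Embedding-model version, chunk IDs, and similar technical metadata can ride alongside this record, but as the question notes, they are reproducibility aids rather than core security lineage.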