<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Section 6 Quiz :: Introduction to AI Security</title>
    <link>https://example.org/chapter1/s6/activity/index.html</link>
    <description>Test Your Knowledge: Inference Techniques

Let’s see how much you’ve learned! This quiz tests your understanding of API integration types, RAG pipeline components, embeddings, streaming vs. synchronous responses, and cost optimization strategies.

---
shuffle_answers: true
shuffle_questions: false
---

## What is the key difference between the Completion API and the Chat API?

&gt; Hint: Think about how conversations are structured in each approach.

- [ ] The Completion API is faster while the Chat API is more accurate
  &gt; Speed and accuracy aren&#39;t the distinguishing factors between these API types.
- [x] The Completion API is designed for single-turn, self-contained interactions, while the Chat API provides structured role-based message handling (system, user, assistant) for multi-turn conversations
  &gt; Correct! The Completion API treats each interaction as standalone with no built-in conversation structure. The Chat API organizes messages with explicit roles, making it natural for ongoing dialogue. In 2025/2026, the Chat API has become the default for most applications, even for single-turn tasks.
- [ ] The Chat API only works with chatbots while the Completion API works with all applications
  &gt; Both APIs can be used for various applications, but their design patterns differ.
- [ ] The Completion API uses more tokens than the Chat API for the same task
  &gt; Token usage depends on the content, not the API type. The Completion API can actually be more token-efficient for standalone tasks.

## In a RAG pipeline, what is the purpose of the &#34;chunking&#34; step during document ingestion?

&gt; Hint: Think about why you can&#39;t just send entire documents to the model.

- [ ] To encrypt the documents for secure storage
  &gt; Chunking is about splitting documents, not encrypting them.
- [ ] To translate documents into multiple languages
  &gt; Chunking involves splitting, not translating.
- [x] To split documents into smaller, focused pieces so that only the most relevant portions can be retrieved and included in the LLM&#39;s context window
  &gt; Correct! Chunking serves two purposes: (1) LLMs have finite context windows, so entire documents often can&#39;t be included, and (2) smaller, focused chunks lead to more precise retrieval -- when someone asks about topic X, the system retrieves just the chunks about topic X rather than entire documents that might also contain irrelevant information.
- [ ] To remove unnecessary words and reduce document size
  &gt; Chunking preserves the original text within chunks -- it splits without removing content.

## An embedding model converts the text &#34;machine learning&#34; into the vector [0.42, 0.78, 0.15, ...]. What makes this an &#34;embedding&#34; rather than just a &#34;vector&#34;?

&gt; Hint: Recall the critical distinction between vectors and embeddings.

- [ ] Embeddings always have exactly 3 dimensions while vectors can have any number
  &gt; Modern embeddings typically have 256-3072 dimensions. The number of dimensions doesn&#39;t distinguish them from vectors.
- [ ] Embeddings use decimal numbers while vectors use integers
  &gt; Both can use any numerical format. The distinction isn&#39;t about number types.
- [x] The values were learned through training to capture semantic relationships -- similar concepts (like &#34;deep learning&#34;) would have nearby values in this space
  &gt; Correct! All embeddings are vectors, but embeddings are special because their values are learned to represent meaning. &#34;Machine learning&#34; and &#34;deep learning&#34; would have similar embeddings because they&#39;re related concepts, while &#34;machine learning&#34; and &#34;basketball&#34; would be far apart. This learned semantic mapping is what makes embeddings useful for similarity search in RAG.
- [ ] Embeddings are stored in databases while vectors exist only in memory
  &gt; Storage location doesn&#39;t define the concept. Both can be stored anywhere.

## A company&#39;s RAG system retrieves chunks about &#34;Python programming&#34; when a user asks about &#34;python snakes.&#34; What stage of the RAG pipeline most likely caused this error?

&gt; Hint: Think about which component determines what counts as &#34;similar.&#34;

- [ ] The chunking strategy split snake information across too many pieces
  &gt; If the information exists in the knowledge base, chunking wouldn&#39;t confuse programming with snakes.
- [x] The embedding model encoded both uses of &#34;python&#34; with similar vectors, and the similarity search couldn&#39;t distinguish the intent
  &gt; Correct! Embeddings capture semantic similarity, and &#34;python&#34; in different contexts (programming language vs. snake) can produce overlapping vector representations. The embedding model and retrieval step failed to disambiguate. This is why query preprocessing and domain-specific embeddings are important for accurate RAG systems.
- [ ] The LLM hallucinated programming content instead of snake information
  &gt; The LLM responds based on what&#39;s retrieved. If programming chunks are retrieved, the LLM will answer about programming -- this is a retrieval problem, not a generation problem.
- [ ] The vector database corrupted the stored embeddings
  &gt; Database corruption would cause broader failures, not topic-specific confusion.

## Streaming response handling is preferred over synchronous handling for chat interfaces because:

&gt; Hint: Think about user experience when waiting for an AI response.

- [ ] Streaming produces higher quality responses than synchronous processing
  &gt; Response quality is identical -- the same tokens are generated. Only the delivery method differs.
- [ ] Streaming uses fewer tokens than synchronous responses
  &gt; Token usage is the same regardless of delivery method.
- [x] Users see text appearing word by word immediately rather than waiting for the entire response to complete, significantly reducing perceived latency
  &gt; Correct! Streaming delivers tokens as they&#39;re generated, so users see the response building in real time. For a response that takes 5 seconds to generate fully, the user sees the first words almost immediately rather than staring at a blank screen for 5 seconds. This progressive rendering dramatically improves the user experience.
- [ ] Streaming prevents the model from hallucinating
  &gt; Streaming doesn&#39;t affect the content generated -- it only changes how that content is delivered to the user.

## A system prompt containing 500 tokens is included in every API request. The application handles 100,000 requests per day using GPT-4o (input: $2.50/1M tokens). What is the annual cost of JUST the system prompt?

&gt; Hint: Calculate the daily token usage from the system prompt, then scale to annual cost.

- [ ] About $1,250 per year
  &gt; Check your math -- this underestimates the daily cost.
- [ ] About $4,562 per year
  &gt; This isn&#39;t the correct calculation. Work through the token math step by step.
- [x] About $45,625 per year (500 tokens x 100K requests = 50M tokens/day, at $2.50/1M = $125/day, x 365 = $45,625)
  &gt; Correct! This demonstrates why prompt optimization matters at scale. Cutting the system prompt from 500 to 200 tokens would save about $27,375 per year. Every token in a system prompt is paid for on every single request, making concise prompts a significant cost lever.
- [ ] About $125 per year
  &gt; This is the daily cost, not the annual cost. Multiply by 365.

## Which cost optimization strategy would have the LARGEST impact for an application where 70% of queries are simple FAQ-style questions and 30% require complex analysis?

&gt; Hint: Think about which single change would save the most money across all requests.

- [ ] Implementing prompt caching for repeated questions
  &gt; Caching helps but only for exact or near-duplicate queries. It doesn&#39;t help with the 30% of complex queries or the many unique simple queries.
- [x] Model selection by task complexity -- routing simple queries to a cheap model (GPT-4o mini/Haiku) and complex queries to a capable model (GPT-4o/Sonnet)
  &gt; Correct! If 70% of requests can be handled by a model that costs 10-20x less, the savings are enormous. For example, routing FAQ queries to GPT-4o mini ($0.15/1M input) instead of GPT-4o ($2.50/1M input) saves over 90% on those requests. This single optimization -- using the right model for each task -- is consistently the biggest cost lever.
- [ ] Setting max_tokens to 50 for all responses
  &gt; This would truncate complex responses and degrade quality for the 30% of queries that need detailed analysis.
- [ ] Switching to a single self-hosted model for all queries
  &gt; Self-hosting has high fixed costs and may not match the quality of frontier models for complex tasks, potentially degrading the 30% of queries that need advanced analysis.</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <atom:link href="https://example.org/chapter1/s6/activity/index.xml" rel="self" type="application/rss+xml" />
  </channel>
</rss>