Section 6 Quiz
Test Your Knowledge: Inference Techniques
Let’s see how much you’ve learned!
This quiz tests your understanding of API integration types, RAG pipeline components, embeddings, streaming vs. synchronous responses, and cost optimization strategies.
---
shuffle_answers: true
shuffle_questions: false
---
## What is the key difference between the Completion API and the Chat API?
> Hint: Think about how conversations are structured in each approach.
- [ ] The Completion API is faster while the Chat API is more accurate
> Speed and accuracy aren't the distinguishing factors between these API types.
- [x] The Completion API is designed for single-turn, self-contained interactions, while the Chat API provides structured role-based message handling (system, user, assistant) for multi-turn conversations
> Correct! The Completion API treats each interaction as standalone with no built-in conversation structure. The Chat API organizes messages with explicit roles, making it natural for ongoing dialogue. In 2025/2026, the Chat API has become the default for most applications, even for single-turn tasks.
- [ ] The Chat API only works with chatbots while the Completion API works with all applications
> Both APIs can be used for various applications, but their design patterns differ.
- [ ] The Completion API uses more tokens than the Chat API for the same task
> Token usage depends on the content, not the API type. The Completion API can actually be more token-efficient for standalone tasks.
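The structural difference can be sketched as plain request payloads. This is an illustration of the two shapes, not tied to any specific SDK; the field names mirror common conventions but are assumptions here.

```python
# Completion-style: one self-contained prompt string, no conversation structure.
def completion_payload(prompt: str) -> dict:
    return {"prompt": prompt, "max_tokens": 100}

# Chat-style: explicit roles (system, user, assistant) carry the dialogue.
def chat_payload(system: str, history: list[tuple[str, str]]) -> dict:
    messages = [{"role": "system", "content": system}]
    for role, content in history:
        messages.append({"role": role, "content": content})
    return {"messages": messages, "max_tokens": 100}

payload = chat_payload(
    "You are a helpful assistant.",
    [
        ("user", "What is RAG?"),
        ("assistant", "Retrieval-Augmented Generation."),
        ("user", "How does it reduce hallucination?"),
    ],
)
```

The chat payload keeps every turn addressable by role, which is why multi-turn context comes "for free" with the Chat API.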
## In a RAG pipeline, what is the purpose of the "chunking" step during document ingestion?
> Hint: Think about why you can't just send entire documents to the model.
- [ ] To encrypt the documents for secure storage
> Chunking is about splitting documents, not encrypting them.
- [ ] To translate documents into multiple languages
> Chunking involves splitting, not translating.
- [x] To split documents into smaller, focused pieces so that only the most relevant portions can be retrieved and included in the LLM's context window
> Correct! Chunking serves two purposes: (1) LLMs have finite context windows, so entire documents often can't be included, and (2) smaller, focused chunks lead to more precise retrieval -- when someone asks about topic X, the system retrieves just the chunks about topic X rather than entire documents that might also contain irrelevant information.
- [ ] To remove unnecessary words and reduce document size
> Chunking preserves the original text within chunks -- it splits without removing content.
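A minimal chunker makes the splitting concrete. This is a baseline fixed-size strategy with overlap; real pipelines often split on sentence or section boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    Overlap keeps context that straddles a boundary retrievable from
    either neighboring chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 500  # stand-in for a real document
chunks = chunk_text(doc, chunk_size=200, overlap=50)
```

Each chunk is embedded and indexed separately, so a query retrieves only the windows most similar to it rather than the whole document.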
## An embedding model converts the text "machine learning" into the vector [0.42, 0.78, 0.15, ...]. What makes this an "embedding" rather than just a "vector"?
> Hint: Recall the critical distinction between vectors and embeddings.
- [ ] Embeddings always have exactly 3 dimensions while vectors can have any number
> Modern embeddings typically have 256-3072 dimensions. The number of dimensions doesn't distinguish them from vectors.
- [ ] Embeddings use decimal numbers while vectors use integers
> Both can use any numerical format. The distinction isn't about number types.
- [x] The values were learned through training to capture semantic relationships -- similar concepts (like "deep learning") would have nearby values in this space
> Correct! All embeddings are vectors, but embeddings are special because their values are learned to represent meaning. "Machine learning" and "deep learning" would have similar embeddings because they're related concepts, while "machine learning" and "basketball" would be far apart. This learned semantic mapping is what makes embeddings useful for similarity search in RAG.
- [ ] Embeddings are stored in databases while vectors exist only in memory
> Storage location doesn't define the concept. Both can be stored anywhere.
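The "nearby values" idea can be shown with cosine similarity over toy vectors. The three-dimensional numbers below are hand-picked for illustration; real embedding models produce hundreds to thousands of learned dimensions.

```python
import math

# Toy "embeddings" -- hand-assigned so related concepts point the same way.
vectors = {
    "machine learning": [0.42, 0.78, 0.15],
    "deep learning":    [0.40, 0.80, 0.10],
    "basketball":       [0.90, 0.05, 0.70],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

related = cosine_similarity(vectors["machine learning"], vectors["deep learning"])
unrelated = cosine_similarity(vectors["machine learning"], vectors["basketball"])
```

With learned embeddings, the same relationship emerges from training data rather than hand assignment, which is exactly what makes similarity search possible.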
## A company's RAG system retrieves chunks about "Python programming" when a user asks about "python snakes." What stage of the RAG pipeline most likely caused this error?
> Hint: Think about which component determines what counts as "similar."
- [ ] The chunking strategy split snake information across too many pieces
> If the information exists in the knowledge base, chunking wouldn't confuse programming with snakes.
- [x] The embedding model encoded both uses of "python" with similar vectors, and the similarity search couldn't distinguish the intent
> Correct! Embeddings capture semantic similarity, and "python" in different contexts (programming language vs. snake) can produce overlapping vector representations. The embedding model and retrieval step failed to disambiguate. This is why query preprocessing and domain-specific embeddings are important for accurate RAG systems.
- [ ] The LLM hallucinated programming content instead of snake information
> The LLM responds based on what's retrieved. If programming chunks are retrieved, the LLM will answer about programming -- this is a retrieval problem, not a generation problem.
- [ ] The vector database corrupted the stored embeddings
> Database corruption would cause broader failures, not topic-specific confusion.
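The failure mode can be reproduced with a toy retrieval step. The vectors below are hand-assigned to mimic an embedding model that gave both senses of "python" overlapping representations; nothing here is a real model's output.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy corpus: the programming and snake chunks share a strong first component,
# standing in for an embedding space that conflates the two senses of "python".
corpus = {
    "Python supports list comprehensions": [0.9, 0.8, 0.1],
    "Ball pythons are nonvenomous snakes": [0.8, 0.2, 0.9],
    "Java uses static typing":             [0.7, 0.9, 0.0],
}

def retrieve(query_vec: list[float], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(corpus, key=lambda text: cosine(query_vec, corpus[text]),
                    reverse=True)
    return ranked[:k]

# A "python snakes" query whose vector leans toward the shared component
# retrieves the programming chunk instead.
ambiguous_query = [0.9, 0.7, 0.3]
top = retrieve(ambiguous_query, k=1)
```

The LLM downstream only sees what `retrieve` returns, which is why the error originates in the embedding/retrieval stage, not in generation.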
## Streaming response handling is preferred over synchronous handling for chat interfaces because:
> Hint: Think about user experience when waiting for an AI response.
- [ ] Streaming produces higher quality responses than synchronous processing
> Response quality is identical -- the same tokens are generated. Only the delivery method differs.
- [ ] Streaming uses fewer tokens than synchronous responses
> Token usage is the same regardless of delivery method.
- [x] Users see text appearing word-by-word immediately rather than waiting for the entire response to complete, significantly reducing perceived latency
> Correct! Streaming delivers tokens as they're generated, so users see the response building in real time. For a response that takes 5 seconds to generate fully, the user sees the first words within milliseconds rather than staring at a blank screen for 5 seconds. This progressive rendering dramatically improves the user experience.
- [ ] Streaming prevents the model from hallucinating
> Streaming doesn't affect the content generated -- it only changes how that content is delivered to the user.
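The difference in delivery can be sketched with a generator. `token_stream` stands in for an API's streamed chunks; it is an illustration of the pattern, not a specific SDK interface.

```python
def token_stream(response_tokens):
    """Yield tokens one at a time, as a streaming API would deliver them."""
    for token in response_tokens:
        yield token

tokens = ["Streaming", " shows", " text", " as", " it", " arrives."]

# Synchronous: nothing is shown until the full string is assembled.
full_response = "".join(tokens)

# Streaming: each chunk can be rendered the moment it is yielded.
rendered = []
for chunk in token_stream(tokens):
    rendered.append(chunk)  # in a UI, append to the visible message here
```

The final text is identical either way; streaming only moves the first visible words from "after generation finishes" to "almost immediately."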
## A system prompt containing 500 tokens is included in every API request. The application handles 100,000 requests per day using GPT-4o (input: $2.50/1M tokens). What is the annual cost of JUST the system prompt?
> Hint: Calculate the daily token usage from the system prompt, then scale to annual cost.
- [ ] About $1,250 per year
> Check your math -- this is well below the actual annual cost.
- [ ] About $4,562 per year
> This isn't the correct calculation. Work through the token math step by step.
- [x] About $45,625 per year (500 tokens × 100K requests = 50M tokens/day, at $2.50/1M = $125/day, × 365 = $45,625)
> Correct! This demonstrates why prompt optimization matters at scale. Cutting the system prompt from 500 to 200 tokens would save about $27,375 per year. Every token in a system prompt is paid for on every single request, making concise prompts a significant cost lever.
- [ ] About $125 per year
> This is the daily cost, not the annual cost. Multiply by 365.
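The arithmetic from the correct answer, spelled out step by step (prices are the ones quoted in the question):

```python
PROMPT_TOKENS = 500
REQUESTS_PER_DAY = 100_000
PRICE_PER_M_TOKENS = 2.50  # GPT-4o input price used in the question

daily_tokens = PROMPT_TOKENS * REQUESTS_PER_DAY              # 50,000,000 tokens
daily_cost = daily_tokens / 1_000_000 * PRICE_PER_M_TOKENS   # $125 per day
annual_cost = daily_cost * 365                               # $45,625 per year

# Trimming the prompt to 200 tokens scales the cost linearly.
trimmed_annual = 200 * REQUESTS_PER_DAY / 1_000_000 * PRICE_PER_M_TOKENS * 365
savings = annual_cost - trimmed_annual                       # $27,375 saved
```

Because the system prompt rides along on every request, its cost scales with traffic, not with usefulness -- hence the payoff from trimming it.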
## Which cost optimization strategy would have the LARGEST impact for an application where 70% of queries are simple FAQ-style questions and 30% require complex analysis?
> Hint: Think about which single change would save the most money across all requests.
- [ ] Implementing prompt caching for repeated questions
> Caching helps but only for exact or near-duplicate queries. It doesn't help with the 30% complex queries or the many unique simple queries.
- [x] Model selection by task complexity -- routing simple queries to a cheap model (GPT-4o mini/Haiku) and complex queries to a capable model (GPT-4o/Sonnet)
> Correct! If 70% of requests can be handled by a model that costs 10-20x less, the savings are enormous. For example, routing FAQ queries to GPT-4o mini ($0.15/1M input) instead of GPT-4o ($2.50/1M input) saves over 90% on those requests. This single optimization -- using the right model for each task -- is consistently the biggest cost lever.
- [ ] Setting max_tokens to 50 for all responses
> This would truncate complex responses and degrade quality for the 30% of queries that need detailed analysis.
- [ ] Switching to a single self-hosted model for all queries
> Self-hosting has high fixed costs and may not match the quality of frontier models for complex tasks, potentially degrading the 30% that need advanced analysis.
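A complexity router can be sketched in a few lines. The keyword heuristic, thresholds, and model names below are illustrative assumptions; a production router would typically use a small classifier or request metadata to decide.

```python
CHEAP_MODEL = "gpt-4o-mini"  # ~$0.15 / 1M input tokens (example pricing)
CAPABLE_MODEL = "gpt-4o"     # ~$2.50 / 1M input tokens (example pricing)

# Hypothetical FAQ vocabulary -- a real system would learn or curate this.
FAQ_KEYWORDS = {"hours", "price", "refund", "shipping", "password"}

def route(query: str) -> str:
    """Send short FAQ-style queries to the cheap model, everything else
    to the capable one."""
    words = {w.strip("?.!,") for w in query.lower().split()}
    if words & FAQ_KEYWORDS and len(words) < 15:
        return CHEAP_MODEL
    return CAPABLE_MODEL

simple = route("What are your hours?")
complex_query = route("Compare our churn across segments and explain the drivers")
```

If 70% of traffic takes the cheap path at roughly a 10-20x lower price, the blended cost drops far more than any per-request tweak could achieve.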