<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>4. Technical Foundations :: Introduction to AI Security</title>
    <link>https://example.org/chapter1/s4/index.html</link>
    <description>Introduction Now that we’ve explored how AI evolved into its current form, let’s lift the hood and examine the engine that powers large language models (LLMs). These systems are marvels of engineering, built on a foundation of interconnected components that work together to process and generate human-like text.&#xA;What will I get out of this? By the end of this section, you will be able to:</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <atom:link href="https://example.org/chapter1/s4/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Section 4 Quiz</title>
      <link>https://example.org/chapter1/s4/activity/index.html</link>
      <guid>https://example.org/chapter1/s4/activity/index.html</guid>
      <description>Test Your Knowledge: Technical Foundations

Let’s see how much you’ve learned! This quiz tests your understanding of tokenization, context windows, Transformer architecture, Mixture-of-Experts, memory implementations, and the fundamentals of how LLMs process and generate text.

---
shuffle_answers: true
shuffle_questions: false
---

## A legal document contains 4,000 words of specialized terminology. When processed by an LLM, the token count comes to approximately 6,500. Why is the token count so much higher than the word count?

&gt; Hint: Think about how tokenizers handle uncommon or specialized words.

- [ ] The tokenizer adds extra tokens for formatting and punctuation
&gt; While punctuation creates some additional tokens, this alone wouldn&#39;t account for a 60% increase.
- [x] Specialized legal terminology often gets split into multiple tokens because these terms are less common in the model&#39;s training data
&gt; Correct! Tokenizers are trained to efficiently encode common words as single tokens, but rare or specialized terms (like legal jargon) get split into sub-word pieces. &#34;Indemnification&#34; might become &#34;indemn&#34; + &#34;ification&#34; while common words like &#34;the&#34; remain single tokens. This is why the 0.75 words-per-token average doesn&#39;t hold for specialized text.
- [ ] The model duplicates tokens for accuracy
&gt; Models don&#39;t duplicate tokens. Each token represents a unique position in the sequence.
- [ ] Legal documents require double encoding for security
&gt; There is no &#34;double encoding&#34; in standard tokenization. All text is tokenized the same way regardless of content type.

## Gemini 1.5 Pro offers a context window of over 2 million tokens. What practical capability does this enable that a 128K-token model cannot provide?

&gt; Hint: Think about what you can fit into 2 million tokens versus 128K tokens.
- [ ] It allows the model to train itself on new information during the conversation
&gt; Models don&#39;t retrain during conversations regardless of context window size. Weights are fixed.
- [ ] It makes the model respond faster
&gt; Larger context windows actually increase latency and cost, not speed.
- [x] It can process multiple novels, entire codebases, or hundreds of documents simultaneously in a single interaction
&gt; Correct! 2M+ tokens is equivalent to roughly 1.5 million words -- enough to process several books, an entire large codebase, or hundreds of documents at once. A 128K model can handle about 96,000 words (one long book), but the 2M model can handle over 15 times more in a single session.
- [ ] It eliminates the need for RAG-based retrieval systems
&gt; While very large context windows reduce some RAG use cases, cost considerations (you pay per token) and &#34;lost in the middle&#34; problems mean RAG remains valuable even with massive context windows.

## What is the &#34;lost in the middle&#34; problem in long-context models?

&gt; Hint: Think about how attention patterns work across very long sequences.

- [ ] Data in the middle of the context window gets deleted during processing
&gt; Data isn&#39;t deleted -- all tokens remain in the context. The issue is about attention, not data loss.
- [ ] Models can only read the first and last paragraphs of a document
&gt; Models process all tokens, but their attention allocation isn&#39;t uniform.
- [x] Models may struggle to attend to information positioned in the middle of very long contexts, giving more weight to information near the beginning and end
&gt; Correct! Research has shown that LLMs exhibit a recency and primacy bias -- they tend to pay more attention to information at the start and end of long contexts, sometimes missing critical details in the middle. This is why context management and strategic placement of information remain important even with massive context windows.
- [ ] The model produces garbled text in the middle of its responses
&gt; Output quality issues aren&#39;t related to the &#34;lost in the middle&#34; phenomenon, which is about input attention patterns.

## How does Mixture-of-Experts (MoE) architecture achieve better efficiency than a standard dense Transformer?

&gt; Hint: Consider what happens to parameters during each inference request.

- [ ] It compresses all parameters into a smaller space
&gt; MoE doesn&#39;t compress parameters -- it selectively activates them.
- [ ] It removes unnecessary parameters after training
&gt; Parameters aren&#39;t removed. All expert weights are maintained and available.
- [x] It routes each token to a subset of specialized &#34;expert&#34; sub-networks rather than activating every parameter, reducing per-request compute while maintaining total model capability
&gt; Correct! MoE uses a gating mechanism to select which experts process each token. DeepSeek V3 has 671B total parameters but activates only 37B per request. This means you get the breadth of knowledge from many experts while only paying the compute cost of the active subset.
- [ ] It trains faster because each expert learns independently
&gt; While experts do specialize, MoE&#39;s primary advantage is inference efficiency, not training speed.

## An LLM chatbot appears to &#34;forget&#34; information a user shared 30 messages ago in a conversation. What is the most likely technical explanation?

&gt; Hint: Remember the fundamental nature of LLMs regarding state and memory.

- [ ] The model&#39;s weights degraded during the conversation
&gt; Model weights are fixed and never change during inference or conversations.
- [ ] The model intentionally ignored the old information
&gt; LLMs don&#39;t have intentions. The issue is technical, not behavioral.
- [x] The conversation exceeded the context window limit, and earlier messages were truncated or summarized to fit
&gt; Correct!
&gt; LLMs are inherently stateless -- they have no built-in memory between requests. What we call &#34;memory&#34; is actually conversation history included in the prompt. Once a conversation exceeds the context window, older messages must be dropped or summarized, causing the model to lose access to that information.
- [ ] The user&#39;s messages were corrupted in the database
&gt; While possible, this is a general software issue, not the most likely explanation given how LLMs work.

## A company implements RAG (Retrieval-Augmented Generation) instead of relying solely on a long-context model. Which scenario best justifies this decision?

&gt; Hint: Think about the trade-offs between including everything in context versus retrieving only what&#39;s relevant.

- [ ] They want the simplest possible architecture with no additional infrastructure
&gt; RAG adds complexity (vector database, embedding pipeline). A long-context model is simpler to implement.
- [x] They have a knowledge base of 100,000 documents that changes daily, and want cost-efficient queries that only retrieve relevant information per request
&gt; Correct! RAG excels when you have large, frequently updated knowledge bases. Including 100K documents in context on every request would be prohibitively expensive and slow. RAG retrieves only the 3-10 most relevant chunks per query, dramatically reducing cost while keeping information current through real-time updates to the vector database.
- [ ] They need the model to understand relationships between all documents simultaneously
&gt; For cross-document relationship understanding, long context might actually be better, since RAG only retrieves specific chunks.
- [ ] They want to avoid using embeddings
&gt; RAG relies on embeddings for similarity search. If avoiding embeddings, RAG wouldn&#39;t be the right choice.
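As an aside on the statelessness point above: the way a chat client refits history into a fixed context window can be sketched in a few lines of Python. This is a hypothetical illustration, not any vendor&#39;s implementation; `estimate_tokens` uses the rough 0.75 words-per-token heuristic mentioned earlier rather than a real tokenizer.

```python
# Hypothetical sketch: why an LLM "forgets" old messages. The model is
# stateless, so the client must refit conversation history into a fixed
# token budget on every request; whatever doesn't fit is simply dropped.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~0.75 words per token, so tokens = words / 0.75.
    return max(1, round(len(text.split()) / 0.75))

def fit_history(messages: list[str], budget: int) -> list[str]:
    # Walk backwards from the newest message, keeping messages until
    # the estimated token budget is exhausted.
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break  # everything older than this is lost to the model
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = [f"message {i}: " + "word " * 20 for i in range(40)]
window = fit_history(history, budget=300)
# Only the most recent messages survive; the earliest ones never
# reach the model at all.
```

With a 300-token budget and 40 messages of roughly 29 estimated tokens each, only the last 10 messages fit; everything earlier is silently gone, which is exactly the &#34;forgetting&#34; the question describes.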
## Efficient attention mechanisms (like Flash Attention and Grouped Query Attention) solve what fundamental problem with the original Transformer architecture?

&gt; Hint: Think about how computational cost scales with sequence length in standard attention.

- [ ] They allow Transformers to process images in addition to text
&gt; Efficient attention is about computational scaling, not multimodal capability.
- [ ] They eliminate the need for GPU hardware
&gt; Efficient attention reduces compute requirements but still requires GPU acceleration.
- [x] Standard attention has quadratic cost with sequence length -- doubling the input quadruples the compute -- and efficient attention variants reduce this to make long-context models practical
&gt; Correct! The original self-attention mechanism computes relationships between every pair of tokens, so cost grows quadratically (O(n^2)) with sequence length. At 2M tokens this would be computationally intractable without efficient attention variants, which reduce memory usage and computation while preserving model quality.
- [ ] They make models train faster but don&#39;t affect inference
&gt; Efficient attention benefits both training and inference, making long sequences practical at both stages.</description>
    </item>
  </channel>
</rss>