<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Section 5 Quiz :: Introduction to AI Security</title>
    <link>https://example.org/chapter1/s5/activity/index.html</link>
    <description>Test Your Knowledge: Prompt Engineering

Let’s see how much you’ve learned! This quiz tests your understanding of prompt engineering techniques, reasoning model optimization, parameter tuning, and best practices for effective LLM interaction.

---
shuffle_answers: true
shuffle_questions: false
---

## A developer wants to use an LLM to consistently classify customer feedback into categories (BUG, FEATURE_REQUEST, POSITIVE, NEGATIVE). Which prompting technique would be most effective?

&gt; Hint: Think about which technique ensures consistent output format across multiple inputs.

- [ ] Zero-shot prompting with a simple instruction
  &gt; While zero-shot might work for basic classification, it often produces inconsistent output formats and may miss nuanced categories.
- [x] Few-shot prompting with 3-5 examples showing the desired input-output pattern for each category
  &gt; Correct! Few-shot prompting is ideal for classification tasks that require consistent formatting. By providing examples of each category, the model learns the exact pattern, the output format, and how to distinguish between similar categories. 3-5 diverse examples are usually the sweet spot.
- [ ] Chain-of-thought prompting that asks the model to reason through each classification
  &gt; CoT adds unnecessary complexity for straightforward classification. It&#39;s better suited for multi-step reasoning problems.
- [ ] Maximum temperature settings to explore creative classification options
  &gt; High temperature would increase randomness, making classification less reliable. Classification tasks benefit from low temperature (0.1-0.3) for consistency.

## When using a reasoning model like o3 or DeepSeek R1, which prompting approach is most effective?

&gt; Hint: Consider how reasoning models fundamentally differ from traditional LLMs in their processing.

- [ ] Detailed step-by-step instructions telling the model exactly how to analyze the problem
  &gt; This can actually hinder reasoning models. Providing explicit step-by-step instructions may interfere with their built-in deliberation process.
- [ ] Providing many few-shot examples to demonstrate the desired reasoning pattern
  &gt; Few-shot examples can constrain a reasoning model&#39;s approach, potentially reducing quality compared to letting it reason independently.
- [x] A concise, direct prompt that states what you want without prescribing how to get there
  &gt; Correct! Reasoning models have built-in multi-step reasoning capabilities. They automatically decompose problems, evaluate approaches, and verify their work. Telling them &#34;how to think&#34; is like telling a skilled detective to &#34;remember to look for clues&#34; -- unnecessary at best, counterproductive at worst.
- [ ] Adding &#34;Let&#39;s think step by step&#34; to trigger chain-of-thought reasoning
  &gt; This magic phrase is effective for traditional LLMs but unnecessary for reasoning models that already think step by step internally. It can actually interfere with their reasoning process.

## A security analyst wants an LLM to produce a structured vulnerability report in JSON format. Which combination of prompt elements would be most effective?

&gt; Hint: Think about which prompt components ensure both the right content and the right format.

- [ ] Just asking &#34;Find vulnerabilities in this code and output JSON&#34;
  &gt; This is too vague. Without a schema, the model will produce inconsistent JSON structures across different requests.
- [ ] Providing only the JSON schema without any examples
  &gt; A schema alone helps but may not produce consistent results without at least one example showing how to map findings to the schema.
- [x] Clear task instructions, the exact JSON schema with field types, and one example showing a complete input-to-output mapping
  &gt; Correct! Structured output prompting works best when you combine: (1) clear instructions about the analysis task, (2) the exact schema you want (with field names, types, and possible values), and (3) at least one complete example. This ensures both analytical quality and format consistency.
- [ ] Chain-of-thought reasoning followed by a request to &#34;format as JSON&#34;
  &gt; Asking for reasoning first and JSON formatting second often produces messy results where the reasoning gets mixed into the JSON output.

## What is the effect of setting temperature to 0.1 versus 0.9 when generating responses?

&gt; Hint: Consider what &#34;randomness&#34; means in the context of token selection.

- [ ] Temperature 0.1 produces shorter responses while 0.9 produces longer ones
  &gt; Temperature controls randomness in word selection, not response length. Response length is controlled by max_tokens.
- [x] Temperature 0.1 produces more deterministic, focused responses while 0.9 produces more creative, varied responses -- running the same prompt multiple times shows much more variation at 0.9
  &gt; Correct! Temperature controls the probability distribution over the next token. At low temperature, the model strongly favors the most probable tokens, producing consistent and predictable output. At high temperature, the distribution flattens, giving less probable tokens a better chance and leading to more creative but less predictable output.
- [ ] Temperature 0.1 is for text tasks while 0.9 is for code generation
  &gt; Temperature applies to all generation tasks. The optimal value depends on the desired creativity-consistency trade-off, not the task category.
- [ ] There is no meaningful difference between these settings
  &gt; Temperature has a significant effect on output characteristics, especially at extreme values.

## Top-P (nucleus) sampling set to 0.5 means:

&gt; Hint: Think about which tokens the model considers when generating each word.

- [ ] The model only outputs 50% of its intended response
  &gt; Top-P doesn&#39;t affect response length. It affects which tokens are considered at each generation step.
- [ ] The model uses 50% of its parameters for inference
  &gt; Top-P has nothing to do with parameter utilization. It&#39;s a sampling strategy for token selection.
- [x] The model only considers the smallest set of tokens whose cumulative probability reaches 50%, ignoring less likely alternatives
  &gt; Correct! Top-P sampling filters token selection to include only the most probable tokens that together account for the specified probability mass. At p=0.5, only tokens in the top 50% of probability mass are considered. This is more focused than p=0.9, which considers the top 90% -- resulting in more conservative, predictable text.
- [ ] The model processes the input at 50% speed for better quality
  &gt; Top-P doesn&#39;t affect processing speed. It affects the diversity of token selection during generation.

## A prompt includes: &#34;You are an expert cybersecurity analyst. Never generate executable code without safety checks. Keep responses under 300 words. If unsure about a threat classification, say so.&#34; Which prompt component categories are represented here?

&gt; Hint: Match each instruction to the anatomy of an effective prompt.

- [ ] Only task instructions and format specifications
  &gt; This misses the constraint/guardrail elements that are clearly present.
- [ ] Only constraints and examples
  &gt; There are no examples in this prompt, and task context (the role) is also present.
- [x] Context (role assignment), constraints/guardrails (no unsafe code, word limit), and error handling instructions (acknowledge uncertainty)
  &gt; Correct! The prompt combines: (1) context/background through role assignment (&#34;expert cybersecurity analyst&#34;), (2) constraints, including a safety guardrail (&#34;never generate executable code without safety checks&#34;) and a format specification (&#34;under 300 words&#34;), and (3) an error handling instruction (&#34;if unsure, say so&#34;). Only task instructions and examples are missing.
- [ ] Examples, task instructions, and format specifications
  &gt; There are no examples (few-shot demonstrations) in this prompt.

## An application sends the same prompt to an LLM 100 times at temperature 0 and gets slightly different outputs on 3 occasions. What is the most likely explanation?

&gt; Hint: Consider what &#34;deterministic&#34; means in practice for LLM inference.

- [ ] Temperature 0 doesn&#39;t actually work and is just a marketing feature
  &gt; Temperature 0 does significantly reduce randomness, but the system isn&#39;t perfectly deterministic.
- [x] Even at temperature 0, GPU floating-point arithmetic, batching, and parallel processing can introduce tiny numerical differences that occasionally lead to different token selections
  &gt; Correct! Temperature 0 makes the model select the highest-probability token at each step, but floating-point operations on GPUs aren&#39;t perfectly deterministic. When two tokens have very similar probabilities, minute numerical differences from parallel processing can tip the selection differently. This is a known behavior in production LLM systems.
- [ ] The model&#39;s weights are updating between requests
  &gt; Model weights don&#39;t change during inference. They are fixed after training.
- [ ] The API is randomly selecting different models behind the scenes
  &gt; While load balancing exists, this wouldn&#39;t explain the pattern. The explanation lies in hardware-level numerics.</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <atom:link href="https://example.org/chapter1/s5/activity/index.xml" rel="self" type="application/rss+xml" />
  </channel>
</rss>