At their core, LLMs work by responding to “prompts” – text inputs that tell the model what we want it to do. Think of a prompt as a conversation starter or instruction that guides the AI’s response. However, there’s more complexity to prompts than meets the eye, especially when working with different API types, managing conversations, and optimizing for the new generation of reasoning models.
This section is designed as a hands-on tutorial. You won’t just read about prompt engineering techniques – you’ll practice them, compare approaches, and build intuition for what works and why.
## What will I get out of this?
By the end of this section, you will be able to:
Explain the concept of prompts and their role in guiding LLM responses.
Describe the key components of an effective prompt, including task instructions, context, and format specifications.
Analyze the impact of essential parameters like temperature and top-P sampling on LLM outputs.
Apply zero-shot, few-shot, chain-of-thought, and structured output techniques in practical scenarios.
Differentiate between traditional prompt engineering techniques and approaches optimized for modern reasoning models (o1, o3, R1).
Evaluate the appropriate use cases for different prompt engineering strategies based on task requirements and model capabilities.
## Prompt Engineering: As Much Art as Science
Prompt Engineering is a surprisingly complex discipline! Different models, different methods of inference, different tasks – all are criteria that influence the creation of a good prompt. While going into extreme minutiae on this is outside the scope of this course, we’ll cover general good practices and give you hands-on exercises to build intuition.
Ultimately, the best way to craft a good prompt will involve a lot of experimentation and evaluation!
## Anatomy of an Effective Prompt
```mermaid
graph LR
A["Task<br/>Instruction"] --> B["Context &<br/>Background"]
B --> C["Format<br/>Specification"]
C --> D["Examples<br/>(Few-Shot)"]
D --> E["Constraints &<br/>Guardrails"]
E --> F["Effective<br/>Prompt"]
style A fill:#1C90F3,color:#fff
style B fill:#1C90F3,color:#fff
style C fill:#1C90F3,color:#fff
style D fill:#1C90F3,color:#fff
style E fill:#1C90F3,color:#fff
style F fill:#2d5016,color:#fff
```
A well-structured prompt typically includes several key components:
**Task Instructions:**
- Clear, specific directions about what you want
- Example: "Analyze this code for security vulnerabilities"

**Context and Background:**
- Relevant information the model needs
- Previous conversation history (in chat contexts)
- Example: "Given a Python web application using Flask…"

**Format Specifications:**
- How you want the output structured
- Example: "Provide your answer in bullet points"

**Examples (Few-Shot Learning):**
- Demonstrations of desired input-output pairs
- Helps the model understand patterns

```
Input: "Hello"
Output: "Hi there! How can I help?"

Input: "What's the weather?"
Output: "I don't have access to current weather data."
```

**Constraints and Guardrails:**
- What the model should NOT do
- Output length limits, tone requirements
- Example: "Keep your response under 200 words. Do not include code examples."
## Hands-On: Core Prompting Techniques
Let’s explore the four fundamental prompting techniques with practical exercises. Each technique builds on the last, giving you a toolkit for different situations.
### Technique 1: Zero-Shot Prompting
Zero-shot prompting means asking the model to perform a task without providing any examples. You rely entirely on the model’s training to understand what you want.
```
Classify the following text as POSITIVE, NEGATIVE, or NEUTRAL:

"The new software update fixed several bugs but introduced a frustrating
new UI that makes common tasks take longer."

Classification:
```
**When to use:** Simple, well-defined tasks where the model's training is sufficient. Classification, summarization, translation, and straightforward Q&A.
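Even a zero-shot prompt is worth assembling from the components covered above (task, audience, format, constraints). A minimal sketch — the `zero_shot_prompt` helper is our own illustration, not part of any library:

```python
def zero_shot_prompt(task, audience=None, output_format=None, constraints=None):
    """Assemble a zero-shot prompt from optional components (hypothetical helper)."""
    parts = [task]
    if audience:
        parts.append(f"Audience: {audience}")
    if output_format:
        parts.append(f"Format: {output_format}")
    if constraints:
        parts.append(f"Constraints: {constraints}")
    return "\n".join(parts)

prompt = zero_shot_prompt(
    "Summarize the concept of machine learning in one sentence.",
    audience="a 10-year-old",
    constraints="Avoid jargon.",
)
print(prompt)
```

Keeping each component on its own line makes it easy to toggle audience or constraints when comparing variants, as in the exercise below.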
**Try This: Zero-Shot Exercise**
**Exercise:** Try these zero-shot prompts in any LLM (ChatGPT, Claude, Gemini, etc.) and compare the results:
1. `"Summarize the concept of machine learning in one sentence for a 10-year-old."`
2. `"Summarize the concept of machine learning in one sentence for a PhD researcher."`
3. `"Summarize machine learning."`
**What to notice:**
- How does specificity about the audience change the response?
- Which prompt gives you the most useful result?
- What happens when you add no context at all (prompt 3)?
**Key insight:** Even zero-shot prompts benefit enormously from specifying audience, format, and constraints. The difference between a vague prompt and a specific one is often the difference between a mediocre and excellent response.
### Technique 2: Few-Shot Prompting
Few-shot prompting provides the model with examples of the desired input-output pattern before presenting the actual task. This is powerful for tasks where you need a specific format or style.
```
Classify customer feedback and extract the key issue:

Feedback: "Your app crashes every time I try to upload a photo."
Classification: BUG
Key Issue: Photo upload crash

Feedback: "Would love to see a dark mode option."
Classification: FEATURE_REQUEST
Key Issue: Dark mode

Feedback: "The new search feature is amazing! Much faster than before."
Classification: POSITIVE
Key Issue: Search performance improvement

Feedback: "I can't figure out how to change my password. The settings
menu is really confusing."
Classification: ???
Key Issue: ???
```
**When to use:** When you need consistent output format, when the task is nuanced, or when zero-shot results aren't reliable enough.
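The examples-then-query pattern is easy to template so that every example follows an identical format. A sketch, assuming a `(feedback, classification, key_issue)` tuple shape (the helper name is our own):

```python
def few_shot_prompt(instruction, examples, query):
    """Build a few-shot prompt: instruction, worked examples, then the new input."""
    blocks = [instruction, ""]
    for feedback, label, issue in examples:
        blocks += [
            f'Feedback: "{feedback}"',
            f"Classification: {label}",
            f"Key Issue: {issue}",
            "",
        ]
    # End with the unanswered query so the model completes the pattern.
    blocks += [f'Feedback: "{query}"', "Classification:"]
    return "\n".join(blocks)

examples = [
    ("Your app crashes every time I try to upload a photo.", "BUG", "Photo upload crash"),
    ("Would love to see a dark mode option.", "FEATURE_REQUEST", "Dark mode"),
]
prompt = few_shot_prompt(
    "Classify customer feedback and extract the key issue:",
    examples,
    "I can't figure out how to change my password.",
)
print(prompt)
```

Generating examples from data like this also makes it trivial to run the 1-vs-3-vs-5 example experiment suggested below.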
**Try This: Few-Shot Exercise**
**Exercise:** Create a few-shot prompt for each of these tasks:
1. **Email prioritization:** Given an email subject line and first sentence, classify as HIGH, MEDIUM, or LOW priority. Create 3 examples, then test with a new email.
2. **Security log analysis:** Given a log entry, classify as NORMAL, SUSPICIOUS, or CRITICAL. Create 3 examples showing the pattern.
**Tips for good few-shot examples:**
- Cover the range of possible outputs (don't just show positive examples)
- Make examples realistic and diverse
- Keep example format consistent -- the model will mirror your pattern exactly
- 3-5 examples is usually the sweet spot (more isn't always better)
**Experiment:** Try the same task with 1 example, 3 examples, and 5 examples. Does quality improve with more examples? At what point do you see diminishing returns?
### Technique 3: Chain-of-Thought (CoT) Prompting
Chain-of-thought prompting encourages LLMs to break down complex problems into step-by-step reasoning. Instead of jumping straight to an answer, the model explains its thinking process.
```
A company has 3 servers. Each server can handle 1000 requests per second.
During peak hours, they receive 2800 requests per second. They want to add
a caching layer that reduces server load by 40%.

After adding the cache, will their current servers handle peak load?

Let's think through this step by step:
1. Current capacity: 3 servers x 1000 requests = 3000 requests/sec
2. Peak demand: 2800 requests/sec
3. Cache reduces load by 40%: 2800 x 0.40 = 1120 requests cached
4. Remaining load after cache: 2800 - 1120 = 1680 requests/sec
5. Available capacity: 3000 requests/sec
6. 1680 < 3000, so yes -- their current servers will handle peak load
   with room to spare (44% headroom).
```
**When to use:** Complex reasoning, math problems, multi-step analysis, debugging, and any task where showing work improves accuracy.
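For numeric questions like the capacity example above, it is good practice to verify the model's step-by-step arithmetic independently:

```python
servers, per_server = 3, 1000
capacity = servers * per_server      # 3000 requests/sec total
peak = 2800                          # requests/sec at peak
cached = peak * 0.40                 # 1120 requests/sec absorbed by the cache
remaining = peak - cached            # 1680 requests/sec still reach the servers
headroom = (capacity - remaining) / capacity
print(f"remaining={remaining:.0f}, capacity={capacity}, headroom={headroom:.0%}")
# remaining=1680, capacity=3000, headroom=44%
```

A few lines of code like this catch the arithmetic slips that LLMs occasionally make even when their reasoning structure is sound.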
> [!tip] The Magic Phrase
> Adding "Let's think through this step by step" to the end of a prompt has been shown to significantly improve accuracy on reasoning tasks with traditional LLMs. This simple addition triggers the model to decompose the problem rather than attempt to answer in one shot.
**Try This: Chain-of-Thought Exercise**
**Exercise:** Try this security analysis prompt both WITH and WITHOUT chain-of-thought:
**Without CoT:**
```
A web application receives a request with the parameter:
user_input="; DROP TABLE users; --"
Is this a security threat? What kind?
```
**With CoT:**
```
A web application receives a request with the parameter:
user_input="; DROP TABLE users; --"
Analyze this step by step:
1. What does the input contain?
2. What would happen if this input is passed directly to a SQL query?
3. What type of attack is this?
4. What is the severity?
5. What defenses should be in place?
```
**Compare the results.** The CoT version should provide a more thorough, structured analysis. Notice how the step-by-step structure helps the model cover all relevant aspects.
### Technique 4: Structured Output Prompting
Structured output prompting instructs the model to produce responses in a specific format – JSON, XML, tables, or other structured formats. This is essential for programmatic consumption of LLM outputs.
```
Analyze the following code snippet for security vulnerabilities.
Return your analysis as JSON with the following structure:

{
  "vulnerabilities": [
    {
      "type": "string (e.g., SQL Injection, XSS, CSRF)",
      "severity": "CRITICAL | HIGH | MEDIUM | LOW",
      "line": "number or range",
      "description": "brief explanation",
      "fix": "recommended remediation"
    }
  ],
  "overall_risk": "CRITICAL | HIGH | MEDIUM | LOW",
  "summary": "one-sentence summary"
}
```

Code:

```python
def login(username, password):
    query = f"SELECT * FROM users WHERE name='{username}' AND pass='{password}'"
    result = db.execute(query)
    return result
```
**When to use:** API integrations, data pipelines, automated workflows, and any scenario where the LLM output needs to be parsed by code.
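On the consuming side, the model's reply still has to be parsed defensively: models sometimes wrap the JSON in prose or markdown fences, and required fields can go missing. A minimal sketch — the function name and the brace-matching heuristic are our own:

```python
import json
import re

def parse_vulnerability_report(raw):
    """Pull the JSON report out of a model response and check required fields."""
    # Greedy match from the first '{' to the last '}' tolerates surrounding prose.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in response")
    report = json.loads(match.group(0))
    for field in ("vulnerabilities", "overall_risk", "summary"):
        if field not in report:
            raise ValueError(f"missing required field: {field}")
    return report

response = (
    "Here is the analysis you asked for:\n"
    '{"vulnerabilities": [{"type": "SQL Injection", "severity": "CRITICAL",'
    ' "line": "2", "description": "user input interpolated into the query",'
    ' "fix": "use parameterized queries"}],'
    ' "overall_risk": "CRITICAL",'
    ' "summary": "The login query is vulnerable to SQL injection."}'
)
report = parse_vulnerability_report(response)
print(report["overall_risk"])  # CRITICAL
```

Where a provider offers a JSON mode or schema-constrained decoding, prefer that over regex extraction; this fallback is for responses that arrive as free text.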
**Try This: Structured Output Exercise**
**Exercise:** Create a structured output prompt for each scenario:
1. **Meeting notes extraction:** Given raw meeting transcript text, extract attendees, action items, decisions, and next steps as JSON.
2. **Threat assessment:** Given a security alert description, produce a structured report with threat type, affected systems, severity, and recommended actions.
**Pro tips for structured output:**
- Provide the exact schema you want (with field names and types)
- Include an example of the expected output format
- Specify what to do when information is missing (use `null`, `"unknown"`, or skip the field?)
- For JSON output, some models support a "JSON mode" that guarantees valid JSON
**Advanced:** Try combining structured output with few-shot examples -- provide 1-2 complete examples of input-to-structured-output, then present the new input.
---
## Prompt Engineering for Reasoning Models
Modern reasoning models (like OpenAI's o1, o3, o4-mini, and DeepSeek R1) have built-in multi-step reasoning capabilities that fundamentally change how we should approach prompt engineering. Many of the explicit CoT techniques we just covered may actually **hinder** these models' performance.
### Key Differences from Traditional Models
| Aspect | Traditional LLMs | Reasoning Models |
|---------------------|-------------------------------------|-------------------------------------|
| Reasoning Process | Needs explicit CoT prompting | Has automatic internal reasoning |
| Best Prompt Style | Detailed instructions + examples | Concise, direct queries |
| Few-Shot Learning | Generally improves performance | Can actually reduce quality |
| Processing Style | Single-pass prediction | Multi-step deliberation |
| Error Handling | Requires manual iteration | Has built-in verification |
### Optimizing for Reasoning Models
**How to prompt GPT-4o, Claude, Llama, etc.:**

```
Please carefully analyze the following Python code for security
vulnerabilities. Go through it step by step:

1. First, identify all user inputs
2. Then, trace how each input flows through the code
3. Check if any input reaches a dangerous function without sanitization
4. For each vulnerability found, explain the risk and suggest a fix

Here's an example of the analysis format I want:

[... example ...]

Now analyze this code:

[code]
```

The traditional approach benefits from:

- Detailed step-by-step instructions
- Examples of expected output
- Explicit reasoning structure
- Context about the analysis approach
**How to prompt o3, o4-mini, DeepSeek R1, etc.:**

```
Find all security vulnerabilities in this Python code:

[code]
```

The reasoning model approach:

- Keep it concise – the model will automatically break down the problem
- Don't provide step-by-step instructions – this can interfere with the model's own reasoning
- Skip few-shot examples – they can constrain the model's approach
- State what you want, not how to get there

The model's internal "thinking" process will:

- Automatically decompose the analysis
- Consider multiple vulnerability categories
- Verify its findings before presenting them
- Structure the output logically
> [!warning] Common Mistake
> A frequent error is applying traditional prompting techniques to reasoning models. Telling o3 to "think step by step" is like telling a skilled detective to "remember to look for clues" -- it's unnecessary at best and distracting at worst. The model already knows how to reason; let it do its job.
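When an application targets both model families, the branching can live in code. A sketch — the model names in the set and the helper itself are illustrative, not an official API:

```python
# Hypothetical model list -- check your provider's docs for actual identifiers.
REASONING_MODELS = {"o1", "o3", "o4-mini", "deepseek-r1"}

def build_prompt(model, task, cot_steps=None):
    """Return the bare task for reasoning models; add CoT scaffolding otherwise."""
    if model.lower() in REASONING_MODELS:
        return task  # let the model's internal reasoning do the decomposition
    steps = cot_steps or ["Let's think through this step by step."]
    return task + "\n\n" + "\n".join(steps)

task = "Find all security vulnerabilities in this Python code: [code]"
print(build_prompt("o3", task))      # bare task, no scaffolding
print(build_prompt("gpt-4o", task))  # task plus the CoT trigger phrase
```

Centralizing the choice in one helper keeps prompt style consistent as models are swapped in and out of a pipeline.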
**Try This: Traditional vs. Reasoning Model Comparison**
**Exercise:** If you have access to both a traditional LLM and a reasoning model, try this comparison:
**Task:** "A farmer has 17 sheep. All but 9 run away. How many sheep does the farmer have left?"
**Prompt for traditional LLM:**
```
Let's think step by step. A farmer has 17 sheep. All but 9 run away.
How many sheep does the farmer have left?
```
**Prompt for reasoning model:**
```
A farmer has 17 sheep. All but 9 run away. How many are left?
```
**What to notice:**
- Traditional LLMs without CoT often answer "8" (17-9) instead of the correct "9"
- CoT prompting helps traditional LLMs catch the trick ("all but 9" means 9 remain)
- Reasoning models typically get this right immediately without CoT scaffolding
- Over-prompting the reasoning model may actually reduce accuracy on tricky questions
---
## Essential Parameters
While understanding prompting techniques is crucial, effectively using LLMs also requires mastering their control parameters. These parameters shape how models generate and process text:
### Temperature
Temperature controls the randomness in the model's responses:
- Low values (e.g., 0.2) --> more predictable, focused responses
- High values (e.g., 0.8) --> more creative, diverse responses
> [!tip] Think of it This Way...
> At low temperatures, LLMs stick to the most probable responses (like saying "the sky is blue"). At higher temperatures, it might get more unpredictable (like "the sky is a canvas painted in azure hues"). It is why it is often correlated with the "creativity" of the model.
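Under the hood, temperature divides the model's raw scores (logits) before the softmax that turns them into probabilities. A self-contained sketch with invented logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; low T sharpens, high T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # invented scores for "blue", "azure", "grey"
cold = softmax_with_temperature(logits, 0.2)  # top token dominates
hot = softmax_with_temperature(logits, 2.0)   # probabilities spread out
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At T = 0.2 nearly all the probability mass lands on "blue"; at T = 2.0 the alternatives become plausible samples, which is why high temperature reads as "creative".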
### Top-P (Nucleus) Sampling
While temperature affects overall randomness, top-P sampling controls which words the model considers:
- Setting p = 0.9 means the model samples only from the smallest set of top tokens whose cumulative probability reaches 90%
- Lower values --> more focused, conservative text
- Higher values --> more diverse vocabulary
Question: "What color is the sky?"

- Top-P = 0.5: "blue" (sticks to the most common answer)
- Top-P = 0.9: "azure", "cerulean", "sapphire" (considers more options)
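The sky example can be reproduced in a few lines. A sketch over an invented probability table (real samplers work on the full vocabulary, then renormalize and sample within the kept set):

```python
def nucleus(probs, p):
    """Return the smallest set of top tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

# Invented next-token distribution for "What color is the sky?"
probs = {"blue": 0.6, "azure": 0.2, "cerulean": 0.15, "sapphire": 0.05}
print(nucleus(probs, 0.5))  # ['blue']
print(nucleus(probs, 0.9))  # ['blue', 'azure', 'cerulean']
```

Because the cutoff is on cumulative probability rather than a fixed count, the nucleus shrinks when the model is confident and widens when it is uncertain.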
### Response Length
Controls how much text the model generates:
- Set by maximum token count
- Longer isn't always better
- Consider context window limits and cost
> [!warning] Context Window Trade-offs
> Remember that longer responses consume more of your context window and increase cost. A 1000-token response means 1000 fewer tokens available for future context in the conversation, and 1000 more tokens on your bill.
**Try This: Parameter Experimentation**
**Exercise:** Use the same prompt with different parameter settings and compare results.
**Prompt:** "Write a one-paragraph description of artificial intelligence."
**Settings to try:**
| Setting | Temperature | Top-P | Expected Behavior |
|---------|-------------|-------|-------------------|
| Conservative | 0.1 | 0.5 | Factual, predictable, similar across runs |
| Balanced | 0.5 | 0.8 | Good mix of accuracy and variety |
| Creative | 0.9 | 0.95 | More unique phrasing, potentially surprising |
| Maximum | 1.0 | 1.0 | Unpredictable, may include unusual word choices |
**Run each setting 3 times** and notice:
- How much variation is there between runs at each setting?
- At what point does creativity become incoherence?
- Which setting would you choose for a technical document vs. marketing copy?
---
## Best Practices Summary
**Clarity and Specificity:**
- Be explicit about what you want
- Specify the audience, format, and constraints
- Example: "Generate a Python function that calculates the Fibonacci sequence up to n terms, with type hints and docstring"

**Safety and Control:**
- Include guardrails in system messages
- Specify output constraints
- Example: "Never generate executable code without safety checks"

**Iterative Refinement:**
- Start with a simple prompt and refine based on results
- Test edge cases and failure modes
- Keep a prompt library of what works

**Common Pitfalls:**
- Overloading context windows with unnecessary information
- Mixing multiple unrelated tasks in a single prompt
- Assuming the model remembers previous conversations without proper context
- Not setting clear boundaries in system messages
- Using CoT techniques with reasoning models (let them reason independently)
- Providing too many few-shot examples (3-5 is usually optimal)

**Format Mistakes:**
- Ambiguous instructions ("make it better" – better how?)
- Missing output format specification
- No error handling instructions ("if you're unsure, say so" vs. making up answers)
> [!note] A Note on API Types
> When implementing LLMs, you'll use either a Completion API (for single-turn interactions) or a Chat API (for multi-turn conversations). Each has strengths for different scenarios. We'll explore these integration patterns in detail in the next section on Inference Techniques, but it's important to consider which API you'll use as it affects how you structure your prompts.
## Key Takeaways

- An effective prompt includes task instructions, context, format specifications, examples, and constraints
- Zero-shot, few-shot, chain-of-thought, and structured output are the four fundamental prompting techniques, each suited to different task complexities
- Reasoning models (o1, o3, R1) have built-in deliberation and perform best with concise, direct prompts rather than explicit step-by-step scaffolding
- Temperature and Top-P sampling parameters control the randomness and creativity of model outputs
- Iterative refinement – starting simple and adjusting based on results – is more effective than attempting a perfect prompt on the first try
---
> [!note] Test Your Knowledge
> Ready to test your understanding of prompt engineering? Head to the [quiz](activity/) to check your knowledge.
---
> [!important] Up next
> Now that we've explored how to effectively communicate with AI models through well-crafted prompts, let's dive into the technical approaches for integrating these models into applications. In the next section, we'll examine different inference techniques, RAG pipelines, and cost optimization strategies.