6. Inference Techniques

Introduction

Now that we’ve explored the fundamentals of LLMs, key players, deployment considerations, technical foundations, and the art of prompt engineering, it’s time to dive into how these models actually operate in real-world applications. This section examines the technical aspects of inference – the process where LLMs generate responses to our inputs – focusing on API integration patterns, response handling, knowledge integration with RAG, and cost optimization strategies.

This section is hands-on: you’ll follow along with practical examples, experiment with different approaches, and build understanding of how to make inference work efficiently for real applications.

What will I get out of this?

By the end of this section, you will be able to:

  1. Differentiate between the Completion API and Chat API, describing their features, use cases, and advantages.
  2. Analyze response handling strategies (synchronous vs. streaming) and evaluate their suitability for different applications.
  3. Explain the RAG pipeline – from document ingestion through chunking, embedding, retrieval, and generation.
  4. Describe the concept of embeddings, including how they represent semantic relationships between concepts.
  5. Apply cost optimization techniques including model selection, caching, batching, and hybrid routing.
  6. Build a mental model of the RAG architecture using the data flow diagram to understand how each component contributes.

API Integration Types

When integrating with Large Language Models, you have two primary approaches: the Completion API and the Chat API. While both facilitate interactions with LLMs, they are designed for different types of tasks and use cases.

Completion API

The Completion API is ideal for single-turn interactions, where you send one prompt and receive one response. It offers precise control over the input structure and is best suited for isolated tasks.

Key Features:

  • Simplicity: Each interaction is self-contained, with no built-in conversation structure.
  • Flexibility: You can design prompts exactly as needed.
  • Token Efficiency: Typically uses fewer tokens for standalone tasks.
  • Use Cases: Generating content, text summarization, one-off queries.

Enhancing Completion API with Templates

Developers often employ templating techniques to improve prompt clarity:

system
You are a helpful coding tutor who explains concepts clearly.

user
What is a for loop?

assistant
A for loop is a control flow statement that repeats a block of code
a specified number of times...

user
Can you give me an example?

assistant

These plain role labels separate the different parts of the prompt clearly. A more explicit variant uses custom delimiters:

### System ###
You are a helpful coding tutor who explains concepts clearly.

### User ###
What is a for loop?

### Assistant ###
A for loop is a control structure used to iterate over a sequence...

### User ###
Can you give me an example?

### Assistant ###

Custom delimiters enable clear section boundaries within your prompt.

Did You Notice?

Leaving the ‘assistant’ section open ensures the model continues the pattern established above, rather than breaking the expected format or sequence.
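The delimiter pattern above is easy to generate programmatically. Here is a minimal sketch in Python; the `build_prompt` helper and its exact delimiter style are illustrative, not any particular library's API:

```python
def build_prompt(system: str, turns: list[tuple[str, str]]) -> str:
    """Assemble a delimiter-based prompt, leaving the final
    Assistant section open so the model continues the pattern."""
    parts = [f"### System ###\n{system}"]
    for role, text in turns:
        parts.append(f"### {role} ###\n{text}")
    parts.append("### Assistant ###")  # left open for the completion
    return "\n\n".join(parts)

prompt = build_prompt(
    "You are a helpful coding tutor who explains concepts clearly.",
    [("User", "What is a for loop?")],
)
print(prompt)
```

The open `### Assistant ###` section at the end is what cues the model to complete in the assistant's voice.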


Chat API

The Chat API is specifically built for multi-turn, interactive conversations. It organizes messages into discrete units, each assigned a role:

[
  {
    "role": "system",
    "content": "You are a helpful coding tutor who explains concepts clearly."
  },
  {
    "role": "user",
    "content": "What is a for loop?"
  },
  {
    "role": "assistant",
    "content": "A for loop is a control structure that allows you to execute a block of code repeatedly."
  },
  {
    "role": "user",
    "content": "Can you show me an example?"
  }
]

Key Features:

  • Structured Dialogue: Well-defined message roles (system, user, assistant).
  • Multi-Turn: Handles conversational exchanges naturally.
  • Enhanced Clarity: Each message is explicitly tagged.
  • Use Cases: Chatbots, interactive assistants, back-and-forth dialogue.
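In code, a Chat API conversation is just a growing list of role-tagged messages. A minimal, provider-agnostic sketch (no real API call is made; the helper names are illustrative):

```python
def make_conversation(system_prompt: str) -> list[dict]:
    """Start a conversation history with a system message."""
    return [{"role": "system", "content": system_prompt}]

def add_turn(messages: list[dict], user_text: str, assistant_text: str) -> None:
    """Append one user/assistant exchange to the running history."""
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})

history = make_conversation("You are a helpful coding tutor.")
add_turn(history, "What is a for loop?",
         "A for loop repeats a block of code for each item in a sequence.")
history.append({"role": "user", "content": "Can you show me an example?"})
# `history` is now the payload you would send as the `messages`
# parameter to a chat endpoint.
```

Each request sends the whole history, which is how the model "remembers" earlier turns.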

Choosing the Right API

| Criteria | Completion API | Chat API |
|----------|----------------|----------|
| Best for | Single tasks, batch processing | Conversations, interactive apps |
| Context management | Manual (you build the full prompt) | Structured (role-based messages) |
| Token efficiency | Higher for single queries | Higher for multi-turn conversations |
| Complexity | Simpler to implement | Better developer experience |
Which One Should You Use?

In practice, the Chat API has become the default for most applications in 2025/2026. Even for “single-turn” tasks, the structured role format (system + user) provides cleaner separation of instructions from input. The Completion API is mostly relevant for legacy integrations and specialized use cases.


Response Handling

The way you receive and process responses from an LLM significantly impacts your application’s user experience:

Synchronous Responses: wait for the complete response before processing

Best for batch processing and applications where immediate feedback isn’t critical.

Advantages:

  • Simpler to implement
  • Easier to validate and process complete responses
  • Better for systems that need to analyze full responses before proceeding

Disadvantages:

  • Longer perceived latency (user sees nothing until complete)
  • No intermediate feedback

Use when: Processing data pipelines, batch classification, generating structured data, backend processing.

Streaming Responses: real-time token delivery as the model generates

Ideal for interactive applications and chat interfaces.

Advantages:

  • Better user experience with immediate feedback
  • Allows progressive rendering (text appears word-by-word)
  • Can implement typing indicators
  • Users can interrupt if response goes off-track

Disadvantages:

  • More complex to implement
  • Requires handling partial responses
  • Additional error handling for interrupted streams

Use when: Chat interfaces, interactive assistants, real-time writing assistance, any user-facing application.
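To see why streaming feels different, it helps to simulate it. The sketch below fakes a token stream with a generator; real providers deliver server-sent events, so this shows only the consumption pattern:

```python
from typing import Iterator

def fake_stream(text: str) -> Iterator[str]:
    """Simulate a model emitting one token (here, one word) at a time."""
    for word in text.split():
        yield word + " "

def consume_stream(stream: Iterator[str]) -> str:
    """Render tokens as they arrive and accumulate the full reply."""
    chunks = []
    for token in stream:
        print(token, end="", flush=True)  # progressive rendering
        chunks.append(token)
    return "".join(chunks)

reply = consume_stream(fake_stream("A for loop repeats a block of code"))
```

The accumulation step matters: streaming applications must reconstruct the full response for logging and validation even while rendering it incrementally.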


Knowledge Integration Strategies

When working with LLMs, you’ll need to decide how to provide the model with the information it needs. There are three main approaches, each suited for different scenarios:

  1. Base Model Only – Use the model’s training knowledge directly. Simple but limited to training cutoff date.

  2. In-Prompt Static Data – Include necessary information directly in the context window. Good for small, contained datasets.

  3. Retrieval-Augmented Generation (RAG) – Dynamically retrieve relevant information from external sources. The foundation of most enterprise AI applications.

Why is RAG Important?

Retrieval-Augmented Generation, in its many flavors, is the foundation of most current enterprise applications. It lets the LLM access proprietary data, records, and archives that are variable or frequently updated, rather than data that remains static long term. Static data could in principle be baked into a model by fine-tuning, but repeating that process regularly for frequently changing data would be impractical and costly.


The RAG Pipeline: A Practical Walkthrough

RAG bridges the gap between an LLM’s static training knowledge and the dynamic information your application needs. Let’s walk through exactly how it works.

How RAG Works: The Data Flow

flowchart LR
    subgraph Ingestion["1. Document Ingestion"]
        A[Documents] --> B[Chunking]
        B --> C[Embedding Model]
        C --> D[(Vector Database)]
    end

    subgraph Query["2. Query Time"]
        E[User Query] --> F[Embed Query]
        F --> G[Similarity Search]
        D --> G
        G --> H[Relevant Chunks]
    end

    subgraph Generation["3. Response Generation"]
        H --> I[Build Prompt]
        E --> I
        I --> J[LLM]
        J --> K[Response]
    end

    style A fill:#2d5016,color:#fff
    style D fill:#1a3a5c,color:#fff
    style J fill:#4a1a5c,color:#fff
    style K fill:#2d5016,color:#fff

Step 1: Document Ingestion

Before your RAG system can answer questions, it needs to process and store your documents:

Chunking: Documents are split into smaller pieces (chunks). This is critical because:

  • LLMs have context window limits – you can’t send entire documents
  • Smaller, focused chunks lead to more relevant retrieval
  • Chunk size affects both retrieval quality and cost
Deep Dive: Chunking Strategies
| Strategy | Chunk Size | Best For | Trade-off |
|----------|-----------|----------|-----------|
| **Fixed-size** | 256-512 tokens | General-purpose, simple implementation | May split mid-sentence or mid-thought |
| **Sentence-based** | 1-5 sentences | FAQ systems, factual retrieval | Chunks may be too small for complex topics |
| **Paragraph-based** | Natural paragraphs | Technical documentation, articles | Uneven chunk sizes, some may be too large |
| **Semantic** | Varies | Research papers, complex documents | More complex to implement, requires NLP processing |
| **Overlapping** | Any + 10-20% overlap | Preventing context loss at chunk boundaries | Increases storage and processing cost |

**Rule of thumb:** Start with fixed-size chunks of 256-512 tokens with 10% overlap. Adjust based on retrieval quality. Most enterprise RAG systems end up with a chunking strategy tailored to their specific document types.
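Fixed-size chunking with overlap takes only a few lines. In this sketch, "tokens" are approximated by whitespace-split words, which is a simplification of how real tokenizers work:

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 25) -> list[str]:
    """Split text into fixed-size word chunks, with `overlap` words
    repeated between consecutive chunks so that ideas spanning a
    boundary appear in both chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# 600 words -> 3 chunks; consecutive chunks share 25 boundary words
chunks = chunk_text("word " * 600, chunk_size=256, overlap=25)
```

Production systems typically swap the word split for a proper tokenizer and add sentence-boundary awareness, but the sliding-window structure stays the same.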

Embedding: Each chunk is converted into a vector (embedding) that captures its semantic meaning. The embedding model (like OpenAI’s text-embedding-3-small or open-source alternatives like BGE-M3) converts text into a high-dimensional numerical representation where similar meanings are near each other in vector space.

Storage: The embeddings are stored in a vector database (Pinecone, Weaviate, Chroma, pgvector, etc.) along with the original text chunks and any metadata.

Step 2: Query Time

When a user asks a question:

  1. Embed the query: The same embedding model converts the user’s question into a vector
  2. Similarity search: The vector database finds the chunks whose embeddings are most similar to the query embedding (using cosine similarity or other distance metrics)
  3. Retrieve top results: Typically the top 3-10 most relevant chunks are returned
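The steps above can be illustrated with toy vectors and plain cosine similarity. Real systems use an embedding model and a vector database; here the "embeddings" are hand-written three-dimensional vectors purely for illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means identical direction in vector space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "vector database": chunk text -> pretend embedding
store = {
    "Returns accepted within 30 days": [0.9, 0.1, 0.0],
    "Shipping takes 3-5 business days": [0.1, 0.9, 0.1],
    "Electronics carry a 1-year warranty": [0.8, 0.2, 0.1],
}

# Pretend embedding of the query "What is the return policy?"
query_vec = [0.85, 0.15, 0.05]

# Similarity search: rank chunks by cosine similarity, keep the top 2
top = sorted(store, key=lambda chunk: cosine(query_vec, store[chunk]),
             reverse=True)[:2]
```

Note how the shipping chunk is filtered out: its vector points in a different direction from the query, which is exactly what "semantic mismatch" means geometrically.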

Step 3: Response Generation

The retrieved chunks are combined with the user’s question into a prompt:

You are a helpful assistant. Use the following context to answer the
user's question. If the answer is not in the context, say so.

Context:
[Retrieved chunk 1]
[Retrieved chunk 2]
[Retrieved chunk 3]

User question: What is the company's return policy for electronics?

The LLM then generates a response grounded in the retrieved information, significantly reducing hallucination risk.
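Assembling the final prompt is plain string work. A minimal sketch of the pattern shown above (the helper name is illustrative):

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt,
    instructing the model to stay grounded in the provided context."""
    context = "\n".join(f"[{c}]" for c in chunks)
    return (
        "You are a helpful assistant. Use the following context to answer the\n"
        "user's question. If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"User question: {question}"
    )

prompt = build_rag_prompt(
    "What is the company's return policy for electronics?",
    ["Returns accepted within 30 days", "Electronics carry a 1-year warranty"],
)
```

The "if the answer is not in the context, say so" instruction is the anti-hallucination guard: it gives the model an explicit escape hatch when retrieval comes back empty-handed.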

Try This: Mental Model Exercise
**Exercise:** Without building anything, trace through this RAG scenario mentally:

**Scenario:** A company has 500 pages of product documentation. A customer asks: "Does the XR-500 support Bluetooth 5.0?"

1. **During ingestion:** How would you chunk product documentation? (By product? By feature? By page?)
2. **At query time:** What would the embedding of this question be close to in vector space?
3. **Retrieval:** What chunks would likely be retrieved? What chunks might be incorrectly retrieved?
4. **Generation:** What should the prompt look like? How do you handle the case where no chunk mentions Bluetooth?

**Key insight:** The quality of RAG depends on every stage of the pipeline. Bad chunking leads to irrelevant retrieval. Poor embeddings lead to semantic mismatches. And the final prompt engineering determines how well the LLM uses the retrieved context.

Technical Components

Vector Operations and Embeddings

To understand how LLMs and RAG systems process text, we need to grasp two fundamental concepts: vectors and embeddings.

What is a Vector?

A vector is simply a list of numbers that can represent a point in space:

  • A 2D vector [3, 4] represents a point 3 units east and 4 units north
  • A 3D vector [1, 2, 3] represents a point in three-dimensional space

In LLMs, we use vectors with hundreds or thousands of dimensions. But at their core, vectors are just containers for numbers – they don’t inherently mean anything.

Think of it This Way…

Think of vectors like barcodes – they’re just sequences of numbers that don’t mean anything on their own. Embeddings are like barcodes that have been configured to represent specific products. When you scan a barcode, the numbers suddenly have meaning because they’ve been trained to represent that product.

All embeddings are vectors, but not all vectors are embeddings!


What is an Embedding?

An embedding is a specific use of vectors for representing meaning. What makes embeddings special is that:

  1. They are learned through training – the model learns what numbers to put in each vector
  2. They have meaningful relationships – similar concepts get similar numbers
# Vectors become embeddings when trained to represent words:
"cat"  = [0.2, 0.5, 0.1]  # Embedding for "cat"
"dog"  = [0.3, 0.4, 0.2]  # Embedding for "dog"

# Their similarity reflects that both are pets
# Their difference reflects that they're different species
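"Near each other in vector space" can be checked directly. Using the toy vectors above plus a made-up vector for an unrelated concept (the "airplane" values are invented for illustration):

```python
import math

cat = [0.2, 0.5, 0.1]
dog = [0.3, 0.4, 0.2]
airplane = [0.9, 0.0, 0.8]  # hypothetical unrelated concept

# Smaller distance = more similar meaning: "cat" is nearer to
# "dog" than to "airplane" in this toy embedding space.
closer = math.dist(cat, dog) < math.dist(cat, airplane)
```

Real embedding spaces have hundreds or thousands of dimensions, but the same geometric intuition applies.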
Key Takeaway

The power of embeddings lies in how they capture relationships between concepts. Just like how we naturally understand that “kitten” is related to “cat” and “puppy” is related to “dog,” embeddings allow AI models to understand these connections through carefully chosen numbers.

Deep Dive: How Embedding Dimensions Work
To understand dimensions, imagine each dimension represents a trait -- the first being the number of legs, the second being size, the third being domestication level, and so on. In a trained embedding space:

  • The "cat" embedding is close to "kitten" (similar concept)
  • It's somewhat close to "dog" (both are pets)
  • It's far from "airplane" (unrelated concept)
  • The difference between "cat" and "kitten" represents "young animal"

**Reality check:** In practice, dimensions don't correspond to clear-cut features. They work together in complex ways to capture semantic relationships. Modern embedding models use 256-3072 dimensions -- the more dimensions, the more subtle nuances can be captured.

**Current embedding models (2025/2026):**

| Model | Dimensions | Provider | Notes |
|-------|-----------|----------|-------|
| text-embedding-3-large | 3072 | OpenAI | Highest quality, adjustable dimensions |
| text-embedding-3-small | 1536 | OpenAI | Good balance of quality and cost |
| BGE-M3 | 1024 | Open-source | Strong multilingual, self-hostable |
| Cohere Embed v3 | 1024 | Cohere | Optimized for search and retrieval |
| Gemini Embedding | 768 | Google | Integrated with Google AI ecosystem |
Embeddings: Beyond RAG

Embeddings are not used exclusively for RAG or vector databases! They are fundamental to how all modern LLMs work internally. Even when using a “base model only” approach with no external retrieval, the model is constantly creating and manipulating embeddings internally.

Also, while we’ve focused on text embeddings, similar techniques are used for images, audio, and multimodal AI systems.


Cost Optimization Techniques

One of the most practical aspects of inference is managing costs. Token-based pricing means every interaction has a direct cost, and these costs can escalate quickly at scale.

The Cost Optimization Decision Tree

flowchart TD
    A[Incoming Request] --> B{Simple or Complex?}
    B -->|Simple| C{Cached?}
    B -->|Complex| D{Needs latest data?}

    C -->|Yes| E[Return cached response]
    C -->|No| F[Use small model<br/>GPT-4o mini / Haiku / Flash]

    D -->|Yes| G[RAG + capable model<br/>GPT-4o / Sonnet]
    D -->|No| H{Needs reasoning?}

    H -->|Yes| I[Reasoning model<br/>o3 / R1]
    H -->|No| G

    F --> J[Cache response]
    G --> J
    I --> J

    style E fill:#2d5016,color:#fff
    style F fill:#1a3a5c,color:#fff
    style G fill:#4a1a5c,color:#fff
    style I fill:#8b0000,color:#fff

Key Optimization Strategies

Use the right model for each task

Not every request needs GPT-4o or Claude Opus 4. Model selection by task complexity is the single biggest cost lever:

| Task Type | Recommended Tier | Cost Ratio |
|-----------|------------------|------------|
| Classification, routing | Small (GPT-4o mini, Haiku) | 1x |
| Summarization, Q&A | Mid (GPT-4o, Sonnet) | 10-20x |
| Complex analysis | Large (Claude Opus 4) | 30-50x |
| Mathematical reasoning | Reasoning (o3, R1) | 50-100x |

Rule of thumb: Start with the smallest model that produces acceptable results, then scale up only where quality demands it.

Don’t generate the same response twice

Caching saves time and money by reusing previous answers:

  • Exact match caching: If the same question is asked again, return the cached answer
  • Semantic caching: If a similar question is asked, return a cached answer that’s close enough
  • Prompt caching: Some providers (Anthropic, Google) offer server-side caching of long system prompts at reduced cost

Best for: FAQ-style applications, customer support, common queries where answers don’t change frequently.

Imagine a tour guide who memorizes answers to frequently asked questions. When asked “When was this building constructed?” they can answer immediately without looking it up again.
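An exact-match cache is only a few lines in practice. A minimal sketch, where the `answer` function stands in for a real (paid) LLM call:

```python
cache: dict[str, str] = {}
calls = 0  # counts how many "paid" LLM calls were made

def answer(question: str) -> str:
    """Pretend LLM call -- in a real system this would hit the API."""
    global calls
    calls += 1
    return f"Answer to: {question}"

def cached_answer(question: str) -> str:
    key = question.strip().lower()     # normalize for exact-match lookup
    if key not in cache:
        cache[key] = answer(question)  # pay for the LLM call only once
    return cache[key]

cached_answer("When was this building constructed?")
cached_answer("When was this building constructed?")  # served from cache
```

Semantic caching replaces the dictionary lookup with an embedding similarity search over cached questions, trading exactness for a higher hit rate.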

Process efficiently at scale

  • Batching: Group multiple requests together. Some providers offer batch APIs at 50% discount for non-time-sensitive processing. Think of baking multiple trays of cookies at once.

  • Request Throttling: Control request rate to prevent cost spikes and stay within budget. Like highway metering lights that prevent traffic jams.

  • Early Stopping: Stop generation once a sufficient answer is produced. A yes/no question doesn’t need a 500-word response. Like tasting soup – once it’s right, stop cooking.
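Grouping requests for a batch endpoint is a simple windowing pattern; a sketch (request strings are placeholders):

```python
def batch(requests: list[str], size: int) -> list[list[str]]:
    """Group requests into fixed-size batches for a batch endpoint."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]

# 10 requests in batches of 4 -> groups of 4, 4, and 2
batches = batch([f"req-{n}" for n in range(10)], size=4)
```

Each inner list would be submitted as one job to a provider's batch API, typically with results collected asynchronously.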

Minimize token usage without sacrificing quality

  • Concise system prompts: Every token in your system prompt is paid for on every request
  • Selective context: In RAG, retrieve only the most relevant chunks (3-5, not 10-20)
  • Output length limits: Set appropriate max_tokens to avoid verbose responses
  • Compression: Summarize conversation history instead of including full transcripts

Quick math: If your system prompt is 500 tokens and you make 100,000 requests/day:

  • 500 * 100,000 = 50M input tokens/day
  • At GPT-4o input pricing ($2.50/1M): $125/day just for the system prompt
  • Cutting that prompt to 200 tokens saves $75/day ($27,000/year)
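The quick math above is easy to verify in code; the price is the illustrative figure from the text, not a current list price:

```python
REQUESTS_PER_DAY = 100_000
PRICE_PER_M_INPUT = 2.50  # illustrative input price, $/1M tokens

def daily_prompt_cost(prompt_tokens: int) -> float:
    """Daily cost of the system prompt alone, in dollars."""
    tokens_per_day = prompt_tokens * REQUESTS_PER_DAY
    return tokens_per_day / 1_000_000 * PRICE_PER_M_INPUT

# 500-token prompt: $125/day; 200-token prompt: $50/day
savings = daily_prompt_cost(500) - daily_prompt_cost(200)
```

The same function makes it easy to test "what if" scenarios before touching a single prompt.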
Try This: Cost Estimation Exercise
**Exercise:** Estimate the monthly cost for this application:

**Scenario:** A customer support chatbot that handles 10,000 conversations per day. Each conversation averages 5 turns (5 user messages + 5 assistant responses). Average input per turn: 200 tokens (including context). Average output per turn: 150 tokens.

**Calculate for two approaches:**

**Approach A: GPT-4o for everything**
  • Input tokens per day: 10,000 conversations x 5 turns x 200 tokens = ?
  • Output tokens per day: 10,000 conversations x 5 turns x 150 tokens = ?
  • Monthly cost: ?

**Approach B: Hybrid (GPT-4o mini for simple queries, GPT-4o for complex)**
  • Assume 70% of queries are simple, 30% are complex
  • Calculate each tier separately
  • Monthly cost: ?

**The difference between these two approaches represents the value of intelligent model routing.**
Key Takeaways
  • Inference is the process where a trained LLM generates responses, using either the Completion API (single-turn) or Chat API (multi-turn conversations)
  • RAG (Retrieval-Augmented Generation) pipelines combine document ingestion, embedding, similarity search, and LLM generation to ground responses in external knowledge
  • Embeddings are learned vector representations that capture semantic relationships between concepts – fundamental to both RAG systems and LLM internal processing
  • Cost optimization strategies include model-task matching, caching, batching, and prompt optimization – the choice of model tier is the single biggest cost lever

Test Your Knowledge

Ready to test your understanding of inference techniques? Head to the quiz to check your knowledge.


Up next

Now that we understand how LLMs operate in practice – from API integration through RAG pipelines to cost optimization – we’re ready to explore the cutting edge: agentic AI systems that don’t just respond to prompts but take autonomous action in the real world.