{
  "name": "Lab 3: RAG Pipeline",
  "nodes": [
    {
      "parameters": {},
      "id": "trigger-003",
      "name": "Start Lab",
      "type": "n8n-nodes-base.manualTrigger",
      "typeVersion": 1,
      "position": [100, 300],
      "notes": "Click 'Execute Workflow' to run the simplified RAG pipeline. This lab demonstrates the Retrieve-then-Generate pattern."
    },
    {
      "parameters": {
        "values": {
          "string": [
            {
              "name": "user_question",
              "value": "What is the company's policy on remote work?"
            },
            {
              "name": "doc_1",
              "value": "Company Policy: Remote Work\nAll full-time employees are eligible for hybrid remote work arrangements. Employees may work remotely up to 3 days per week with manager approval. Remote work days must be scheduled in advance using the HR portal. A stable internet connection and a dedicated workspace are required. Fully remote positions are available for specific roles as designated by department heads."
            },
            {
              "name": "doc_2",
              "value": "Company Policy: Security Requirements\nAll employees must complete annual cybersecurity awareness training. Devices used for work must have endpoint protection software installed. Multi-factor authentication (MFA) is required for all company systems. Employees must report suspected phishing emails to security@company.com within 24 hours. Remote workers must use the company VPN when accessing internal resources."
            },
            {
              "name": "doc_3",
              "value": "Company Policy: Benefits Overview\nFull-time employees receive health insurance, dental coverage, and vision benefits starting on their first day. The company matches 401(k) contributions up to 4% of salary. Employees receive 20 days of PTO annually plus 10 company holidays. Professional development budget of $2,000 per year is available for courses, conferences, and certifications."
            },
            {
              "name": "doc_4",
              "value": "Company Policy: AI Usage Guidelines\nEmployees may use approved AI tools (ChatGPT Enterprise, GitHub Copilot, internal AI assistant) for work tasks. Sensitive data including customer PII, financial records, and proprietary code must never be entered into external AI tools. AI-generated code must be reviewed by a human before deployment. All AI tool usage is logged for compliance purposes."
            }
          ]
        },
        "options": {}
      },
      "id": "set-docs-003",
      "name": "Document Store (Sample Data)",
      "type": "n8n-nodes-base.set",
      "typeVersion": 1,
      "position": [320, 300],
      "notes": "These are sample company policy documents simulating a document store.\n\nIn a real RAG system, these would be stored as embeddings in a vector database (Pinecone, Chroma, pgvector). Here we use plain text for simplicity.\n\nYou can modify the user_question to test different queries against these documents."
    },
    {
      "parameters": {
        "functionCode": "// STUDENT TASK: Build the query construction and mock retrieval logic\n//\n// This function simulates the retrieval step of a RAG pipeline.\n// In a real system, this would:\n//   1. Convert the query to an embedding\n//   2. Search a vector database for similar document embeddings\n//   3. Return the top-k most relevant chunks\n//\n// For this lab, implement simple keyword-based retrieval:\n//   1. Extract key terms from the user's question\n//   2. Score each document based on keyword matches\n//   3. Return the top 2 most relevant documents\n\nconst question = $input.item.json.user_question.toLowerCase();\nconst docs = [\n  { id: 'doc_1', content: $input.item.json.doc_1, title: 'Remote Work Policy' },\n  { id: 'doc_2', content: $input.item.json.doc_2, title: 'Security Requirements' },\n  { id: 'doc_3', content: $input.item.json.doc_3, title: 'Benefits Overview' },\n  { id: 'doc_4', content: $input.item.json.doc_4, title: 'AI Usage Guidelines' }\n];\n\n// STUDENT TASK: Replace the scoring logic below.\n// Currently every document gets the same placeholder score, so the first two\n// documents are always returned. Make it smarter by:\n// 1. Extracting keywords from the question\n// 2. Scoring documents based on keyword overlap\n// 3. Returning only the top 2 most relevant documents\n\n// Hint: Split the question into words, remove common words (the, is, what, etc.),\n// then count how many of those keywords appear in each document's content.\n\nconst scoredDocs = docs.map(doc => {\n  // TODO: Implement your scoring logic here\n  const score = 1; // Replace with actual relevance score\n  return { ...doc, score };\n});\n\n// Sort by score descending and take top 2\nconst topDocs = scoredDocs\n  .sort((a, b) => b.score - a.score)\n  .slice(0, 2);\n\nreturn [{\n  json: {\n    user_question: $input.item.json.user_question,\n    retrieved_context: topDocs.map(d => d.content).join('\\n\\n---\\n\\n'),\n    retrieved_titles: topDocs.map(d => d.title),\n    retrieval_scores: topDocs.map(d => ({ title: d.title, score: d.score }))\n  }\n}];"
      },
      "id": "func-retrieve-003",
      "name": "Mock Retrieval (Similarity Search)",
      "type": "n8n-nodes-base.function",
      "typeVersion": 1,
      "position": [540, 300],
      "notes": "STUDENT TASK: Implement the keyword-based retrieval logic.\n\nThis simulates vector similarity search. In production, you'd:\n- Use an embedding model to convert the query to a vector\n- Compute cosine similarity against document vectors in a vector DB\n- Return the top-k results\n\nHere, implement a simple keyword matching approach to understand the retrieval concept."
    },
    {
      "parameters": {
        "values": {
          "string": [
            {
              "name": "rag_prompt",
              "value": "STUDENT TASK: Build your RAG prompt template below. Replace this entire text.\n\nYour prompt should:\n1. Set a role/context for the AI (e.g., 'You are a helpful company policy assistant')\n2. Include the retrieved context: {{ $json.retrieved_context }}\n3. Include the user's question: {{ $json.user_question }}\n4. Instruct the AI to ONLY answer based on the provided context\n5. Tell the AI what to do if the answer is NOT in the context\n\nExample structure:\n---\nYou are [role]. Use the following context to answer the user's question.\nIf the answer is not in the provided context, say so.\n\nContext:\n[retrieved documents]\n\nQuestion: [user question]\n\nAnswer:"
            }
          ]
        },
        "options": {}
      },
      "id": "set-ragprompt-003",
      "name": "Build RAG Prompt",
      "type": "n8n-nodes-base.set",
      "typeVersion": 1,
      "position": [760, 300],
      "notes": "STUDENT TASK: Construct the RAG prompt template.\n\nThis is where you combine the retrieved context with the user's question into a prompt that the LLM can answer accurately.\n\nKey principles:\n- Clearly separate context from question\n- Instruct the LLM to use ONLY the provided context\n- Specify what to do when context doesn't contain the answer\n- Use the expressions {{ $json.retrieved_context }} and {{ $json.user_question }}"
    },
    {
      "parameters": {
        "method": "POST",
        "url": "https://api.openai.com/v1/chat/completions",
        "authentication": "genericCredentialType",
        "genericAuthType": "httpHeaderAuth",
        "sendBody": true,
        "specifyBody": "json",
        "jsonBody": "={{ JSON.stringify({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: $json.rag_prompt }], temperature: 0.2 }) }}",
        "options": {}
      },
      "id": "http-generate-003",
      "name": "LLM Generation (with Context)",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 4,
      "position": [980, 300],
      "notes": "This sends your RAG prompt (with retrieved context) to the LLM.\n\nLow temperature (0.2) is used because we want factual answers grounded in the documents, not creative interpretation.\n\nCompare this output with what the LLM would say WITHOUT the context -- that's the power of RAG."
    },
    {
      "parameters": {
        "values": {
          "string": [
            {
              "name": "answer",
              "value": "={{ $json.choices[0].message.content }}"
            },
            {
              "name": "sources",
              "value": "={{ $('Mock Retrieval (Similarity Search)').item.json.retrieved_titles.join(', ') }}"
            }
          ]
        },
        "options": {}
      },
      "id": "set-output-003",
      "name": "Format RAG Output",
      "type": "n8n-nodes-base.set",
      "typeVersion": 1,
      "position": [1200, 300],
      "notes": "Displays the final answer along with the source documents that were retrieved for it.\n\nThis transparency about sources is a key advantage of RAG -- you can trace answers back to specific documents, unlike a base model's built-in knowledge, which is opaque."
    }
  ],
  "connections": {
    "Start Lab": {
      "main": [
        [
          {
            "node": "Document Store (Sample Data)",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Document Store (Sample Data)": {
      "main": [
        [
          {
            "node": "Mock Retrieval (Similarity Search)",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Mock Retrieval (Similarity Search)": {
      "main": [
        [
          {
            "node": "Build RAG Prompt",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Build RAG Prompt": {
      "main": [
        [
          {
            "node": "LLM Generation (with Context)",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "LLM Generation (with Context)": {
      "main": [
        [
          {
            "node": "Format RAG Output",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  },
  "settings": {
    "executionOrder": "v1"
  },
  "meta": {
    "instanceId": "lab-template"
  }
}
