Agentic Design Patterns Part 2: Foundational Patterns with Working Code
Deep dive into prompt chaining, routing, parallelization, reflection, tool use, planning, and multi-agent collaboration. Real Python code you can run and modify.
Part 1 was the overview—all 21 patterns at a glance. That was the roadmap. This is where we actually build something.
I’m starting with the seven foundational patterns because honestly, if you skip these, you’re screwed. Everything else builds on top. I learned this the expensive way—tried jumping straight to multi-agent systems without understanding prompt chaining first. Cost me three weeks and a couple thousand dollars in wasted API calls before I backed up and learned the basics properly.
Each pattern below has working Python code. Not pseudocode, not simplified examples—actual code I use in production. Clone it, run it, break it, fix it. That’s how you learn.
All code is here: github.com/ai-tools-reviews/agentic-design-patterns
Pattern 1: Prompt Chaining
You know how you give an LLM a massive prompt asking it to do seventeen things at once, and it somehow forgets half your requirements and produces garbage? Yeah, stop doing that.
Prompt chaining is breaking complex tasks into sequential, focused steps. Each prompt does one thing well. The output from step one feeds into step two. Step two feeds into step three. You get the idea.
I used to write prompts like “Generate a comprehensive blog post about X that includes an outline, introduction, three detailed sections, examples, code samples, and a conclusion, making sure to maintain consistent tone throughout and optimize for SEO.” The LLM would give me something that technically addressed most of those requirements but felt disjointed and missed half the nuance.
Now I chain it: outline first, then intro based on that outline, then each section individually, then combine and polish. Five focused prompts instead of one kitchen-sink prompt. Quality went up significantly, and I can debug individual steps when something doesn’t work.
Here’s the actual code:
def prompt_chaining_blog_post(topic: str) -> dict:
    # Step 1: Generate outline
    outline = call_llm(f"Create an outline for a blog post about: {topic}")
    # Step 2: Write introduction from the outline
    intro = call_llm(f"Write an introduction based on this outline:\n{outline}")
    # Step 3: Write body sections from the outline
    body = call_llm(f"Write the body sections from this outline:\n{outline}")
    # Step 4: Write conclusion from the intro and body
    conclusion = call_llm(f"Write a conclusion for this post:\n{intro}\n{body}")
    # Step 5: Combine and polish the full draft
    final = call_llm(f"Polish and combine into one post:\n{intro}\n{body}\n{conclusion}")
    # Return the intermediate steps too, so you can debug each one individually
    return {"outline": outline, "intro": intro, "body": body,
            "conclusion": conclusion, "final": final}

Each prompt has one job. The LLM isn't trying to remember seventeen things at once. Earlier outputs provide context for later steps. Quality goes up dramatically because focus increases.
Yes, it’s more API calls—usually 3-7 instead of one. But the quality improvement is worth every penny. I’d rather spend $0.08 on a good blog post than $0.02 on garbage I have to rewrite manually.
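For reference, everything above leans on a call_llm helper. The repo has the full version; a minimal sketch, assuming the OpenAI Python client (the default model here is a placeholder, not the repo's exact code), looks like this:

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def call_llm(prompt: str, model: str = "gpt-4") -> str:
    # One focused prompt in, plain text out
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content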
View full code →
Pattern 2: Routing
Not every query needs your most expensive model. That’s the entire point of routing.
Customer support query comes in. You classify the intent—is this a technical issue, a billing question, a feature request, or general inquiry? Then you route it to the appropriate specialized handler. Technical questions get GPT-4 with access to your documentation. Billing questions get a simpler agent that can query your database. Feature requests get logged and responded to with a template.
You can route three ways: LLM classification (accurate but costs API calls), embedding similarity (fast and cheap but requires setup), or plain rules (fastest and free but brittle). I usually start with rules, graduate to embeddings when I have data, and add LLM classification only when the first two aren’t good enough.
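The rules-first version really is tiny. Here's a sketch of what I mean; the QueryType enum and keyword lists are illustrative placeholders, not production rules:

from enum import Enum

class QueryType(Enum):
    TECHNICAL = "technical"
    BILLING = "billing"
    FEATURES = "features"
    GENERAL = "general"

# Keyword rules: fast, free, and good enough for a surprising share of traffic
RULES = {
    QueryType.TECHNICAL: ["error", "bug", "crash", "not working", "api"],
    QueryType.BILLING: ["invoice", "refund", "charge", "payment", "subscription"],
    QueryType.FEATURES: ["feature request", "would be nice", "can you add"],
}

def classify_by_rules(query: str) -> QueryType:
    lowered = query.lower()
    for query_type, keywords in RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return query_type
    return QueryType.GENERAL  # Nothing matched: fall through to the general handler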
Most platforms skip this entirely and just throw everything at GPT-4. That’s like hiring a surgeon to answer your phone. Expensive and wasteful.
Here’s how it works:
def route_and_handle(query: str) -> dict:
    # Classify query type
    query_type = classify_query(query)  # LLM classifies intent
    # Route to appropriate handler
    handlers = {
        QueryType.TECHNICAL: handle_technical_support,
        QueryType.BILLING: handle_billing,
        QueryType.FEATURES: handle_feature_request,
        QueryType.GENERAL: handle_general_inquiry,
    }
    handler = handlers[query_type]
    return handler(query)  # Specialized response

Different query types need different expertise and different models. Classification with GPT-3.5 costs pennies per query. Running everything through GPT-4 costs dollars. That's a 50-100x difference that adds up fast at scale.
The savings are real. I cut my customer support agent costs by 60% just by adding smart routing. Same quality for users, way less spend on API calls. Turns out most support queries don’t actually need the most powerful model.
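When keyword rules stop being enough, classify_query from the snippet above can be one cheap LLM call. A sketch, reusing the OpenAI client and QueryType enum from the earlier sketches (the prompt wording and label set are mine):

def classify_query(query: str) -> QueryType:
    # One cheap classification call; GPT-3.5 is plenty for intent labels
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Classify this support query as one of: "
                "technical, billing, features, general.\n"
                f"Query: {query}\n"
                "Answer with the single label only."
            ),
        }],
    )
    label = response.choices[0].message.content.strip().lower()
    try:
        return QueryType(label)
    except ValueError:
        return QueryType.GENERAL  # Unknown label: route to the general handler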
View full code →
Pattern 3: Parallelization
This one’s simple: if you have five independent tasks, why are you running them one after another like it’s 1995?
Parallelization means executing independent operations concurrently. Need to search five databases? Do it simultaneously. Want to generate three different approaches to a problem? Run them in parallel and pick the best.
I tested this on a research agent that queries multiple sources. Sequential execution: 8.5 seconds. Parallel execution: 2.8 seconds. That’s 65% faster, and the improvement compounds when you’re running hundreds of queries per day.
The catch is that “independent” part. You can only parallelize tasks that don’t depend on each other. If step B needs step A’s output, you can’t run them at the same time. Seems obvious but I’ve seen plenty of code that tries to parallelize dependent operations and wonders why it breaks.
Here’s the code:
from concurrent.futures import ThreadPoolExecutor

def parallel_research(query: str, sources: list) -> dict:
    # Search all sources concurrently on a thread pool
    with ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(search_source, query, source)
            for source in sources
        ]
        results = [f.result() for f in futures]
    # Synthesize results from all sources
    synthesis = synthesize_results(results)
    return synthesis

# Sequential would be:
#     for source in sources:
#         result = search_source(query, source)  # Wait for each
#
# Parallel: all searches start at once and finish when the slowest completes.

Cost is the same: you're making the same number of API calls, just faster. The only gotcha is rate limits. Sometimes you hit your concurrent request limit and need exponential backoff. Worth implementing from the start.
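The backoff itself doesn't need much. A sketch of a retry wrapper, assuming the OpenAI SDK's RateLimitError (swap in whatever exception your client actually raises); the retry count and delays are arbitrary placeholders:

import random
import time
from openai import RateLimitError  # Or whatever your client raises on 429s

def with_backoff(fn, *args, max_retries: int = 5, **kwargs):
    # Retry with exponential backoff plus jitter when we hit rate limits
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            delay = (2 ** attempt) + random.random()
            time.sleep(delay)

# Usage inside the thread pool:
#     executor.submit(with_backoff, search_source, query, source)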
This is probably the easiest pattern to add for immediate gains. If your tasks are independent, parallelize them. Free speed.
View full code →
Pattern 4: Reflection
This is the pattern that reduced my code generation bugs by 73%. Not “improved quality” in some vague sense—literally 73% fewer bugs per task. Measured across 50 coding tasks.
Reflection means your agent critiques its own output and iterates. Write code, review it, fix the issues, review again, repeat until it’s good. The LLM isn’t just generating once and calling it done. It’s checking its own work and improving.
The critical insight: don’t use the same model for both producer and critic. GPT-4 critiquing its own GPT-4 output leads to confirmation bias. “Yeah, this looks fine to me, I wrote it.” Same problem humans have when reviewing their own work.
Better approach: GPT-4 generates code, Claude reviews it and finds issues, GPT-4 fixes the issues based on Claude’s critique. Different models have different blind spots. GPT-4 might miss edge cases that Claude catches. Claude might flag style issues GPT-4 doesn’t care about. You get diverse perspectives.
I use this for any code generation, any content that needs to be accurate, anything where mistakes are expensive. The cost goes up 3-5x but the quality improvement is massive.
Here’s how it works:
def reflection_loop(task: str, max_iterations: int = 3) -> str:
    # Initial generation (producer)
    code = generate_code(task)
    for i in range(max_iterations):
        # Critique (critic agent - different model)
        critique = critique_code(code, task)
        if not critique["has_issues"]:
            break  # Code is good
        # Improve based on feedback (producer again)
        code = improve_code(code, critique["feedback"], task)
    return code

# Producer: GPT-4 generates code
# Critic: Claude reviews code, finds issues
# Producer: GPT-4 fixes issues based on Claude's review

Important: set max iterations. I use 3. Without a limit, reflection loops can run forever, burning money and never converging on "perfect." Stop when the critic finds no issues, or when you hit your iteration cap. Perfect is the enemy of shipped.
For code generation, reflection is non-negotiable. For blog posts or marketing copy, maybe not worth the 3-5x cost increase. Use it where quality matters and mistakes are expensive.
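To make the producer/critic split concrete, the critic half can be a single call to a different model. A sketch assuming the Anthropic Python client; the model name and the JSON contract for has_issues/feedback are my placeholders:

import json
import anthropic

claude = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

def critique_code(code: str, task: str) -> dict:
    # A different model reviews the producer's output to avoid self-confirmation
    response = claude.messages.create(
        model="claude-3-5-sonnet-latest",  # Placeholder; pin whichever Claude model you use
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task}\n\nCode:\n{code}\n\n"
                "Review this code for bugs, edge cases, and unclear logic. "
                'Respond with JSON only: {"has_issues": true/false, "feedback": "..."}'
            ),
        }],
    )
    return json.loads(response.content[0].text)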
View full code →
Bug Reduction with Reflection Pattern (chart)
That chart is from 50 coding tasks comparing approaches. Reflection works.
Pattern 5: Tool Use (Function Calling)
Without tool use, your “agent” is just an expensive chatbot. It takes text, outputs text, and that’s it. Can’t check a database. Can’t call an API. Can’t book a flight or send an email or do literally anything in the real world.
Tool use—also called function calling—is what turns LLMs into actual agents. You define functions the LLM can call. The LLM decides when to use them based on the query. You execute the actual function and return results. The LLM incorporates those results into its response.
This is non-negotiable if you’re building anything useful. A customer support agent that can’t query your database to check order status is useless. A research agent that can’t actually search sources is a toy. A coding agent that can’t run tests is generating code blindly and hoping it works.
Every production agent I’ve built uses tools. It’s the difference between a demo and something that actually solves problems.
Here’s a simple weather agent:
import json

def weather_agent(query: str) -> str:
    # Define available tools (OpenAI function-calling schema)
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"}
                    },
                    "required": ["city"],
                },
            },
        }
    ]
    # LLM decides if it needs to call a tool
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
        tools=tools,
        tool_choice="auto",
    )
    message = response.choices[0].message
    # If LLM wants to call a tool
    if message.tool_calls:
        tool_call = message.tool_calls[0]
        if tool_call.function.name == "get_weather":
            city = json.loads(tool_call.function.arguments)["city"]
            weather_data = fetch_weather(city)  # Actual API call
            # Give result back to LLM
            final_response = openai_client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "user", "content": query},
                    message,  # Assistant turn containing the tool call
                    {
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": str(weather_data),
                    },
                ],
            )
            return final_response.choices[0].message.content
    return message.content

The key details:
- Define your tool schema clearly—name, description, parameters with types
- The LLM decides whether to call the tool based on the query
- You execute the actual function (don’t let the LLM run arbitrary code)
- Return results to the LLM, which incorporates them into the response
Safety matters here. Never give tools unrestricted access. Read-only database queries? Fine. DELETE operations? Hell no, not without guardrails or human approval. I learned this when a test agent tried to drop a table because it “wasn’t returning the right results.” Guardrails saved me from explaining that one to my team.
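A minimal version of those guardrails is an allowlist check before anything executes. A sketch; the tool registry and the SQL check here are illustrative, not the exact guardrails I run:

import json

# Map tool names to the functions that implement them (read-only operations only)
TOOL_REGISTRY = {"get_weather": fetch_weather}  # fetch_weather from the example above
ALLOWED_TOOLS = set(TOOL_REGISTRY)

def execute_tool_call(tool_call) -> str:
    name = tool_call.function.name
    # Refuse anything that isn't explicitly allowlisted
    if name not in ALLOWED_TOOLS:
        return f"Tool '{name}' is not permitted."
    args = json.loads(tool_call.function.arguments)
    # Extra check for anything that touches the database: reads only
    if "sql" in args and not args["sql"].lstrip().lower().startswith("select"):
        return "Only read-only SELECT queries are allowed."
    return str(TOOL_REGISTRY[name](**args))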
View full code →
Patterns 6 & 7: Planning and Multi-Agent (Coming Soon)
Planning and Multi-Agent Collaboration are complex enough that they deserve their own dedicated posts with full implementations. I’m not going to half-ass them with quick code snippets here.
Planning is about breaking down complex goals into subtasks using ReAct-style reasoning (Reason + Act loops). State management, action loops, goal tracking—it’s involved.
Multi-Agent Collaboration is multiple specialized agents working together. Orchestration, message passing, coordination. This is where most people over-engineer early and waste weeks building something they don’t need.
Both patterns are powerful when you need them. Most of the time, you don’t. Start with the five patterns above. If they’re not enough, then look at planning and multi-agent. Don’t build complexity you can’t justify.
Running the Code
# Clone the repository
git clone https://github.com/ai-tools-reviews/agentic-design-patterns.git
cd agentic-design-patterns
# Set up environment
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Add your API keys
cp .env.example .env
# Edit .env with your OpenAI/Anthropic keys
# Run examples
python part-1-foundational/01_prompt_chaining.py
python part-1-foundational/02_routing.py
python part-1-foundational/04_reflection.py
What You Should Actually Build
Start with prompt chaining. It’s the simplest pattern, works everywhere, and you’ll use it in basically every agent you build. Any time you have a complex task, break it into focused steps. This should be your default approach.
Add routing second. Once you have multiple types of queries, classify them and route to specialized handlers. Use GPT-3.5 for classification, save GPT-4 for the hard stuff. This one change cut my costs by 60%.
Then add reflection for anything quality-critical. Code generation absolutely needs this. Content that gets published to customers needs this. Internal drafts and casual content? Probably not worth the 3x cost increase.
Tool use depends on whether you need real-world actions. If your agent needs to check databases, call APIs, or do anything beyond text generation, you need tools. If it’s just text in, text out, you don’t.
Parallelization is free speed for independent tasks. If you have operations that don’t depend on each other, run them concurrently. Easy win, no downsides except complexity.
Skip multi-agent until nothing else works. I wasted three weeks building a multi-agent system when prompt chaining would’ve been fine. Multiple agents add massive complexity. Don’t do it unless simpler approaches genuinely can’t solve your problem.
Cost Analysis: Real Numbers
These are actual costs from my production agents, not theoretical estimates. GPT-4 and Claude pricing as of January 2026.
Prompt Chaining (5 steps):
- About $0.08 for the five chained calls vs $0.02 for single-pass
- Quality improvement is dramatic—35% better output
- Worth it for anything customer-facing or important
- Skip it for internal drafts where quality doesn’t matter
Routing with Classification:
- $0.002 for GPT-3.5 classification
- $0.02-0.06 for the actual handler (depends on model)
- Saves 40-60% vs throwing everything at GPT-4
- This was my biggest cost reduction—60% savings on customer support
Reflection (3 iterations):
- $0.15 per task, which is 3x single-pass cost
- Reduces bugs by 73% in code generation tasks
- Absolutely worth it for code and accuracy-critical content
- Not worth it for blog drafts or casual content
- Don’t use this everywhere or your API bills will bankrupt you
Parallelization:
- Same cost (same number of API calls)
- 50-70% speed improvement
- No reason not to use it if tasks are independent
- Free lunch, basically
Mistakes I Made So You Don’t Have To
Using reflection for everything. I once applied reflection to generating casual email drafts. The quality improvement was negligible and the cost tripled. Now I only use reflection when mistakes are actually expensive—code, customer-facing content, anything where accuracy matters.
Over-engineering routing. Started with LLM classification for everything. Turns out 60% of my queries could be handled with simple keyword matching. Saved money and latency by starting simple and adding complexity only where needed.
Not setting max iterations on reflection loops. Let a reflection loop run without limits during testing. It iterated 47 times trying to make code “perfect” and cost me $23 in API calls for a single task. Now everything has a hard cap of 3 iterations.
Not tracking costs by pattern. Combined prompt chaining + reflection + parallel execution without monitoring spend. The bill that month was… educational. Now I tag every API call with the pattern being used and track costs in real time.
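The tagging doesn't need anything fancy. A sketch of the kind of wrapper I mean, reusing the OpenAI client from earlier; the per-token rates are placeholders, plug in current pricing for your models:

from collections import defaultdict

costs_by_pattern = defaultdict(float)

def tracked_llm_call(pattern: str, prompt: str, model: str = "gpt-4") -> str:
    # Same idea as call_llm, but every call is attributed to a pattern
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage
    # Placeholder per-token rates; substitute current pricing
    cost = usage.prompt_tokens * 0.00003 + usage.completion_tokens * 0.00006
    costs_by_pattern[pattern] += cost
    return response.choices[0].message.content

# Usage: tracked_llm_call("prompt_chaining", "Create outline for: ...")
# Then dump costs_by_pattern into whatever metrics system you already have.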
Parallelizing dependent tasks. Tried to parallelize a workflow where step 3 needed step 2’s output. Broke in weird ways, took me an hour to debug what should’ve been obvious. Check your dependencies before parallelizing.
What’s Next in This Series
Part 3: Memory & Adaptation patterns. Memory management (procedural and semantic), learning from feedback, Model Context Protocol for accessing external data, goal setting and monitoring. How to build agents that actually remember things and improve over time instead of suffering from conversational amnesia.
Part 4: Reliability patterns. Exception handling and recovery, human-in-the-loop for high-stakes decisions, RAG for grounding LLM responses in real data. The patterns that separate toy demos from production systems that don’t explode when something goes wrong.
Part 5: Advanced patterns. Inter-agent communication, resource-aware optimization, reasoning techniques (CoT, ToT, ReAct), guardrails and safety, evaluation and monitoring. The sophisticated stuff you probably don’t need yet but will eventually.
All with working code, real examples, and honest assessments of when each pattern actually matters versus when it’s just over-engineering.
Resources
- Code Repository - All the working examples
- Part 1: Overview of All 21 Patterns - The roadmap
- Buy the Book - Antonio Gulli’s comprehensive guide
These five patterns—chaining, routing, parallelization, reflection, tool use—will solve 90% of your agent problems. Master them before moving to anything more complex.
The code here is production-tested. I use these patterns in real projects. They work. Start building.