The AI Engineering Handbook

1. The 3-Layer AI Architecture

The Problem

Most AI apps are exactly one thing: a frontend, an API call, and a model. When something breaks—and it will—the entire product dies.

Real production systems have layers. Each one solves a specific problem before requests ever reach your model.

The Three Layers Explained

Layer 1: Gateway Layer (Protect)

This is your security perimeter. Every request hits here first.

Rate limiting: Stop abusive requests before they waste tokens
Authentication: Verify the request is actually from your user
Input validation: Reject obviously malformed input
Request deduplication: If the same request came in twice (network retry), don't charge twice

Think of this as the bouncer at the door. Bad requests get rejected here, never reaching your expensive model.

Layer 2: Processing Layer (Think)

This is where your business logic lives. The model doesn't see raw user input—it sees carefully constructed context.

Prompt construction: Build the actual prompt from templates and user input
Context injection: Add relevant history, system rules, retrieved documents
Output parsing: Extract structured data from the model's response
Caching: For repeated queries, return cached responses without hitting the model
Retry logic: If the model returns malformed output, retry automatically

The processing layer is where you control quality. You construct smarter prompts than users could write. You parse outputs into formats your app can use. You cache aggressively.

Layer 3: Intelligence Layer (Generate)

The model call itself. Here's the key: it should be swappable.

Today's primary: Claude Sonnet
Today's backup: GPT-4o mini
Tomorrow: Whatever's better
Your app doesn't care. It just sends structured input and gets structured output.

This decoupling means you can upgrade models, add fallbacks, or switch providers without touching your application logic.

The Pattern in Code

User Request
  ↓
[Gateway Layer: Rate limit, auth, validate]
  ↓
[Processing Layer: Build prompt, inject context, check cache]
  ↓
[If cached → return cached response]
  ↓
[If not cached → Intelligence Layer: Call model]
  ↓
[Parse output, cache result, return to user]
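A minimal sketch of that flow, assuming an in-memory cache and rate limiter. Every name here (`gateway`, `build_prompt`, `call_model`, `handle_request`) and both limits are hypothetical, authentication is left out for brevity, and `call_model` is a stub standing in for whichever provider sits in the intelligence layer.

```python
import hashlib
import time
from collections import defaultdict, deque

# Hypothetical limits, chosen only for illustration.
MAX_REQUESTS_PER_MINUTE = 5
MAX_INPUT_CHARS = 2000

_request_log = defaultdict(deque)   # user_id -> timestamps of recent requests
_cache = {}                         # prompt hash -> cached response


def gateway(user_id, text, now=None):
    """Layer 1: rate limit and validate. Raises ValueError on rejection."""
    now = time.time() if now is None else now
    recent = _request_log[user_id]
    while recent and now - recent[0] > 60:
        recent.popleft()            # drop timestamps older than one minute
    if len(recent) >= MAX_REQUESTS_PER_MINUTE:
        raise ValueError("rate limit exceeded")
    if not text.strip() or len(text) > MAX_INPUT_CHARS:
        raise ValueError("invalid input")
    recent.append(now)


def build_prompt(text):
    """Layer 2: wrap raw input in a template the model will actually see."""
    return f"System: answer briefly.\nUser: {text.strip()}"


def call_model(prompt):
    """Layer 3 stub: replace with a real provider call."""
    return f"(model answer to {len(prompt)} chars of prompt)"


def handle_request(user_id, text, model=call_model):
    gateway(user_id, text)
    prompt = build_prompt(text)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:               # cache hit: the model is never called
        return _cache[key]
    response = model(prompt)
    _cache[key] = response
    return response


print(handle_request("user-1", "What is a circuit breaker?"))
```

Because the model is passed in as a plain callable, swapping providers means changing one argument, not the pipeline.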

Why This Matters

Resilience: If the model times out, your gateway still rejected bad requests. Your processing layer can return a graceful error instead of crashing.
Cost control: Caching and deduplication happen before model calls. You literally don't pay for repeated queries.
Quality: You control exactly what the model sees and how you use what it returns. You're not building a thin wrapper.
Observability: Each layer is a checkpoint. You can see where requests are failing.

2. Temperature: What the Number Actually Means

The Cargo Cult Problem

Every tutorial says "set temperature to 0.7." Nobody can explain why. It became a default through repetition, not reasoning.

Temperature controls randomness. That's it. Understanding the number means understanding what you actually want your model to do.

What Temperature Does

Temperature = 0 (Deterministic)

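The model picks the single most likely next token, every time. Same input, nearly always the same output. Most hosted APIs do not promise bit-for-bit determinism even at 0, so treat it as highly repeatable rather than guaranteed.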

Temperature = 1.0 (Baseline randomness)

The model samples from the full probability distribution. Lots of variation, but still statistically weighted toward likely tokens.

Temperature > 1.0 (Chaotic)

The model samples from a flattened distribution. Even unlikely tokens become more common. Useful for brainstorming, but risky for anything that needs consistency.
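To see what the number does mechanically, here is a small sketch of temperature applied to a softmax. The three logits are made up, and treating temperature 0 as "pick the top token" is a convention assumed for this example, since dividing by zero is undefined.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the top token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability moves from the top token toward the unlikely ones. That is the whole effect.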

When to Use Each

Use 0 for:

Data extraction: "Extract all email addresses from this text" — you need exact, repeatable results
Classification: "Is this review positive or negative?" — ambiguity costs accuracy
JSON generation: "Output this as JSON" — malformed output breaks your app
Any structured output: API responses, code generation where syntax matters

Use 0.3–0.5 for:

Summaries: A little variation makes summaries feel less robotic, but they should still capture the same points
Rewrites: Different phrasings of the same core message
Code generation: Most code variations are functionally equivalent; slight variation is fine
Translations: Minor stylistic differences don't break meaning

Use 0.7–1.0 for:

Creative writing: You want the model to surprise you
Brainstorming: Generate many different ideas, not the obvious one
Open-ended ideation: The point is variation
Content where "sameness" is the problem

The Real Lesson

Temperature is not a magic number. It's a control. Pick it based on what you need:

Do you need exact, repeatable output? Use 0.
Do you need some variation but mostly reliable? Use 0.3–0.5.
Do you need the model to explore possibilities? Use 0.7–1.0.

Don't inherit settings from blog posts. Think about your actual use case.

3. The AI Startup Stack Under $100/month

The Founder's Mistake

Most teams overbuild infrastructure before they have 10 users. They spin up Kubernetes, hire DevOps, add monitoring, set up CI/CD pipelines. Then they realize they built infrastructure, not a product.

Here's what you actually need to ship and validate an AI product.

The Stack

Compute + Frontend: Vercel ($0–$20/month)

Deploys Next.js or React in one command
Free tier covers serious traffic (500GB bandwidth, 100GB function execution)
Zero DevOps overhead
Perfect for MVP

When to upgrade: When you hit the edge of the free tier limits. Vercel's Pro plan ($20/month) raises those limits substantially.

Database + Auth + Storage: Supabase ($0–$50/month)

PostgreSQL with row-level security built in
Auth (social login, passwordless, 2FA)
File storage
All in one dashboard

The free tier covers most MVPs. You get 500MB storage, 5GB egress per month, and 50,000 monthly active users. Most early products never hit these limits.

When to upgrade: When you exceed the free tier limits, the Pro plan is $25/month.

AI Layer: Claude API or OpenAI ($0–$100/month)

Claude Haiku: Fastest, cheapest model. Perfect for simple tasks, summaries, basic classification.
Claude Sonnet: Best value. Handles complex reasoning, code, structured data.
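Add prompt caching: Cache the system prompt and large context. Cached tokens are billed at a steep discount on later requests instead of full price, so you mostly pay for the new tokens. This alone can cut costs by 50–80% for repetitive queries.

Example: If your app always includes a 5000-token instruction document, caching means you pay full price for it only when the cache is written or has expired, and a reduced rate on every hit in between. Cache lifetimes are short (minutes by default on some providers, refreshed on use), so the savings grow with request volume.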

Observability: Helicone or LangSmith (Free tier)

See what your AI is doing: request latency, cost per request, error rates
Identify which prompts are slow or expensive
Catch hallucinations in production before users see them

Free tier logs ~1000 requests/month. Good enough to start. When you exceed it, Helicone Pro is $7/month. LangSmith Pro is $20/month.

Total Cost: ~$30/month

Vercel: $0 (free tier)
Supabase: $0 (free tier)
Claude API: $30/month (rough estimate; at Sonnet rates this covers a few million tokens)
Helicone: $0 (free tier)

This scales you to thousands of users.

When Each Component Upgrades

Component	Trigger	Cost
Vercel	500GB bandwidth exceeded	$20/month
Supabase	500MB storage exceeded	$25/month
Claude API	Usage increases	Variable, usually $50–$200/month
Helicone	1000 requests/month exceeded	$7/month

You don't need to upgrade everything at once. You upgrade as you hit the ceiling on each piece.

4. Why Your AI App Needs a Circuit Breaker

What Happened Last Time Your Provider Went Down

OpenAI went offline. Every app calling their API with no fallback died. Users saw blank screens. Support was flooded. Revenue stopped.

The apps that survived? They had circuit breakers.

What a Circuit Breaker Does

A circuit breaker is a pattern borrowed from distributed systems. It wraps calls to a service and monitors for failures.

Three States:

Closed (working normally): Requests flow through. Everything is fine.
Open (failures detected): The service has failed too many times in a row. You stop calling it immediately. You don't keep hammering a dead service. You switch to fallback or return a graceful error.
Half-Open (testing recovery): After waiting, you send one test request. If it succeeds, go back to Closed. If it fails, go back to Open.
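The three states can be sketched in a few lines. This is a single-threaded illustration, not a production implementation: the class name, the threshold of 3 failures, and the 30-second cooldown are all assumptions, and a concurrent version would need locking so only one half-open test request goes out at a time.

```python
import time


class CircuitBreaker:
    """Closed -> Open after repeated failures -> Half-Open after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            return "half-open"
        return "open"

    def call(self, fn, fallback):
        if self.state == "open":
            return fallback()       # skip the failing service entirely
        try:
            result = fn()           # closed, or the half-open test request
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # (re)open and restart the cooldown
            return fallback()
        self.failures = 0
        self.opened_at = None       # success closes the circuit
        return result


breaker = CircuitBreaker()
print(breaker.call(lambda: "primary answer", lambda: "fallback answer"))
```

The clock is injectable so the cooldown can be tested without waiting for real time to pass.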

In AI Apps: Pair with a Fallback Model

Primary fails? Route to backup.
Backup fails? Return a graceful error.
Users see: A slow response, not a crash.

User makes request
  ↓
Try Claude Sonnet (primary)
  ↓
If timeout or error: Try GPT-4o mini (fallback)
  ↓
If that fails too: Return "Response is slow. Try again in a moment."

Users get a response. Your system survives the outage. You lost a few requests, but not the whole product.
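The routing itself is a short loop. In this sketch the two providers are stubs (`flaky_primary` simulates an outage), and `call_with_fallbacks` is a hypothetical helper name; real code would wrap actual SDK calls and catch provider-specific exceptions instead of bare `Exception`.

```python
def call_with_fallbacks(prompt, providers,
                        error_message="Response is slow. Try again in a moment."):
    """Try each (name, callable) in order; return the first success."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:          # in real code, catch provider-specific errors
            print(f"{name} failed: {exc}")
    return None, error_message            # every provider failed: graceful error


def flaky_primary(prompt):
    raise TimeoutError("primary timed out")   # simulated outage


def stub_backup(prompt):
    return f"backup answer to: {prompt}"


providers = [("primary", flaky_primary), ("backup", stub_backup)]
print(call_with_fallbacks("hello", providers))
```

Returning the provider name alongside the answer makes it easy to log how often the fallback is actually serving traffic.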

Why This Matters

Your users are more resilient than your infrastructure. They can wait a few seconds or retry. They can't wait for you to manually fix production.
Outages are inevitable. Not if, when. APIs go down. Models have rate limits. Networks fail. Plan for it.
Fallbacks are cheap insurance. Adding a backup model adds maybe 5% to your token cost but saves you from total failure.

5. AI Wrappers vs Real AI Products

The Hard Truth

An AI wrapper is a prompt and an API call. An AI product is something that gets harder to copy every week it runs.

Most people building "AI startups" are building wrappers. They'll be dead in 6 months when the model provider ships the same feature natively.

What's the Difference?

A Wrapper

Takes user input
Sends it to GPT or Claude
Returns the output
No differentiation

Anyone can build this in a weekend. Anyone will. There's no moat.

An AI Product

Has three things a wrapper doesn't:

1. Proprietary Data

The model knows things about your users that no generic API can replicate.

Example: A fitness app that's been tracking a user's workouts for 2 years. The AI knows their progress, their injury history, their preferences. ChatGPT can't replicate that. When you use the AI coach, it's personalized in ways the generic model can't be.

2. Custom Workflows

The AI fits into a specific business process, not a generic chat box.

Example: A legal brief assistant isn't "ask Claude about law." It's "extract relevant case law, cite format this way, organize under these sections, flag contradictions." The workflow is your intellectual property.

3. Compounding Value

It gets better the more people use it.

Example: A coding assistant that learns from your codebase. The more you use it, the more it understands your patterns. A new developer on your team gets an AI that already knows your style guide.

The Test

If someone could replace your product with a ChatGPT subscription, you have a wrapper.

If they'd lose months of workflow history, personalization, and context, you have a product.

What This Means for Building

Don't launch a prompt wrapper and call it a startup. Launch something that:

Learns from user behavior
Integrates into a specific workflow
Gets better over time
Isn't replicable in an afternoon

6. Latency Budget: Hitting Sub-2-Second Responses

Why Speed Matters

Users abandon slow AI responses faster than they abandon slow websites.

At 2 seconds, you start to feel slow.
At 4 seconds, users are refreshing.
At 6 seconds, they've left.

But here's the thing: perceived latency and actual latency are different. You can make a 5-second response feel fast.

Strategy 1: Stream Everything

Don't wait for the full response. Start sending tokens as they arrive.

The user sees text appearing immediately. They're reading while the model is still generating. Perceived latency drops sharply even though total generation time is the same.

This is why ChatGPT feels fast even though it often takes 5+ seconds for a full response. You're reading from second 1.

Strategy 2: Model Selection

GPT-4o mini and Claude Haiku return first tokens 3–5x faster than frontier models.

For most product tasks—summarization, classification, basic reasoning—the quality difference is invisible to users.

But the speed difference is obvious.

Haiku takes 300ms to first token.
Sonnet takes 800ms to first token.

On a workflow of 10 sequential requests, that's 5 seconds saved (10 × 500ms) just by choosing the faster model for the right tasks.

Strategy 3: Cache Aggressively

Common queries, shared context, static system prompts: serve them from cache.

Zero LLM latency for repeated inputs.

Example: Your app always uses the same 2000-token system prompt. With prompt caching, the provider processes it once and re-uses it on subsequent requests while the cache is warm. The static part costs less and adds far less time to first token.

The Latency Budget Template

Component	Budget	Notes
Cache lookup	100ms	Redis, in-memory store
RAG retrieval	200ms	Vector DB + embedding search
Model first token	300ms	Using Haiku or mini
Streaming	1-3s	User reads while generation continues
Total perceived	~2 seconds	User sees first token at ~600ms (100 + 200 + 300)

If your app hits this budget:

Cache hit: 100ms response
Cache miss, no retrieval: ~400ms first token (100ms lookup + 300ms model) + streaming to completion
Cache miss with RAG: ~600ms first token (100ms + 200ms + 300ms) + streaming to completion

All of these feel fast to users.

Quick Wins

Add streaming. Single biggest win for perceived latency.
Use Haiku for 80% of requests. Save 500ms per request.
Cache system prompts. Immediate win if you use context injection.
Upgrade to Vercel Pro or similar. Cold start times drop significantly.

7. Function Calling vs Tool Use vs MCP

Three Terms, Often Confused

People use "function calling," "tool use," and "MCP" interchangeably. They're not the same thing. Here's what's different.

Function Calling

The model says: "I want to call this function with these arguments."

Your code:

Sees the function call
Decides whether to actually run it
Runs it (or doesn't)
Returns the result to the model

OpenAI popularized this under the name "function calling." Most major model APIs now support it.

User: "What's the weather in San Francisco?"
Model: "I should call get_weather(city='San Francisco')"
Your code: Calls get_weather, gets "72°F, sunny"
Your code: Returns result to model
Model: "It's 72°F and sunny in San Francisco."

This lives inside your application. Your code owns the functions. Your code decides what the model can call.
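Here is a sketch of the "your code decides" step. The shape of `model_tool_call` is hypothetical, since each provider formats tool calls a little differently, and `get_weather` is a stub. The part that carries over is the allowlist: the model can only trigger what you registered.

```python
import json


def get_weather(city):
    """Stub tool; a real one would query a weather service."""
    return {"city": city, "forecast": "72°F, sunny"}


# Allowlist: the model can only trigger functions registered here.
TOOLS = {"get_weather": get_weather}


def execute_tool_call(tool_call):
    """Validate and run one tool call; return a JSON string for the model."""
    name = tool_call.get("name")
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(tool_call.get("arguments", "{}"))
        result = TOOLS[name](**args)
    except (TypeError, ValueError) as exc:   # bad JSON or wrong arguments
        return json.dumps({"error": str(exc)})
    return json.dumps(result, ensure_ascii=False)


# Hypothetical shape of what a model might emit; real providers differ in the details.
model_tool_call = {"name": "get_weather", "arguments": '{"city": "San Francisco"}'}
print(execute_tool_call(model_tool_call))
```

Errors go back to the model as data rather than raising, so the model can correct its arguments and try again.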

Tool Use

Anthropic's term for the same concept. Functionally identical to function calling.

The model outputs a structured tool call. Your application executes it. The terms are basically interchangeable.

The only difference: naming and framing. Anthropic prefers "tool use." OpenAI says "function calling." Same mechanism.

MCP: The Protocol Layer

MCP (Model Context Protocol) is different. It's not about calling functions. It's about standardizing how tools are discovered, connected, and shared across systems.

Instead of hardcoding tools into your app, an MCP server exposes them. Any MCP-compatible client can use them. You write the tool once. Multiple apps use it.

Example: You build a "company knowledge" MCP server that exposes:

search_wiki
lookup_employee
get_org_chart

Your AI assistant uses it. Your chatbot uses it. Your automation tool uses it. You maintain it in one place.

MCP is the USB-C of AI tooling. Standardization that lets tools work everywhere.

The Hierarchy

Function Calling/Tool Use → Model calls your functions
MCP → Standardized protocol for discovering and calling tools across systems

8. Embeddings Explained

The Core Idea

Computers can't compare words. They can compare numbers.

An embedding model converts each word (or sentence or document) into a list of hundreds of numbers representing its position in "meaning-space."

Why This Works

Similar meanings produce similar numbers. Different meanings produce different numbers. Distance equals difference.

Example:

"Cat" and "kitten" end up close together (both are small animals)
"Cat" and "car engine" end up far apart (different meanings entirely)

The distance between vectors is the distance between meanings.
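Cosine similarity is one common way to measure that closeness. The three-dimensional vectors below are hand-made for illustration; a real embedding model would produce them, with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings"; real models produce hundreds of dimensions.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car engine": [0.1, 0.2, 0.95],
}

print("cat vs kitten:", round(cosine_similarity(vectors["cat"], vectors["kitten"]), 3))
print("cat vs car engine:", round(cosine_similarity(vectors["cat"], vectors["car engine"]), 3))
```

The first pair scores far higher than the second, which is all "close in meaning-space" means in practice.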

How This Powers RAG

Semantic search works like this:

You embed all your documents into vectors
User asks a question
You embed the question into a vector
Find the documents whose vectors are closest to the question vector
Return those documents as context to the model

The model now has relevant context. It answers based on your documents, not its training data.
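The retrieval steps above can be sketched end to end. One loud assumption: `toy_embed` counts words instead of calling a real embedding model, so it matches on shared vocabulary rather than meaning. Swap in a real model and the rest of the flow stays the same.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, index, k=2):
    """Embed the question, then return the k closest documents."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


docs = [
    "Refunds are available within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "To reset your password, open account settings.",
]
index = [(doc, toy_embed(doc)) for doc in docs]   # embed documents once, up front

question = "How do I reset my password?"
context = retrieve(question, index, k=1)
prompt = ("Answer using only this context:\n" + "\n".join(context)
          + f"\n\nQuestion: {question}")
print(prompt)
```

Documents are embedded once when the index is built; only the question is embedded per request.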

Why This Matters for Your App

Memory: Past conversations and user facts can be embedded and stored, then pulled back by similarity when they become relevant.
Search: Vector similarity is how you find relevant information at scale.
Personalization: User preferences can be embedded. Similar users cluster together. You serve similar content.
Anomaly detection: Unusual input produces unusual embeddings. You catch strange requests before they reach your model.

Two Key Metrics

Embedding Dimension: How many numbers in each vector. 384 dimensions (small), 1536 dimensions (large). Larger = more detail, more compute.
Similarity Score: Cosine similarity ranges from -1 to 1, though scores for text usually land between 0 and 1. Higher means closer in meaning. Use this as your relevance threshold, tuned on your own data.

9. Why 90% of AI MVPs Fail After Launch

It Worked in the Demo

You demo the product. It's incredible. Users love it. You launch.

Then it breaks in production. Within a week, you're debugging edge cases you never saw in testing.

Here are the four things nobody builds before launch. They should.

1. Error Handling

The model times out. The API returns a 500. The response is malformed JSON.

If your app has no error handling, users see:

A blank screen
A raw stack trace
Nothing for 30 seconds

That's a dead product.

What to build:

Catch every error. Return something useful.

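    from anthropic import APIConnectionError, APIStatusError, RateLimitError  # assumes the Anthropic SDK

    def ask(client):
        try:
            return client.messages.create(...)  # arguments elided, as in the original
        except RateLimitError:  # HTTP 429
            return "System is busy. Try in a moment."
        except APIStatusError as e:  # other non-2xx responses; these carry status_code
            return f"Technical issue: {e.status_code}. We're working on it."
        except APIConnectionError:  # network failure or timeout; no status code exists
            return "Couldn't reach the model. Try in a moment."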

2. Fallbacks

Your primary model is down. Your app should route to a backup silently. The user should never know.

Most MVPs have one model and no fallback.

What to build:

At least a backup model or provider.

Primary: Claude Sonnet
Backup: GPT-4o mini
Last resort: Return cached response or error message

When primary fails, users still get a response. It might be slower. It might be from a cheaper model. But the app doesn't crash.

3. Monitoring

You have no idea what your users are actually asking. You don't know which prompts fail. You don't know where cost is going.

Flying blind in production is how you miss problems until they're crises.

What to build:

Log every request (anonymized). Track:

Response latency
Cost per request
Error rates
Which prompts are slow or expensive

Hook up Helicone, LangSmith, or roll your own.
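If you roll your own, a thin wrapper covers the basics. This sketch assumes the model function returns `(text, input_tokens, output_tokens)`, which is a made-up convention, and the two prices are placeholders to replace with your provider's current rates. It logs prompt length rather than prompt text to keep records anonymized.

```python
import json
import time

# Placeholder per-1K-token prices; substitute your provider's current rates.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015


def logged_call(model_fn, prompt, log):
    """Run model_fn(prompt) and append a latency/cost/error record to log."""
    record = {"prompt_chars": len(prompt), "error": None}
    start = time.perf_counter()
    try:
        text, input_tokens, output_tokens = model_fn(prompt)
        record["cost_usd"] = (input_tokens * INPUT_PRICE_PER_1K
                              + output_tokens * OUTPUT_PRICE_PER_1K) / 1000
        return text
    except Exception as exc:
        record["error"] = type(exc).__name__   # record the failure, then re-raise
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        log.append(record)


def stub_model(prompt):
    return "ok", 1000, 500   # (text, input tokens, output tokens)


log = []
logged_call(stub_model, "hello", log)
print(json.dumps(log[0]))
```

The `finally` block guarantees a record is written for failures too, which is where the interesting data usually is.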

4. Feedback Loops

Users need a way to say "this answer is wrong."

Without that signal, you have no idea your AI is confidently giving bad answers.

What to build:

A thumbs-down button. Read it. Fix it.

Even better: "This answer is: wrong, confusing, missing info, too long, off-topic." Let users categorize the failure.

Now you have a dataset of failures. Use it to improve your prompt or your retrieval.

Before You Launch

Error handling for every API call
A backup model
Monitoring/logging
User feedback mechanism

Don't launch without them. Not after. Before.

10. The Real Cost of an API Call

Most Founders Don't Actually Know What They're Paying

They look at the token price and think that's the cost.

It's not.

Breaking Down the Cost

1. Token Price (Visible)

Input tokens: ~$0.003 per 1K tokens (Claude)
Output tokens: ~$0.015 per 1K tokens (Claude, 5x more expensive)

Most people forget that output tokens scale with response length. A 2000-token response costs 5x more than a 400-token response.

2. Retry Cost (Hidden)

Your model call fails 2–5% of the time. Your code retries.

You just paid for the same tokens twice or three times.

Under load (when you need retries most), this adds up fast.

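Example: 1000 requests at a 3% failure rate means 30 retries. If the failed attempts were billed (for example, a full response that came back malformed), you paid twice for those 30 requests, or three times if the retry failed too.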

3. Latency Cost (Sneaky)

Slow responses mean users refresh, retry, or abandon.

Every redundant request is money burned.

Latency isn't just UX—it's a cost multiplier.

Example: If 5% of users retry because your response is slow, you've lost 5% of your token budget to latency.

4. The Real Calculation

(Input tokens × input_rate + Output tokens × output_rate) × (1 + retry_rate) × requests_per_day

This is your actual cost. Not the token price alone.
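As a worked example, here is the formula as a function. The rates are the per-1K Claude figures quoted above, so the sum is divided by 1000. The token counts, the 3% retry rate, and the 1000 requests per day are made-up inputs.

```python
def daily_cost(input_tokens, output_tokens, input_rate_per_1k, output_rate_per_1k,
               retry_rate, requests_per_day):
    """Token cost per request, inflated by retries, times daily volume."""
    per_request = (input_tokens * input_rate_per_1k
                   + output_tokens * output_rate_per_1k) / 1000
    return per_request * (1 + retry_rate) * requests_per_day


# 1000 input tokens, 500 output tokens, 3% retries, 1000 requests per day.
cost = daily_cost(1000, 500, 0.003, 0.015, 0.03, 1000)
print(f"${cost:.2f} per day, about ${cost * 30:.0f} per month")
```

Note that output tokens dominate here even though there are half as many of them as input tokens.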

Making It Cheaper

1. Use smaller models for routine tasks

GPT-4o mini for classification, summaries, routing: 90% cheaper
Claude Haiku for simple tasks: 90% cheaper
Save frontier models (Sonnet, GPT-4o) for complex reasoning

Most teams can cut token cost by 50% just by sorting requests into the right model tier.

2. Reduce retry rate

Add error handling so requests don't fail
Use exponential backoff instead of immediate retry
Cache responses so you don't retry the same query

3. Cache aggressively

System prompts: Cache them. Repeat reads are billed at a fraction of the normal input price while the cache stays warm.
Repeated queries: Cache them. Cost drops to near-zero.
Long context: Cache it. Massive savings.

4. Optimize response length

Don't ask for 5000 tokens when you need 500
Use structured output to reduce verbosity
Tune your prompt to be tighter

Example: Real Numbers

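Scenario	Input	Output	Retry	Cost
Unoptimized (all Sonnet)	1000 tokens	2000 tokens	3% failure	~$0.0340 per request
Optimized (Haiku 80%, Sonnet 20%, before caching savings)	800 tokens	800 tokens	1% failure	~$0.0041 per request

Both rows apply the formula above with the Claude rates quoted earlier, and price Haiku at one tenth of those rates (the "90% cheaper" figure). Caching would lower the optimized row further.

Difference: about 88% cost reduction by being smart about model choice and response length.

At 10,000 requests/month, that's roughly $340 vs $41.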

11. AI Memory Systems: The Three Types

The Problem You're Solving

A user tells your AI their name on day 1. They come back on day 5. Your AI has no idea who they are.

That's a bad product.

Production systems handle three types of memory, each solving a different problem.

Short-Term Memory

What it is:

The conversation context window.

Everything in the current session. The model has this by default.

User: "My name is Alex. I run a startup."
Model: Reads this and remembers it
User (later in same session): "How many employees should I hire?"
Model: "As a founder, you should consider..."

This works great until the session ends. When Alex closes the app and comes back tomorrow, the model has forgotten everything.

Best for:

Continuity within a single conversation.

Long-Term Memory

What it is:

A database that persists facts about users.

After each session, extract key facts:

Name: Alex
Role: Founder
Company stage: Early
Industry: SaaS

Store them in a database.

Next session, retrieve and inject into the system prompt:

"You are talking to Alex, a founder running a SaaS startup in the early stage."

Now the AI feels like it knows the user.

Best for:

Personal context across sessions.

How to implement:

At end of each session, call the model: "Extract key facts about this user"
Store in a simple user profile table
On next session, load profile and inject into system prompt
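The store-and-inject half of those steps might look like this. It is a sketch using SQLite in memory so it runs anywhere; the table name, the helper names, and the prompt wording are all assumptions, and the facts are hard-coded where a real app would use the output of the end-of-session extraction call.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory for the example; use a real database in production
conn.execute("CREATE TABLE user_profiles (user_id TEXT PRIMARY KEY, facts TEXT)")


def load_facts(user_id):
    row = conn.execute(
        "SELECT facts FROM user_profiles WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}


def save_facts(user_id, new_facts):
    """Merge newly extracted facts into the stored profile."""
    facts = load_facts(user_id)
    facts.update(new_facts)
    conn.execute(
        "INSERT OR REPLACE INTO user_profiles (user_id, facts) VALUES (?, ?)",
        (user_id, json.dumps(facts)),
    )
    conn.commit()


def build_system_prompt(user_id):
    facts = load_facts(user_id)
    if not facts:
        return "You are a helpful assistant."
    known = "; ".join(f"{k}: {v}" for k, v in facts.items())
    return f"You are a helpful assistant. Known facts about the user: {known}."


# In a real app these facts would come from the end-of-session extraction call.
save_facts("alex", {"name": "Alex", "role": "Founder"})
save_facts("alex", {"industry": "SaaS"})
print(build_system_prompt("alex"))
```

Merging instead of overwriting means each session adds to the profile without losing what earlier sessions learned.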

Tools:

Supabase (database), vector DB for semantic search, or just PostgreSQL with a users table.

Episodic Memory

What it is:

Specific past interactions, not just facts.

Not just "Alex is a founder." But "Last time you asked about pricing, I suggested value-based pricing instead of per-seat."

Store session summaries in a vector database. Retrieve the relevant ones when context matches.

User (new session): "What pricing model should I use?"
System: Search episodic memory for "pricing discussions"
Result: "Found previous discussion about value-based vs per-seat"
Load that session summary as context
Model: Remembers the previous conversation

Expensive (requires vector storage, embedding searches) but powerful.

Best for:

Products where user history is a core part of value.

Therapy apps: "You mentioned anxiety about X last week"
Coaching: "You've been struggling with Y for 3 sessions"
Long-running project assistants: "You decided on this architecture in week 2"

Which One to Build

Most products:

Short-term + basic long-term.

Current session works by default
Basic user profile in database
Done

Complex products:

Add episodic memory.

Where user history is core value
Where remembering past decisions matters
Where the AI needs more than just facts

Implementation Complexity

Type	Storage	Complexity	Cost	When
Short-term	Context window	Built-in	Included	Always
Long-term	Database	Simple	Low	Always
Episodic	Vector DB	Medium	Medium	If history is value

12. Evaluation-Driven Development

The Validation Problem

You change a prompt. It feels better. But does it actually work better?

If your answer is "it feels better," you're guessing.

How Evaluation-Driven Development Works

Before you write a prompt, write a test.

Define your test set: 20–50 representative inputs
Define good output: What does a correct answer look like?
Score the outputs: Either manually, or use another LLM as a judge
Compare scores: Version 2 scores 14% higher on accuracy than version 1

Now you're not guessing. You know version 2 is better because the numbers say so.

The Discipline

Never ship a prompt change without running evals.

It takes 10 minutes to set up. It saves hours of debugging production regressions.

Simple Setup

Step 1: Create your test set

[
  {
    "input": "Summarize this customer feedback into 2 sentences",
    "context": "[long customer review]",
    "expected_qualities": [
      "Must mention main complaint",
      "Must be under 50 tokens",
      "Must be actionable"
    ]
  },
  ...repeat for 20-50 examples
]

Step 2: Define your scoring rubric

Manual scoring:

1: Fails all criteria
2: Meets 1 criterion
3: Meets 2 criteria
4: Meets all criteria

Or use an LLM as judge:

Evaluate this summary against the criteria above. Score 1-4. Explain your score.

Step 3: Run your baseline

Run all 20 test cases through your current prompt. Record scores.

Average score: 3.2/4.0

Step 4: Try a new prompt

Change something (add context, adjust tone, change format).

Run the same 20 test cases.

New average: 3.6/4.0

Step 5: Decide

Is 3.6 enough of an improvement to ship? Probably. You have evidence.
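The five steps fit in a small harness. Everything below is illustrative: the two-case test set, the rule-based `score` function (a stand-in for manual or LLM-as-judge scoring), and the two stub "prompt versions" that would call your model in a real setup.

```python
from statistics import mean

test_set = [
    {"input": "Refund took three weeks and nobody replied to my emails.",
     "must_mention": "refund", "max_words": 20},
    {"input": "The app crashes every time I upload a photo.",
     "must_mention": "crash", "max_words": 20},
]


def score(output, case):
    """Rule-based scorer: one point per criterion met (0 to 2)."""
    points = 0
    if case["must_mention"] in output.lower():
        points += 1
    if len(output.split()) <= case["max_words"]:
        points += 1
    return points


def run_eval(generate, cases):
    """Average score of one prompt version across the whole test set."""
    return mean(score(generate(case["input"]), case) for case in cases)


# Stubs standing in for "model + prompt version"; real ones would call your model.
def prompt_v1(text):
    return "The customer is unhappy about something and wants it fixed soon."


def prompt_v2(text):
    return "Summary: " + text


baseline = run_eval(prompt_v1, test_set)
candidate = run_eval(prompt_v2, test_set)
print(f"v1={baseline:.2f} v2={candidate:.2f} ship={candidate > baseline}")
```

With only a handful of cases, small differences can be noise. A larger test set makes the comparison more trustworthy.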

Why This Matters

You're not shipping based on hunches. You're shipping based on measurements.

When something breaks in production, you can debug it: "This input type scores 1.2 on the eval set, but this input type scores 3.8. What's different?"

13. Fine-Tuning vs RAG vs Prompt Engineering

Three Ways to Make an LLM Smarter

Only one is right for your situation. Most people pick wrong.

Start with Prompt Engineering

Always. It's free, fast, and often enough.

If you can write a prompt that produces the output you want consistently, stop here.

You are a customer support AI for a SaaS company.
Respond in under 100 words.
If you don't know the answer, say "I don't know" instead of guessing.
Use a friendly but professional tone.

Customer: "How do I reset my password?"

This is prompt engineering. It's powerful. Most people leave this step too early and jump to fine-tuning.

When to move past it: When you've written a really good prompt and it still doesn't work reliably for your use case.

Use RAG If You Have a Knowledge Problem

The model doesn't have the right knowledge. It doesn't know your docs. It doesn't know your company's specific data. It doesn't know recent events.

You're not changing the model's behavior. You're feeding it better context.

System prompt: "You are a support AI for Acme Inc."
Knowledge base: [All support docs, FAQs, policy documents]
User query: "What's your refund policy?"

You retrieve the relevant policies. You inject them into the prompt. Model answers based on your actual policy, not hallucinated policy.

RAG is about retrieval, not training.

When to use: When the model's problem is "it doesn't have enough context" not "it doesn't know how to behave."

Use Fine-Tuning If You Have a Behavior Problem

You need to change how the model responds, not just what it knows.

Different tone. Different output format. Different reasoning style that prompts can't reliably produce.

Example: You want your AI to:

Always respond in a specific format
Never suggest competitors
Always cite sources in a particular way
Adopt a very specific tone (e.g., Shakespearean, extremely formal)

If prompt engineering and RAG can't make it do this consistently, fine-tune.

Cost: Money (training), time (data collection), complexity (managing a custom model).

The Decision Tree

1. Can you solve it with a good prompt? YES → Use prompt engineering
2. No, the model lacks knowledge. → Use RAG
3. No, the model needs to behave differently. → Use fine-tuning

If you answer "maybe it needs fine-tuning," you probably don't need it yet.

Fine-tuning is expensive. Do it only when you're sure.

The Real Numbers

Approach	Time to Deploy	Cost	Maintenance
Prompt engineering	1 hour	$0	Low
RAG	1 day	$0–$100	Medium
Fine-tuning	1 week	$100–$500+	High

Start cheap. Move up only when necessary.

14. Surviving Rate Limits

What Happens

Too many requests in a window. You get a 429 error. Without handling, every user at once sees an error. Cascading failure.

This is predictable. Plan for it.

Response 1: Exponential Backoff

Wait 1 second. Retry. Fail again? Wait 2 seconds. Retry. 4 seconds. 8 seconds.

Most transient rate limits clear in under 10 seconds.

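    import time

    from anthropic import RateLimitError  # assumption: Anthropic SDK; use your provider's 429 exception

    def call_with_backoff(fn, max_retries=5):
        """Call fn(), retrying on rate limits with exponential backoff."""
        for attempt in range(max_retries):
            try:
                return fn()
            except RateLimitError:
                if attempt == max_retries - 1:
                    break  # out of attempts: don't sleep before giving up
                wait_time = 2 ** attempt  # 1, 2, 4, 8 seconds
                time.sleep(wait_time)
        raise Exception("Max retries exceeded")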

This works for transient issues. But it blocks the user while waiting.

Response 2: Request Queue

New requests go into a queue. A worker pulls from the queue at a controlled rate, staying under the limit.

Users get slightly slower responses instead of errors.

User Request → Queue → Worker (rate-limited) → Model

The worker never exceeds rate limits. The queue absorbs spikes.

Example: Limit yourself to 10 requests per second even if the API allows 100. When traffic spikes, users wait in queue instead of failing.
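A single-threaded sketch of the queue-plus-worker idea, spacing requests evenly to stay under a self-imposed limit. The class name and API are hypothetical. A production version would run the worker in the background and hand results back to waiting callers.

```python
import time
from collections import deque


class RateLimitedWorker:
    """Drains a queue no faster than max_per_second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.queue = deque()
        self.last_sent = None

    def submit(self, request):
        self.queue.append(request)          # spikes pile up here, not at the API

    def drain(self, send):
        """Send every queued request in order, pausing to respect the limit."""
        results = []
        while self.queue:
            if self.last_sent is not None:
                wait = self.min_interval - (self.clock() - self.last_sent)
                if wait > 0:
                    self.sleep(wait)
            self.last_sent = self.clock()
            results.append(send(self.queue.popleft()))
        return results


worker = RateLimitedWorker(max_per_second=10)
for i in range(3):
    worker.submit(f"request {i}")
print(worker.drain(lambda req: req.upper()))   # stub "model call"
```

The clock and sleep functions are injectable, so the pacing logic can be tested without real delays.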

Response 3: Fallback Models

Primary model hits rate limit? Route to a different provider.

Request
  ↓
Try Claude (primary)
  ↓ Rate limited
Try GPT-4o mini (fallback)
  ↓ Also limited (rare)
Return cached response or graceful error

Users never see the limit. Your architecture quietly switches lanes.

Best Practices

Monitor your usage. Track how close you are to the limit so a 429 is never the first warning you get.
Set your own limit lower. If the API allows 1000/min, limit yourself to 800. Leave headroom.
Implement all three. Backoff for transient issues. Queue for sustained load. Fallback for provider problems.
Alert early. When you hit 70% usage, alert your team. Investigate why.

Conclusion: Building for Production

The difference between an MVP that dies and one that scales isn't the idea. It's the architecture.

Build with these patterns from day one:

3-layer architecture: Gateway → Processing → Intelligence
Circuit breakers and fallbacks: Plan for failure
Monitoring and feedback loops: See what's happening
Evaluation-driven development: Don't guess
Right tool for the job: Use Haiku for simple tasks, not Sonnet
Memory systems: Remember context across sessions
Graceful degradation: Slow is better than broken

These are the difference between "it worked in the demo" and "it works at scale."

Quick Reference Checklists

Before Launch Checklist

Error handling for all API calls
Backup model configured
Monitoring/logging in place
User feedback mechanism (thumbs down)
Rate limit handling (backoff, queue, or fallback)
Latency budget tested
Cost analysis run
Eval set created for your core prompts

Production Readiness

Circuit breaker pattern implemented
Cache strategy defined
Long-term memory system in place (if needed)
Observability dashboard set up
Fallback provider configured
Cost alerts set to trigger at thresholds
Docs on how to investigate failures