The AI Engineering Handbook
Best Practices for Building Production AI Products

The Squirrel Team
1. The 3-Layer AI Architecture
The Problem
Most AI apps are exactly one thing: a frontend, an API call, and a model. When something breaks—and it will—the entire product dies.
Real production systems have layers. Each one solves a specific problem before requests ever reach your model.
The Three Layers Explained
Layer 1: Gateway Layer (Protect)
This is your security perimeter. Every request hits here first.
- Rate limiting: Stop abusive requests before they waste tokens
- Authentication: Verify the request is actually from your user
- Input validation: Reject obviously malformed input
- Request deduplication: If the same request came in twice (network retry), don't charge twice
Think of this as the bouncer at the door. Bad requests get rejected here, never reaching your expensive model.
Layer 2: Processing Layer (Think)
This is where your business logic lives. The model doesn't see raw user input—it sees carefully constructed context.
- Prompt construction: Build the actual prompt from templates and user input
- Context injection: Add relevant history, system rules, retrieved documents
- Output parsing: Extract structured data from the model's response
- Caching: For repeated queries, return cached responses without hitting the model
- Retry logic: If the model returns malformed output, retry automatically
The processing layer is where you control quality. You construct smarter prompts than users could write. You parse outputs into formats your app can use. You cache aggressively.
Layer 3: Intelligence Layer (Generate)
The model call itself. Here's the key: it should be swappable.
- Today's primary: Claude Sonnet
- Today's backup: GPT-4o mini
- Tomorrow: Whatever's better
- Your app doesn't care. It just sends structured input and gets structured output.
This decoupling means you can upgrade models, add fallbacks, or switch providers without touching your application logic.
The Pattern in Code
Why This Matters
- Resilience: If the model times out, your gateway still rejected bad requests. Your processing layer can return a graceful error instead of crashing.
- Cost control: Caching and deduplication happen before model calls. You literally don't pay for repeated queries.
- Quality: You control exactly what the model sees and how you use what it returns. You're not building a thin wrapper.
- Observability: Each layer is a checkpoint. You can see where requests are failing.
2. Temperature: What the Number Actually Means
The Cargo Cult Problem
Every tutorial says "set temperature to 0.7." Nobody can explain why. It became a default through repetition, not reasoning.
Temperature controls randomness. That's it. Understanding the number means understanding what you actually want your model to do.
What Temperature Does
Temperature = 0 (Deterministic)
The model picks the single most likely next token, every time. Same input, same output, guaranteed.
Temperature = 1.0 (Baseline randomness)
The model samples from the full probability distribution. Lots of variation, but still statistically weighted toward likely tokens.
Temperature > 1.0 (Chaotic)
The model samples from a flattened distribution. Even unlikely tokens become more common. Useful for brainstorming, but risky for anything that needs consistency.
When to Use Each
Use 0 for:
- Data extraction: "Extract all email addresses from this text" — you need exact, repeatable results
- Classification: "Is this review positive or negative?" — ambiguity costs accuracy
- JSON generation: "Output this as JSON" — malformed output breaks your app
- Any structured output: API responses, code generation where syntax matters
Use 0.3–0.5 for:
- Summaries: A little variation makes summaries feel less robotic, but they should still capture the same points
- Rewrites: Different phrasings of the same core message
- Code generation: Most code variations are functionally equivalent; slight variation is fine
- Translations: Minor stylistic differences don't break meaning
Use 0.7–1.0 for:
- Creative writing: You want the model to surprise you
- Brainstorming: Generate many different ideas, not the obvious one
- Open-ended ideation: The point is variation
- Content where "sameness" is the problem
The Real Lesson
Temperature is not a magic number. It's a control. Pick it based on what you need:
- Do you need exact, repeatable output? Use 0.
- Do you need some variation but mostly reliable? Use 0.3–0.5.
- Do you need the model to explore possibilities? Use 0.7–1.0.
Don't inherit settings from blog posts. Think about your actual use case.
3. The AI Startup Stack Under $100/month
The Founder's Mistake
Most teams overbuild infrastructure before they have 10 users. They spin up Kubernetes, hire DevOps, add monitoring, set up CI/CD pipelines. Then they realize they built infrastructure, not a product.
Here's what you actually need to ship and validate an AI product.
The Stack
Compute + Frontend: Vercel ($0–$20/month)
- Deploys Next.js or React in one command
- Free tier covers serious traffic (500GB bandwidth, 100GB function execution)
- Zero DevOps overhead
- Perfect for MVP
When to upgrade: When you hit the edge of the free tier limits. Vercel's Pro plan ($20/month) gets you basically unlimited.
Database + Auth + Storage: Supabase ($0–$50/month)
- PostgreSQL with row-level security built in
- Auth (social login, passwordless, 2FA)
- File storage
- All in one dashboard
The free tier covers most MVPs. You get 500MB storage, 5GB egress per month, and 50,000 monthly active users. Most early products never hit these limits.
When to upgrade: When you exceed the free tier limits, the Pro plan is $25/month.
AI Layer: Claude API or OpenAI ($0–$100/month)
- Claude Haiku: Fastest, cheapest model. Perfect for simple tasks, summaries, basic classification.
- Claude Sonnet: Best value. Handles complex reasoning, code, structured data.
- Add prompt caching: Cache the system prompt and large context. Only pay for new tokens in each request. This alone can cut costs by 50–80% for repetitive queries.
Example: If your app always includes a 5000-token instruction document, caching means you only pay for that once per hour. Huge savings at scale.
Observability: Helicone or LangSmith (Free tier)
- See what your AI is doing: request latency, cost per request, error rates
- Identify which prompts are slow or expensive
- Catch hallucinations in production before users see them
Free tier logs ~1000 requests/month. Good enough to start. When you exceed it, Helicone Pro is $7/month. LangSmith Pro is $20/month.
Total Cost: ~$45/month
- Vercel: $0 (free tier)
- Supabase: $0 (free tier)
- Claude API: $30/month (rough estimate for 1M tokens)
- Helicone: $0 (free tier)
When Each Component Upgrades
| Component | Trigger | Cost |
|---|---|---|
| Vercel | 500GB bandwidth exceeded | $20/month |
| Supabase | 500MB storage exceeded | $25/month |
| Claude API | Usage increases | Variable, usually $50–$200/month |
| Helicone | 1000 requests/month exceeded | $7/month |
You don't need to upgrade everything at once. You upgrade as you hit the ceiling on each piece.
4. Why Your AI App Needs a Circuit Breaker
What Happened Last Time Your Provider Went Down
OpenAI went offline. Every app calling their API with no fallback died. Users saw blank screens. Support was flooded. Revenue stopped.
The apps that survived? They had circuit breakers.
What a Circuit Breaker Does
A circuit breaker is a pattern borrowed from distributed systems. It wraps calls to a service and monitors for failures.
Three States:
- Closed (working normally): Requests flow through. Everything is fine.
- Open (failures detected): The service has failed too many times in a row. You stop calling it immediately. You don't keep hammering a dead service. You switch to fallback or return a graceful error.
- Half-Open (testing recovery): After waiting, you send one test request. If it succeeds, go back to Closed. If it fails, go back to Open.
In AI Apps: Pair with a Fallback Model
- Primary fails? Route to backup.
- Backup fails? Return a graceful error.
- Users see: A slow response, not a crash.
Users get a response. Your system survives the outage. You lost a few requests, but not the whole product.
Why This Matters
- Your users are more resilient than your infrastructure. They can wait a few seconds or retry. They can't wait for you to manually fix production.
- Outages are inevitable. Not if, when. APIs go down. Models have rate limits. Networks fail. Plan for it.
- Fallbacks are cheap insurance. Adding a backup model adds maybe 5% to your token cost but saves you from total failure.
5. AI Wrappers vs Real AI Products
The Hard Truth
An AI wrapper is a prompt and an API call. An AI product is something that gets harder to copy every week it runs.
Most people building "AI startups" are building wrappers. They'll be dead in 6 months when the model provider ships the same feature natively.
What's the Difference?
A Wrapper
- Takes user input
- Sends it to GPT or Claude
- Returns the output
- No differentiation
Anyone can build this in a weekend. Anyone will. There's no moat.
An AI Product
Has three things a wrapper doesn't:
1. Proprietary Data
The model knows things about your users that no generic API can replicate.
Example: A fitness app that's been tracking a user's workouts for 2 years. The AI knows their progress, their injury history, their preferences. ChatGPT can't replicate that. When you use the AI coach, it's personalized in ways the generic model can't be.
2. Custom Workflows
The AI fits into a specific business process, not a generic chat box.
Example: A legal brief assistant isn't "ask Claude about law." It's "extract relevant case law, cite format this way, organize under these sections, flag contradictions." The workflow is your intellectual property.
3. Compounding Value
It gets better the more people use it.
Example: A coding assistant that learns from your codebase. The more you use it, the more it understands your patterns. A new developer on your team gets an AI that already knows your style guide.
The Test
If someone could replace your product with a ChatGPT subscription, you have a wrapper.
If they'd lose months of workflow history, personalization, and context, you have a product.
What This Means for Building
Don't launch a prompt wrapper and call it a startup. Launch something that:
- Learns from user behavior
- Integrates into a specific workflow
- Gets better over time
- Isn't replicable in an afternoon
6. Latency Budget: Hitting Sub-2-Second Responses
Why Speed Matters
Users abandon slow AI responses faster than they abandon slow websites.
At 2 seconds, you start to feel slow.
At 4 seconds, users are refreshing.
At 6 seconds, they've left.
But here's the thing: perceived latency and actual latency are different. You can make a 5-second response feel fast.
Strategy 1: Stream Everything
Don't wait for the full response. Start sending tokens as they arrive.
The user sees text appearing immediately. They're reading while the model is still thinking. Perceived latency drops by 70% even if actual latency is the same.
This is why ChatGPT feels fast even though it often takes 5+ seconds for a full response. You're reading from second 1.
Strategy 2: Model Selection
GPT-4o mini and Claude Haiku return first tokens 3–5x faster than frontier models.
For most product tasks—summarization, classification, basic reasoning—the quality difference is invisible to users.
But the speed difference is obvious.
- Haiku takes 300ms to first token.
- Sonnet takes 800ms to first token.
On a 10-request workflow, that's 5 seconds saved just by choosing the faster model for the right tasks.
Strategy 3: Cache Aggressively
Common queries, shared context, static system prompts: serve them from cache.
Zero LLM latency for repeated inputs.
Example: Your app always uses the same 2000-token system prompt. With prompt caching, you pay for it once. Every subsequent request re-uses it from cache. Immediate response for the static part.
The Latency Budget Template
| Component | Budget | Notes |
|---|---|---|
| Cache lookup | 100ms | Redis, in-memory store |
| RAG retrieval | 200ms | Vector DB + embedding search |
| Model first token | 300ms | Using Haiku or mini |
| Streaming | 1-3s | User reads while generation continues |
| Total perceived | ~2 seconds | User sees first token at ~500ms |
If your app hits this budget:
- Cache hit: 100ms response
- Normal hit: 600ms first token + streaming to completion
- Cold hit with RAG: 1.2s first token + streaming to completion
All of these feel fast to users.
Quick Wins
- Add streaming. Single biggest win for perceived latency.
- Use Haiku for 80% of requests. Save 500ms per request.
- Cache system prompts. Immediate win if you use context injection.
- Upgrade to Vercel Pro or similar. Cold start times drop significantly.
7. Function Calling vs Tool Use vs MCP
Three Terms, Often Confused
People use "function calling," "tool use," and "MCP" interchangeably. They're not the same thing. Here's what's different.
Function Calling
The model says: "I want to call this function with these arguments."
Your code:
- Sees the function call
- Decides whether to actually run it
- Runs it (or doesn't)
- Returns the result to the model
OpenAI introduced this concept. It's now standard across all major models.
This lives inside your application. Your code owns the functions. Your code decides what the model can call.
Tool Use
Anthropic's term for the same concept. Functionally identical to function calling.
The model outputs a structured tool call. Your application executes it. The terms are basically interchangeable.
The only difference: naming and framing. Anthropic prefers "tool use." OpenAI says "function calling." Same mechanism.
MCP: The Protocol Layer
MCP (Model Context Protocol) is different. It's not about calling functions. It's about standardizing how tools are discovered, connected, and shared across systems.
Instead of hardcoding tools into your app, an MCP server exposes them. Any MCP-compatible client can use them. You write the tool once. Multiple apps use it.
Example: You build a "company knowledge" MCP server that exposes:
- search_wiki
- lookup_employee
- get_org_chart
Your AI assistant uses it. Your chatbot uses it. Your automation tool uses it. You maintain it in one place.
MCP is the USB-C of AI tooling. Standardization that lets tools work everywhere.
The Hierarchy
- Function Calling/Tool Use → Model calls your functions
- MCP → Standardized protocol for discovering and calling tools across systems
8. Embeddings Explained
The Core Idea
Computers can't compare words. They can compare numbers.
An embedding model converts each word (or sentence or document) into a list of hundreds of numbers representing its position in "meaning-space."
Why This Works
Similar meanings produce similar numbers. Different meanings produce different numbers. Distance equals difference.
Example:
- "Cat" and "kitten" end up close together (both are small animals)
- "Cat" and "car engine" end up far apart (different meanings entirely)
The distance between vectors is the distance between meanings.
How This Powers RAG
Semantic search works like this:
- You embed all your documents into vectors
- User asks a question
- You embed the question into a vector
- Find the documents whose vectors are closest to the question vector
- Return those documents as context to the model
The model now has relevant context. It answers based on your documents, not its training data.
Why This Matters for Your App
- Memory: The model understands context inside a prompt because everything has been turned into embeddings first.
- Search: Vector similarity is how you find relevant information at scale.
- Personalization: User preferences can be embedded. Similar users cluster together. You serve similar content.
- Anomaly detection: Unusual input produces unusual embeddings. You catch strange requests before they reach your model.
Two Key Metrics
- Embedding Dimension: How many numbers in each vector. 384 dimensions (small), 1536 dimensions (large). Larger = more detail, more compute.
- Similarity Score: Usually 0 to 1. 1 = identical meaning, 0 = unrelated. Use this as your relevance threshold.
9. Why 90% of AI MVPs Fail After Launch
It Worked in the Demo
You demo the product. It's incredible. Users love it. You launch.
Then it breaks in production. Within a week, you're debugging edge cases you never saw in testing.
Here are the four things nobody builds before launch. They should.
1. Error Handling
The model times out. The API returns a 500. The response is malformed JSON.
If your app has no error handling, users see:
- A blank screen
- A raw stack trace
- Nothing for 30 seconds
That's a dead product.
What to build:
Catch every error. Return something useful.
2. Fallbacks
Your primary model is down. Your app should route to a backup silently. The user should never know.
Most MVPs have one model and no fallback.
What to build:
At least a backup model or provider.
When primary fails, users still get a response. It might be slower. It might be from a cheaper model. But the app doesn't crash.
3. Monitoring
You have no idea what your users are actually asking. You don't know which prompts fail. You don't know where cost is going.
Flying blind in production is how you miss problems until they're crises.
What to build:
Log every request (anonymized). Track:
- Response latency
- Cost per request
- Error rates
- Which prompts are slow or expensive
Hook up Helicone, LangSmith, or roll your own.
4. Feedback Loops
Users need a way to say "this answer is wrong."
Without that signal, you have no idea your AI is confidently giving bad answers.
What to build:
A thumbs-down button. Read it. Fix it.
Even better: "This answer is: wrong, confusing, missing info, too long, off-topic." Let users categorize the failure.
Now you have a dataset of failures. Use it to improve your prompt or your retrieval.
Before You Launch
- Error handling for every API call
- A backup model
- Monitoring/logging
- User feedback mechanism
Don't launch without them. Not after. Before.
10. The Real Cost of an API Call
Most Founders Don't Actually Know What They're Paying
They look at the token price and think that's the cost.
It's not.
Breaking Down the Cost
1. Token Price (Visible)
- Input tokens: ~$0.003 per 1K tokens (Claude)
- Output tokens: ~$0.015 per 1K tokens (Claude, 5x more expensive)
Most people forget that output tokens scale with response length. A 2000-token response costs 5x more than a 400-token response.
2. Retry Cost (Hidden)
Your model call fails 2–5% of the time. Your code retries.
You just paid for the same tokens twice or three times.
Under load (when you need retries most), this adds up fast.
Example: 1000 requests, 3% failure rate = 30 retries. You paid 3x for 30 requests you thought were free.
3. Latency Cost (Sneaky)
Slow responses mean users refresh, retry, or abandon.
Every redundant request is money burned.
Latency isn't just UX—it's a cost multiplier.
Example: If 5% of users retry because your response is slow, you've lost 5% of your token budget to latency.
4. The Real Calculation
This is your actual cost. Not the token price alone.
Making It Cheaper
1. Use smaller models for routine tasks
- GPT-4o mini for classification, summaries, routing: 90% cheaper
- Claude Haiku for simple tasks: 90% cheaper
- Save frontier models (Sonnet, GPT-4o) for complex reasoning
Most teams can cut token cost by 50% just by sorting requests into the right model tier.
2. Reduce retry rate
- Add error handling so requests don't fail
- Use exponential backoff instead of immediate retry
- Cache responses so you don't retry the same query
3. Cache aggressively
- System prompts: Cache them. You pay once per hour.
- Repeated queries: Cache them. Cost drops to near-zero.
- Long context: Cache it. Massive savings.
4. Optimize response length
- Don't ask for 5000 tokens when you need 500
- Use structured output to reduce verbosity
- Tune your prompt to be tighter
Example: Real Numbers
| Scenario | Input | Output | Retry | Cost |
|---|---|---|---|---|
| Unoptimized | 1000 tokens | 2000 tokens | 3% failure | $0.0147 per request |
| Optimized (Haiku 80%, Sonnet 20%, caching) | 800 tokens | 800 tokens | 1% failure | $0.0032 per request |
Difference: 78% cost reduction by being smart about model choice and caching.
At 10,000 requests/month, that's $103 vs $24.
11. AI Memory Systems: The Three Types
The Problem You're Solving
A user tells your AI their name on day 1. They come back on day 5. Your AI has no idea who they are.
That's a bad product.
Production systems handle three types of memory, each solving a different problem.
Short-Term Memory
What it is:
The conversation context window.
Everything in the current session. The model has this by default.
This works great until the session ends. When Alex closes the app and comes back tomorrow, the model has forgotten everything.
Best for:
Continuity within a single conversation.
Long-Term Memory
What it is:
A database that persists facts about users.
After each session, extract key facts:
- Name: Alex
- Role: Founder
- Company stage: Early
- Industry: SaaS
Store them in a database.
Next session, retrieve and inject into the system prompt:
Now the AI feels like it knows the user.
Best for:
Personal context across sessions.
How to implement:
- At end of each session, call the model: "Extract key facts about this user"
- Store in a simple user profile table
- On next session, load profile and inject into system prompt
Tools:
Supabase (database), vector DB for semantic search, or just PostgreSQL with a users table.
Episodic Memory
What it is:
Specific past interactions, not just facts.
Not just "Alex is a founder." But "Last time you asked about pricing, I suggested value-based pricing instead of per-seat."
Store session summaries in a vector database. Retrieve the relevant ones when context matches.
Expensive (requires vector storage, embedding searches) but powerful.
Best for:
Products where user history is a core part of value.
- Therapy apps: "You mentioned anxiety about X last week"
- Coaching: "You've been struggling with Y for 3 sessions"
- Long-running project assistants: "You decided on this architecture in week 2"
Which One to Build
Most products:
Short-term + basic long-term.
- Current session works by default
- Basic user profile in database
- Done
Complex products:
Add episodic memory.
- Where user history is core value
- Where remembering past decisions matters
- Where the AI needs more than just facts
Implementation Complexity
| Type | Storage | Complexity | Cost | When |
|---|---|---|---|---|
| Short-term | Context window | Built-in | Included | Always |
| Long-term | Database | Simple | Low | Always |
| Episodic | Vector DB | Medium | Medium | If history is value |
12. Evaluation-Driven Development
The Validation Problem
You change a prompt. It feels better. But does it actually work better?
If your answer is "it feels better," you're guessing.
How Evaluation-Driven Development Works
Before you write a prompt, write a test.
- Define your test set: 20–50 representative inputs
- Define good output: What does a correct answer look like?
- Score the outputs: Either manually, or use another LLM as a judge
- Compare scores: Version 2 scores 14% higher on accuracy than version 1
Now you're not guessing. You know version 2 is better because the numbers say so.
The Discipline
Never ship a prompt change without running evals.
It takes 10 minutes to set up. It saves hours of debugging production regressions.
Simple Setup
Step 1: Create your test set
Step 2: Define your scoring rubric
Manual scoring:
- 1: Fails all criteria
- 2: Meets 1 criteria
- 3: Meets 2 criteria
- 4: Meets all criteria
Or use an LLM as judge:
Step 3: Run your baseline
Run all 20 test cases through your current prompt. Record scores.
Average score: 3.2/4.0
Step 4: Try a new prompt
Change something (add context, adjust tone, change format).
Run the same 20 test cases.
New average: 3.6/4.0
Step 5: Decide
Is 3.6 better enough to ship? Probably. You have evidence.
Why This Matters
You're not shipping based on hunches. You're shipping based on measurements.
When something breaks in production, you can debug it: "This input type scores 1.2 on the eval set, but this input type scores 3.8. What's different?"
13. Fine-Tuning vs RAG vs Prompt Engineering
Three Ways to Make an LLM Smarter
Only one is right for your situation. Most people pick wrong.
Start with Prompt Engineering
Always. It's free, fast, and often enough.
If you can write a prompt that produces the output you want consistently, stop here.
This is prompt engineering. It's powerful. Most people leave this step too early and jump to fine-tuning.
When to move past it: When you've written a really good prompt and it still doesn't work reliably for your use case.
Use RAG If You Have a Knowledge Problem
The model doesn't have the right knowledge. It doesn't know your docs. It doesn't know your company's specific data. It doesn't know recent events.
You're not changing the model's behavior. You're feeding it better context.
RAG is about retrieval, not training.
When to use: When the model's problem is "it doesn't have enough context" not "it doesn't know how to behave."
Use Fine-Tuning If You Have a Behavior Problem
You need to change how the model responds, not just what it knows.
Different tone. Different output format. Different reasoning style that prompts can't reliably produce.
Example: You want your AI to:
- Always respond in a specific format
- Never suggest competitors
- Always cite sources in a particular way
- Adopt a very specific tone (e.g., Shakespearean, extremely formal)
If prompt engineering and RAG can't make it do this consistently, fine-tune.
Cost: Money (training), time (data collection), complexity (managing a custom model).
The Decision Tree
If you answer "maybe it needs fine-tuning," you probably don't need it yet.
Fine-tuning is expensive. Do it only when you're sure.
The Real Numbers
| Approach | Time to Deploy | Cost | Maintenance |
|---|---|---|---|
| Prompt engineering | 1 hour | $0 | Low |
| RAG | 1 day | $0–$100 | Medium |
| Fine-tuning | 1 week | $100–$500+ | High |
Start cheap. Move up only when necessary.
14. Surviving Rate Limits
What Happens
Too many requests in a window. You get a 429 error. Without handling, every user at once sees an error. Cascading failure.
This is predictable. Plan for it.
Response 1: Exponential Backoff
Wait 1 second. Retry. Fail again? Wait 2 seconds. Retry. 4 seconds. 8 seconds.
Most transient rate limits clear in under 10 seconds.
This works for transient issues. But it blocks the user while waiting.
Response 2: Request Queue
New requests go into a queue. A worker pulls from the queue at a controlled rate, staying under the limit.
Users get slightly slower responses instead of errors.
The worker never exceeds rate limits. The queue absorbs spikes.
Example: Limit yourself to 10 requests per second even if the API allows 100. When traffic spikes, users wait in queue instead of failing.
Response 3: Fallback Models
Primary model hits rate limit? Route to a different provider.
Users never see the limit. Your architecture quietly switches lanes.
Best Practices
- Monitor your usage. Before you hit the limit, you're at 80% of it.
- Set your own limit lower. If the API allows 1000/min, limit yourself to 800. Leave headroom.
- Implement all three. Backoff for transient issues. Queue for sustained load. Fallback for provider problems.
- Alert early. When you hit 70% usage, alert your team. Investigate why.
Conclusion: Building for Production
The difference between an MVP that dies and one that scales isn't the idea. It's the architecture.
Build with these patterns from day one:
- 3-layer architecture: Gateway → Processing → Intelligence
- Circuit breakers and fallbacks: Plan for failure
- Monitoring and feedback loops: See what's happening
- Evaluation-driven development: Don't guess
- Right tool for the job: Use Haiku for simple tasks, not Sonnet
- Memory systems: Remember context across sessions
- Graceful degradation: Slow is better than broken
These are the difference between "it worked in the demo" and "it works at scale."
Quick Reference Checklists
Before Launch Checklist
- Error handling for all API calls
- Backup model configured
- Monitoring/logging in place
- User feedback mechanism (thumbs down)
- Rate limit handling (backoff, queue, or fallback)
- Latency budget tested
- Cost analysis run
- Eval set created for your core prompts
Production Readiness
- Circuit breaker pattern implemented
- Cache strategy defined
- Long-term memory system in place (if needed)
- Observability dashboard set up
- Fallback provider configured
- Cost alerts set to trigger at thresholds
- Docs on how to investigate failures