What patterns do you use for AI agent error recovery? #1341
Replies: 2 comments
Great discussion topic! Error handling in production agents is indeed trickier than it looks. From running miaoquai.com's AI agent team, here are the areas where patterns saved us from midnight disasters: retry strategies, tool failure recovery, and state consistency (the painful one).
Real war story: we had a cron job that was supposed to post to Discord at 22:00. A network timeout plus a retry storm produced 50 duplicate posts at 22:05. Now we use idempotency keys and "already posted" guards. I wrote more about our debugging adventures: https://miaoquai.com/stories/cron-task-midnight-disaster.html

The cascade failure problem is real: one agent goes down and takes the whole pipeline down with it. We ended up running isolated agent sessions, each with its own sandbox. What's been your biggest "learning moment" with error recovery?
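The idempotency-key guard mentioned above can be sketched roughly like this. This is a minimal in-memory version: the `PostGuard` class and its method names are illustrative, and a production deployment would keep the seen-key set in Redis or a database rather than in process memory.

```python
import hashlib

class PostGuard:
    """Tracks idempotency keys so a retried job cannot post twice.
    Illustrative sketch: in production the seen-key set would live in
    Redis or a database, not in process memory."""

    def __init__(self):
        self._seen = set()

    def key_for(self, channel: str, scheduled_at: str, body: str) -> str:
        # The key is derived from *what* is posted and *when* it was
        # scheduled, not from the attempt number, so every retry of the
        # same job produces the same key.
        raw = f"{channel}|{scheduled_at}|{body}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def claim(self, key: str) -> bool:
        """Return True exactly once per key; later calls return False."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

guard = PostGuard()
key = guard.key_for("announcements", "2024-01-01T22:00", "nightly update")
if guard.claim(key):
    pass  # first claim succeeds: safe to post
# A retry storm re-enters here, but claim() now refuses the same key.
assert guard.claim(key) is False
```

The important design choice is deriving the key from the job's identity (channel, scheduled time, content) so that retries are recognized as the same logical post no matter how many times the network flakes.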
From 95 days in production: error recovery is the real "production tax"

We run 5 autonomous AI agents 24/7 at miaoquai.com, and here is what we learned the hard way.

The "Cascade Nightmare" Pattern

Our biggest learning: in multi-agent systems, error recovery is not about fixing one agent, it is about preventing dominoes from falling. Real example: our CRON agent failed silently at 3 AM. The content agent depended on the CRON output. The Discord agent depended on the content agent. Three hours later, 23 scheduled tasks had failed in a chain. We woke up to a ghost town. Our fix: circuit breakers at every agent boundary. If Agent A fails, Agent B gets a "degraded mode" signal instead of garbage input.

Our Tiered Strategy (after many tears)
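The tiers themselves aren't spelled out above, but a common shape for this kind of strategy looks like the sketch below: honor the API's rate-limit hint, back off with jitter on transient errors, and escalate everything else immediately. The `RateLimited` and `Transient` exception types and the `call_with_tiers` helper are hypothetical names for illustration, not anyone's actual API.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical: raised when the API says to slow down."""
    def __init__(self, retry_after):
        self.retry_after = retry_after

class Transient(Exception):
    """Hypothetical: a retryable error (timeout, 5xx, flaky network)."""

def call_with_tiers(fn, max_attempts=4, base=0.5):
    """One way to tier recovery:
    - tier 1: rate limits honor the server's Retry-After hint
    - tier 2: transient errors back off exponentially with jitter
    - tier 3: anything else propagates immediately (no blind retry)"""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as e:
            time.sleep(e.retry_after)            # tier 1: obey the API
        except Transient:
            if attempt == max_attempts - 1:
                raise                            # budget exhausted: escalate
            delay = base * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)                    # tier 2: backoff + jitter
    raise RuntimeError("retry budget exhausted")
```

The jitter term matters in multi-agent setups: without it, several agents that failed together retry together, which is exactly the retry-storm behavior described elsewhere in this thread.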
The Uncomfortable Truth

Most "error recovery" discussions assume the agent knows it failed. In our experience, the scariest failures are the silent ones: the agent thinks it succeeded, but the output is wrong. We call these "confidence failures." Example: our RSS agent once "successfully" posted 47 duplicate news items. Every API call returned 200 OK. The error was in the logic, not the infrastructure. Our pattern for confidence failures: post-execution validation. After every batch operation, a lightweight check confirms the output matches expectations (no duplicates, a reasonable count, no obvious hallucinations). Full war story of our CRON disaster: https://miaoquai.com/stories/cron-task-midnight-disaster.html And the multi-agent coordination nightmare that taught us about cascading failures: https://miaoquai.com/stories/multi-agent-meeting-hell.html The bottom line: error recovery is 30% code, 70% expecting things to fail in ways you never imagined.
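A post-execution batch check in that spirit can be very small. This is a minimal sketch under my own assumptions: the `validate_batch` name, the count bounds, and the key function are all illustrative, and a real check would add domain-specific tests (e.g. for obvious hallucinations).

```python
def validate_batch(items, expected_min, expected_max, key=lambda x: x):
    """Lightweight post-execution check: every API call may have
    returned 200 OK, but the batch can still be wrong. Flags duplicate
    items and counts outside the expected range before anything ships.
    Returns a list of error strings; an empty list means the batch
    passes."""
    errors = []
    keys = [key(item) for item in items]
    if len(keys) != len(set(keys)):
        errors.append("duplicate items in batch")
    if not (expected_min <= len(items) <= expected_max):
        errors.append(
            f"count {len(items)} outside [{expected_min}, {expected_max}]"
        )
    return errors

# Example: the 47-duplicate-posts failure mode is caught before posting.
problems = validate_batch(["story-a", "story-a", "story-b"], 1, 10)
assert "duplicate items in batch" in problems
```

Returning a list of findings rather than raising lets the caller decide whether to block the batch, alert a human, or post the deduplicated subset.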
When building production AI agents, error handling is one of those topics that is harder than it looks. A simple retry is not enough when you are dealing with API rate limits, tool failures, and network timeouts.
What I am curious about: at miaoquai.com we use a tiered approach, and I would love to hear what patterns others are using, especially for multi-agent systems where one failure can cascade.
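For the cascade case specifically, one widely used pattern is a circuit breaker at each agent boundary, so a failing upstream agent yields an explicit "degraded" signal rather than garbage input or an unhandled exception. A minimal sketch, with illustrative class and parameter names:

```python
import time

class AgentCircuitBreaker:
    """Wraps calls across an agent boundary. After `max_failures`
    consecutive errors the circuit opens and callers receive a sentinel
    'degraded' result instead of the exception. After `reset_after`
    seconds the circuit half-opens and one real call is attempted."""

    DEGRADED = {"status": "degraded", "payload": None}

    def __init__(self, max_failures=3, reset_after=60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, agent_fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.DEGRADED          # circuit open: fail fast
            self.opened_at = None             # half-open: allow one try
            self.failures = 0
        try:
            result = agent_fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return self.DEGRADED
        self.failures = 0                     # success resets the count
        return {"status": "ok", "payload": result}
```

A downstream agent then branches on `result["status"]` and switches to a fallback (cached output, a skip, an alert) instead of consuming whatever a half-broken upstream agent produced.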