Plan: Resilient Hot-Reload Recovery for Decopilot Runs
Context
When the dev server hot-reloads (bun --hot) mid-stream, the Claude Code agent child process is killed, the in-memory RunRegistry is wiped, and the NATS JetStream buffer (memory-only) is lost. The thread nevertheless stays in_progress in the DB, because stopAll() fires FORCE_FAIL as fire-and-forget and the async reactor may not complete before the process is replaced.
The frontend detects isRunInProgress and tries /attach, which returns 204 (no run in registry). It retries 3 times, then gives up. The user sees "No response was generated" + "Run in progress" stuck forever; the only escape is manually clicking cancel.
Answer to the question: No, the agent is NOT still running after hot reload. The child process is killed. But the Claude Code SDK stores conversation history on disk (~/.claude/projects/), so thread context survives restarts. We can leverage this by sending a "continue" message with context.
Approach: Detect Ghost + Auto-Continue
- Server startup: Sweep the DB for ghost threads (in_progress with no run in the registry) and mark them as failed (interrupted)
- Frontend: When ghost detected, replace "No response was generated" with a "Continue" button that sends a contextual resume message
Changes
1. Add listByStatus() to thread storage
File: apps/mesh/src/storage/threads.ts + apps/mesh/src/storage/ports.ts
Add method to find all ghost threads on startup:
listByStatus(status: string): Promise<Array<{ id: string; organization_id: string }>>
// SELECT id, organization_id FROM thread WHERE status = $1
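For unit-testing the sweep without a database, an in-memory stand-in for this method may be useful. The following is a minimal sketch only: ThreadRow, ThreadStatusLookup, and InMemoryThreadStorage are hypothetical names introduced for illustration, and the real port in ports.ts presumably carries more methods and columns.

```typescript
// Hypothetical row shape; the real thread table has more columns.
interface ThreadRow {
  id: string;
  organization_id: string;
  status: string;
}

// Only the slice of the storage port this plan adds (assumed shape).
interface ThreadStatusLookup {
  listByStatus(status: string): Promise<Array<{ id: string; organization_id: string }>>;
}

// In-memory stand-in mirroring:
//   SELECT id, organization_id FROM thread WHERE status = $1
class InMemoryThreadStorage implements ThreadStatusLookup {
  private readonly rows: ThreadRow[];

  constructor(rows: ThreadRow[]) {
    this.rows = rows;
  }

  async listByStatus(status: string): Promise<Array<{ id: string; organization_id: string }>> {
    return this.rows
      .filter((row) => row.status === status)
      // Project only the two selected columns, like the SQL query does.
      .map(({ id, organization_id }) => ({ id, organization_id }));
  }
}
```

Returning only id and organization_id keeps the result small and matches what the sweep needs to call update() and emit per-organization events.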
2. Server startup ghost-run sweep
File: apps/mesh/src/api/app.ts (~line 318, after RunRegistry creation)
After creating the RunRegistry, run an async sweep:
// Fire-and-forget: clean up any threads left in_progress from previous process
threadStorage.listByStatus("in_progress").then(async (ghosts) => {
  for (const ghost of ghosts) {
    await threadStorage.update(ghost.id, ghost.organization_id, { status: "failed" });
    sseHub.emit(ghost.organization_id, createDecopilotThreadStatusEvent(ghost.id, "failed"));
    sseHub.emit(ghost.organization_id, createDecopilotFinishEvent(ghost.id, "failed"));
    console.warn("[decopilot] Cleaned up ghost run", { threadId: ghost.id });
  }
}).catch(err => console.error("[decopilot] Ghost sweep failed", err));
This runs once on startup, non-blocking. Any thread stuck as in_progress without a corresponding run is a ghost.
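One thing the loop above leaves open is what happens if a single update throws: the remaining ghosts would stay in_progress until the next restart. A possible variant that isolates failures per thread is sketched here. sweepGhostRuns, SweepStorage, and the notify callback are hypothetical stand-ins for threadStorage and the sseHub emits, not existing code.

```typescript
interface GhostRef {
  id: string;
  organization_id: string;
}

// Narrowed, assumed view of the thread storage used by the sweep.
interface SweepStorage {
  listByStatus(status: string): Promise<GhostRef[]>;
  update(id: string, organizationId: string, patch: { status: string }): Promise<unknown>;
}

// Hypothetical helper: force-fails every in_progress thread and reports
// which ids were cleaned and which could not be.
async function sweepGhostRuns(
  storage: SweepStorage,
  notify: (ghost: GhostRef) => void,
): Promise<{ cleaned: string[]; failed: string[] }> {
  const ghosts = await storage.listByStatus("in_progress");
  const cleaned: string[] = [];
  const failed: string[] = [];
  for (const ghost of ghosts) {
    try {
      await storage.update(ghost.id, ghost.organization_id, { status: "failed" });
      notify(ghost); // e.g. emit the thread-status and finish SSE events
      cleaned.push(ghost.id);
    } catch {
      // Keep going so one bad row (or a throwing notify) does not block the rest.
      failed.push(ghost.id);
    }
  }
  return { cleaned, failed };
}
```

The returned lists make it easy to log a single summary line at startup instead of one warning per thread.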
3. Frontend: auto-cancel on resume failure (fast ghost resolution)
File: apps/mesh/src/web/components/chat/chat-provider.tsx (TaskStreamManager, line ~129)
When tryResumeStream fails (which means /attach returned 204), instead of retrying 3 times with 30s polling, immediately call the cancel endpoint on the first failure:
// In the .catch handler after resume fails:
chatStore.cancelRun(); // triggers ghost detection server-side (routes.ts:391-413)
The cancel endpoint already has ghost detection that force-fails the thread and emits SSE events.
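The control flow described here (try to resume once, cancel immediately on failure) can be sketched as a small helper. This is an illustration under assumptions: resumeOrCancel is a hypothetical name, and tryResume and cancelRun are injected placeholders for tryResumeStream and chatStore.cancelRun(), whose real signatures are not shown in this plan.

```typescript
type ResumeOutcome = "resumed" | "cancelled";

// Hypothetical helper: one resume attempt, then cancel on the first failure
// instead of retrying with polling.
async function resumeOrCancel(
  tryResume: () => Promise<void>,
  cancelRun: () => Promise<void>,
): Promise<ResumeOutcome> {
  try {
    await tryResume();
    return "resumed";
  } catch {
    // /attach had no run to attach to: let the server's ghost detection
    // force-fail the thread via the cancel endpoint.
    await cancelRun();
    return "cancelled";
  }
}
```

Injecting both functions keeps the helper testable without a running server or store.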
4. "Continue" button in EmptyAssistantState
File: apps/mesh/src/web/components/chat/message/assistant.tsx (line 370)
Replace the static EmptyAssistantState with a component that shows a "Continue" button when the thread was interrupted. The button sends a contextual message like:
"The previous run was interrupted by a server restart. Please continue where you left off. Here's a brief summary of what was being done: [last user message content]"
Implementation:
- EmptyAssistantState needs access to: whether this is the last pair, the thread status (failed), and the user's last message
- Pass isLast and the user message from MessagePair props down to MessageAssistant
- When isLast && message === null && !isLoading && thread.status === "failed":
  - Show "Run was interrupted" text
  - Render a "Continue" button that calls chatStore.sendMessage() with a pre-built continuation prompt
  - The prompt includes the last user message text for context
function EmptyAssistantState({ isLast, userMessage }: { isLast: boolean; userMessage?: ChatMessage }) {
  const threadStatus = useChatStore(s => {
    const thread = s.threads.find(t => t.id === s.activeThreadId);
    return thread?.status;
  });
  // Ghost/interrupted run: show continue button
  if (isLast && threadStatus === "failed" && userMessage) {
    // Fall back to an empty list so a message without parts yields "" rather than undefined
    const userText = (userMessage.parts ?? [])
      .filter(p => p.type === "text")
      .map(p => p.text)
      .join(" ")
      .slice(0, 200);
    return (
      <div className="flex flex-col gap-2 py-2">
        <div className="text-[14px] text-muted-foreground/60">
          Run was interrupted by a server restart
        </div>
        <button
          className="text-[13px] text-primary hover:underline self-start"
          onClick={() => {
            chatStore.sendMessage({
              parts: [{ type: "text", text: `The previous run was interrupted. Please continue where you left off. The original request was: "${userText}"` }],
            });
          }}
        >
          Continue conversation
        </button>
      </div>
    );
  }
  return (
    <div className="text-[14px] text-muted-foreground/60 py-2">
      No response was generated
    </div>
  );
}
Prop threading:
MessagePair component (pair.tsx:59) already has pair.user — pass it to MessageAssistant
MessageAssistant passes it to EmptyAssistantState when rendering the empty state
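The continuation prompt can also be built by a pure function, which keeps the text-extraction and truncation logic testable outside React. A minimal sketch, assuming a simplified MessagePart shape and a hypothetical buildContinuationPrompt helper (neither exists in the codebase as far as this plan shows):

```typescript
// Simplified, assumed part shape; the real ChatMessage parts are richer.
interface MessagePart {
  type: string;
  text?: string;
}

// Hypothetical helper: joins the text parts of the last user message,
// truncates them, and wraps them in the continuation prompt.
function buildContinuationPrompt(parts: MessagePart[] | undefined, maxLength = 200): string {
  const userText = (parts ?? [])
    .filter((part) => part.type === "text" && typeof part.text === "string")
    .map((part) => part.text ?? "")
    .join(" ")
    .slice(0, maxLength);
  const base = "The previous run was interrupted. Please continue where you left off.";
  // Omit the quoted request entirely when there is no text to quote.
  return userText ? `${base} The original request was: "${userText}"` : base;
}
```

With this in place, the button's onClick would only need to pass the result to chatStore.sendMessage() as a single text part.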
5. Pass user message through component tree
File: apps/mesh/src/web/components/chat/message/pair.tsx (line 89)
Add userMessage prop to MessageAssistant:
<MessageAssistant
  message={pair.assistant}
  userMessage={pair.user} // NEW
  status={status}
  isLast={isLastPair}
  isPlanMode={isPlanMode}
/>
File: apps/mesh/src/web/components/chat/message/assistant.tsx
Add userMessage to MessageAssistant props and pass it to EmptyAssistantState.
Files to modify
| File | Change |
| --- | --- |
| apps/mesh/src/storage/ports.ts | Add listByStatus() to ThreadStoragePort |
| apps/mesh/src/storage/threads.ts | Implement listByStatus() query |
| apps/mesh/src/api/app.ts | Add startup ghost sweep (~line 318) |
| apps/mesh/src/web/components/chat/chat-provider.tsx | Auto-cancel on first resume failure |
| apps/mesh/src/web/components/chat/message/assistant.tsx | "Continue" button in EmptyAssistantState |
| apps/mesh/src/web/components/chat/message/pair.tsx | Pass userMessage to MessageAssistant |
Edge cases
- Multiple ghosts: Startup sweep handles all in one pass
- Concurrent hot reloads: Force-fail is idempotent (in_progress -> failed transition only)
- SSE reconnect: EventSource auto-reconnects after restart; ghost sweep SSE events emit after hub is ready
- Partial messages: Any messages saved at 5-step checkpoints survive; the gap between last checkpoint and crash is lost (acceptable for dev)
- Non-interrupted failures: The "Continue" button only shows when isLast && message === null && threadStatus === "failed"; regular failures with partial responses won't trigger it (they have content)
- Claude Code memory: The SDK stores session history at ~/.claude/projects/, so when the user sends the continue message, the new agent instance can load thread history from both our DB and the SDK's session files
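The idempotency claim for concurrent hot reloads rests on force-fail only ever applying the in_progress -> failed transition. A minimal sketch of such a guard follows; forceFail is a hypothetical name, and the "completed" status is an assumption added for illustration (only in_progress and failed appear in this plan).

```typescript
// "completed" is assumed here; the plan itself only names in_progress and failed.
type ThreadStatus = "in_progress" | "completed" | "failed";

// Hypothetical guard: only in_progress threads move to failed;
// every other status is returned unchanged.
function forceFail(current: ThreadStatus): { next: ThreadStatus; changed: boolean } {
  if (current !== "in_progress") {
    return { next: current, changed: false };
  }
  return { next: "failed", changed: true };
}
```

Applying the guard twice gives the same result as applying it once, so two overlapping sweeps or a sweep racing the cancel endpoint would not overwrite a status set in between.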
Verification
- Start a Claude Code run that takes time (e.g., "search the codebase for all TODO comments")
- While streaming, save a file to trigger hot reload
- Expected: within 1-2s, the thread transitions to "failed"
- UI shows "Run was interrupted by a server restart" + "Continue conversation" button
- Click "Continue" — sends a message with context, agent picks up where it left off
Future: True Resume (out of scope for now)
The Claude Agent SDK supports resume: sessionId + resumeSessionAt: messageUuid. A future enhancement could:
- Store a unique session UUID per thread (instead of session_id: "chat")
- On restart, re-spawn the agent with resume to continue from where it left off
- Re-stream the resumed output to the client
This is complex (duplicate content detection, partial tool state, session file integrity) and better suited as a production feature with proper testing.