If you want to understand how an AI coding agent actually works at the systems level, forget the marketing pages. Open src/query.ts. At 1,729 lines, it is the single largest file in Claude Code's codebase, and it is where every user message enters, every tool call is dispatched, every error is recovered from, and every response eventually exits. This file is the beating heart of the entire system.
Extensive reading through this file and the surrounding modules reveals a surprisingly honest piece of production engineering. Not a clean-room research prototype, but a system that has clearly been shaped by thousands of real-world failure modes. The following sections walk through it.
The query function is an async generator that takes a user message and yields a stream of events back to the caller. The signature tells you a lot about the architecture right away:
```typescript
// src/query.ts:219-228
export async function* query(
  params: QueryParams,
): AsyncGenerator<
  | StreamEvent
  | RequestStartEvent
  | Message
  | TombstoneMessage
  | ToolUseSummaryMessage,
  Terminal
>
```

The return type is a union of five different event types. The Terminal return value (as opposed to yielded values) carries the reason the loop exited: 'completed', 'aborted_streaming', 'aborted_tools', 'max_turns', 'prompt_too_long', 'model_error', 'image_error', 'hook_stopped', or 'stop_hook_prevented'. Each of those exit paths was added because of a real bug in production.
The outer query() function is actually just a thin wrapper that delegates to queryLoop() and does lifecycle bookkeeping for consumed commands:
```typescript
// src/query.ts:229-239
const consumedCommandUuids: string[] = []
const terminal = yield* queryLoop(params, consumedCommandUuids)
for (const uuid of consumedCommandUuids) {
  notifyCommandLifecycle(uuid, 'completed')
}
return terminal
```

This separation exists so command lifecycle notifications only fire on successful completion -- not on throws or .return() calls that abort the generator.
The queryLoop function is one giant while (true) loop. Each iteration represents one round-trip to the Claude API. The loop continues as long as the model keeps requesting tool calls.
The loop carries a mutable State object between iterations:
```typescript
// src/query.ts:204-217
type State = {
  messages: Message[]
  toolUseContext: ToolUseContext
  autoCompactTracking: AutoCompactTrackingState | undefined
  maxOutputTokensRecoveryCount: number
  hasAttemptedReactiveCompact: boolean
  maxOutputTokensOverride: number | undefined
  pendingToolUseSummary: Promise<ToolUseSummaryMessage | null> | undefined
  stopHookActive: boolean | undefined
  turnCount: number
  transition: Continue | undefined
}
```

The transition field is quietly brilliant. It records why the previous iteration continued -- 'next_turn', 'reactive_compact_retry', 'max_output_tokens_recovery', 'collapse_drain_retry', 'stop_hook_blocking', 'max_output_tokens_escalate', or 'token_budget_continuation'. This means the current iteration can make different decisions based on what happened last time. For example, if a collapse_drain_retry already fired and the API still returned a prompt-too-long error, the loop knows to fall through to reactive compact instead of draining again.
Each iteration of the while (true) loop follows this sequence:

1. Pre-processing: Apply tool result budgets, snip compaction, microcompact, context collapse, and auto-compact to the message history. This is where the system fights context window pressure before ever hitting the API.
2. API call: Stream the response from Claude via deps.callModel(). While streaming, tool use blocks are detected and (with streaming tool execution enabled) immediately dispatched to the StreamingToolExecutor.
3. Post-streaming: Handle withheld errors (prompt-too-long, max-output-tokens, media size errors), execute post-sampling hooks, check for user abort.
4. Tool execution: Either drain remaining results from the StreamingToolExecutor or fall back to runTools() from toolOrchestration.ts.
5. Attachment injection: Add memory attachments, file change notifications, skill discovery results, and queued command outputs.
6. Continue or return: Build the next State and continue, or return a Terminal.
The key insight is that steps 2 and 4 can overlap when streaming tool execution is enabled. The model is still streaming tokens while the first tool calls are already running. This is a major latency win.
The inner streaming loop (starting around line 659) iterates over messages from the API:
```typescript
// src/query.ts:659-708 (simplified)
for await (const message of deps.callModel({
  messages: prependUserContext(messagesForQuery, userContext),
  systemPrompt: fullSystemPrompt,
  thinkingConfig: toolUseContext.options.thinkingConfig,
  tools: toolUseContext.options.tools,
  signal: toolUseContext.abortController.signal,
  options: { model: currentModel, fallbackModel, ... },
})) {
  // Handle streaming fallback (model switch mid-stream)
  // Backfill tool_use inputs for SDK observers
  // Withhold recoverable errors
  // Track assistant messages and tool_use blocks
  // Feed tool_use blocks to StreamingToolExecutor
  // Yield completed tool results mid-stream
}
```

One detail worth appreciating: the code carefully clones assistant messages before yielding them to SDK consumers, but keeps the originals for the API message history. This is because backfillObservableInput adds derived fields (like expanded file paths) that SDK consumers want to see, but sending them back to the API would break prompt cache hits due to byte mismatches. That level of cache awareness is threaded throughout the entire system.
The StreamingToolExecutor (src/services/tools/StreamingToolExecutor.ts) is one of the more impressive pieces of the codebase. It executes tools while the model is still streaming its response.
The executor tracks each tool with a status: 'queued' -> 'executing' -> 'completed' -> 'yielded'. The concurrency rules are:
```typescript
// src/services/tools/StreamingToolExecutor.ts:129-135
private canExecuteTool(isConcurrencySafe: boolean): boolean {
  const executingTools = this.tools.filter(t => t.status === 'executing')
  return (
    executingTools.length === 0 ||
    (isConcurrencySafe && executingTools.every(t => t.isConcurrencySafe))
  )
}
```

Translation: concurrent-safe tools (reads, searches) can run in parallel with each other. Non-concurrent tools (writes, bash commands) get exclusive access. This is determined per-tool via the isConcurrencySafe() method on each tool definition.
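The gate can be reproduced as a free function over a list of tool states -- a sketch with illustrative types, not the real class:

```typescript
// Standalone restatement of the concurrency gate: a candidate may start
// only if nothing is executing, or if both the candidate and every
// currently executing tool are concurrency-safe.
type ToolState = {
  status: 'queued' | 'executing' | 'completed'
  isConcurrencySafe: boolean
}

function canExecuteTool(tools: ToolState[], isConcurrencySafe: boolean): boolean {
  const executing = tools.filter(t => t.status === 'executing')
  return (
    executing.length === 0 ||
    (isConcurrencySafe && executing.every(t => t.isConcurrencySafe))
  )
}
```

Note the asymmetry: a non-safe tool must wait even for safe tools to finish, and a safe tool must wait for any non-safe tool -- exclusivity cuts both ways.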
When a new tool_use block arrives from the stream, it's immediately added to the executor:
```typescript
// src/query.ts:838-844
if (streamingToolExecutor && !toolUseContext.abortController.signal.aborted) {
  for (const toolBlock of msgToolUseBlocks) {
    streamingToolExecutor.addTool(toolBlock, message)
  }
}
```
}And completed results are polled during the same streaming loop:
```typescript
// src/query.ts:851-862
for (const result of streamingToolExecutor.getCompletedResults()) {
  if (result.message) {
    yield result.message
    toolResults.push(...)
  }
}
```

Here's where things get really thoughtful. When a Bash command fails, it aborts sibling tools via a dedicated siblingAbortController:
```typescript
// src/services/tools/StreamingToolExecutor.ts:357-364
if (tool.block.name === BASH_TOOL_NAME) {
  this.hasErrored = true
  this.erroredToolDescription = this.getToolDescription(tool)
  this.siblingAbortController.abort('sibling_error')
}
```

But only Bash errors trigger this. Read, WebFetch, and other independent tools can fail without affecting siblings. The comment explains the reasoning: "Bash commands often have implicit dependency chains (e.g. mkdir fails -> subsequent commands pointless). Read/WebFetch/etc are independent -- one failure shouldn't nuke the rest." This kind of nuanced error handling is what separates production code from demos.
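The rule is small enough to isolate. Here is a minimal sketch, assuming an invented SiblingGroup wrapper around a standard AbortController (the real executor wires this into its tool queue):

```typescript
// Illustrative sketch of the sibling-abort policy: a failing bash-like
// tool aborts the shared sibling signal; independent tools do not.
class SiblingGroup {
  readonly siblingAbortController = new AbortController()

  onToolError(toolName: string): void {
    // Bash failures imply dependency chains (mkdir fails -> later
    // commands are pointless), so cancel queued siblings.
    if (toolName === 'Bash') {
      this.siblingAbortController.abort('sibling_error')
    }
    // Read, WebFetch, etc. fail independently: siblings keep running.
  }
}
```

Queued tools would observe siblingAbortController.signal and skip execution once it's aborted.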
When streaming tool execution is disabled (or for older code paths), the system falls back to runTools() from src/services/tools/toolOrchestration.ts. This uses a simpler batch model:
```typescript
// src/services/tools/toolOrchestration.ts:91-116
function partitionToolCalls(
  toolUseMessages: ToolUseBlock[],
  toolUseContext: ToolUseContext,
): Batch[] {
  return toolUseMessages.reduce((acc: Batch[], toolUse) => {
    const tool = findToolByName(toolUseContext.options.tools, toolUse.name)
    const isConcurrencySafe = /* ... parse and check ... */
    if (isConcurrencySafe && acc[acc.length - 1]?.isConcurrencySafe) {
      acc[acc.length - 1]!.blocks.push(toolUse)
    } else {
      acc.push({ isConcurrencySafe, blocks: [toolUse] })
    }
    return acc
  }, [])
}
```

It partitions tool calls into consecutive batches of concurrent-safe or serial tools, then executes each batch appropriately. The max concurrency is configurable via CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY (defaulting to 10).
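Stripped of the real tool-resolution machinery, the partitioning rule reduces to the following (a sketch with illustrative types; the reduce mirrors the logic above):

```typescript
// Consecutive concurrency-safe calls coalesce into one parallel batch;
// every non-safe call starts its own serial batch (and a safe call after
// a non-safe one starts a fresh batch too).
type Call = { name: string; isConcurrencySafe: boolean }
type Batch = { isConcurrencySafe: boolean; blocks: Call[] }

function partition(calls: Call[]): Batch[] {
  return calls.reduce((acc: Batch[], call) => {
    const last = acc[acc.length - 1]
    if (call.isConcurrencySafe && last?.isConcurrencySafe) {
      last.blocks.push(call)
    } else {
      acc.push({ isConcurrencySafe: call.isConcurrencySafe, blocks: [call] })
    }
    return acc
  }, [])
}
```

So a sequence like Read, Grep, Write, Read yields three batches: the two reads run in parallel, the write runs alone, and the trailing read runs in its own batch because order is preserved.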
The error recovery in query.ts is extensive. There are at least seven distinct recovery mechanisms, each addressing a different failure mode.
When the primary model is unavailable (rate limited, overloaded), a FallbackTriggeredError triggers a mid-stream model switch:
```typescript
// src/query.ts:894-950
} catch (innerError) {
  if (innerError instanceof FallbackTriggeredError && fallbackModel) {
    currentModel = fallbackModel
    attemptWithFallback = true
    // Clear all state from the failed attempt
    assistantMessages.length = 0
    toolResults.length = 0
    // Discard streaming tool executor and create fresh one
    if (streamingToolExecutor) {
      streamingToolExecutor.discard()
      streamingToolExecutor = new StreamingToolExecutor(...)
    }
    // Strip thinking signatures (model-bound, would cause 400s)
    messagesForQuery = stripSignatureBlocks(messagesForQuery)
    continue
  }
}
```

The stripSignatureBlocks call is interesting -- thinking block signatures are cryptographically bound to the model that produced them. Replaying a Capybara thinking block to Opus would cause a 400 error. This kind of edge case only surfaces in production.
When the model hits its output token limit mid-response, the system does an escalating retry:
- First: If using the default 8k cap, retry at 64k (ESCALATED_MAX_TOKENS) with no meta message -- the same request, just a bigger output budget.
- Then: Inject a meta message telling the model to resume: "Output token limit hit. Resume directly -- no apology, no recap." Up to 3 retries (MAX_OUTPUT_TOKENS_RECOVERY_LIMIT).
- Finally: Surface the error to the user.
The recovery message is carefully worded to prevent the model from wasting tokens apologizing or summarizing what it already said. That wording was almost certainly tuned through painful iteration.
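The ladder can be sketched as a pure decision function. The 8k/64k caps and the retry limit of 3 come from the text above; the Recovery shape and function name are assumptions for illustration:

```typescript
// Escalating max-output-tokens recovery: first bump the budget silently,
// then retry with a resume meta message, finally give up.
const DEFAULT_MAX_TOKENS = 8_192
const ESCALATED_MAX_TOKENS = 65_536
const MAX_OUTPUT_TOKENS_RECOVERY_LIMIT = 3

type Recovery =
  | { action: 'retry'; maxTokens: number; injectResumeMessage: boolean }
  | { action: 'surface_error' }

function onMaxOutputTokens(currentCap: number, retryCount: number): Recovery {
  if (currentCap === DEFAULT_MAX_TOKENS) {
    // First rung: identical request, bigger output budget, no meta message.
    return {
      action: 'retry',
      maxTokens: ESCALATED_MAX_TOKENS,
      injectResumeMessage: false,
    }
  }
  if (retryCount < MAX_OUTPUT_TOKENS_RECOVERY_LIMIT) {
    // Middle rungs: ask the model to resume without apology or recap.
    return { action: 'retry', maxTokens: currentCap, injectResumeMessage: true }
  }
  return { action: 'surface_error' }
}
```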
When the API returns a prompt-too-long error, the system doesn't just fail. It first tries context collapse (draining staged collapses), then falls back to reactive compaction -- summarizing the conversation to fit within the context window. This all happens transparently:
```typescript
// src/query.ts:1119-1166
if ((isWithheld413 || isWithheldMedia) && reactiveCompact) {
  const compacted = await reactiveCompact.tryReactiveCompact({
    hasAttempted: hasAttemptedReactiveCompact,
    messages: messagesForQuery,
    cacheSafeParams: { systemPrompt, userContext, systemContext, ... },
  })
  if (compacted) {
    const postCompactMessages = buildPostCompactMessages(compacted)
    for (const msg of postCompactMessages) yield msg
    state = {
      ...state,
      messages: postCompactMessages,
      hasAttemptedReactiveCompact: true,
      transition: { reason: 'reactive_compact_retry' },
    }
    continue
  }
}
```

The hasAttemptedReactiveCompact flag prevents infinite loops -- if compaction fails to bring the context under the limit, the error surfaces on the next iteration rather than compacting again.
When a streaming fallback occurs (model switch mid-response), orphaned assistant messages -- including their thinking blocks with model-bound cryptographic signatures -- must be removed. The system yields "tombstone" messages:
```typescript
// src/query.ts:713-727
for (const msg of assistantMessages) {
  yield { type: 'tombstone' as const, message: msg }
}
assistantMessages.length = 0
```

SDK consumers and the UI interpret tombstones as "delete this message from the transcript." Without this, you'd get cryptic "thinking blocks cannot be modified" API errors on the retry.
The AgentTool (src/tools/AgentTool/runAgent.ts) spawns child agents that run their own independent query() loops. Each sub-agent gets:
- Its own agentId -- a unique identifier for transcript recording, Perfetto tracing, and analytics scoping
- An isolated ToolUseContext -- created via createSubagentContext() from src/utils/forkedAgent.ts, which clones the parent's file state cache and content replacement state
- Filtered tools -- resolved via resolveAgentTools(), which applies agent-specific tool allow/deny lists
- Its own MCP servers -- agents can declare MCP servers in their frontmatter, which are connected at spawn and cleaned up on exit
- A separate abort controller -- so aborting a sub-agent doesn't kill the parent
The sub-agent runs query() directly:
```typescript
// src/tools/AgentTool/runAgent.ts:15
import { query } from '../../query.js'
```

This is the same query function. Sub-agents are full-fidelity agent loops, not some degraded inner mode. They get auto-compaction, streaming tool execution, error recovery -- everything. The only differences are scoping and isolation.
Agent definitions support several isolation modes:
- Default: Shares the parent's working directory and file system
- Worktree: Operates in a separate git worktree for parallel branch work
- Sub-agents can be synchronous (blocks the parent until complete) or asynchronous (runs in the background, reports results later)
The transcript recording is particularly clever. Each sub-agent's messages are recorded to a sidechain file under subagents/, and the parent can reconstruct the child's content replacement state from these records on resume. This means you can /resume a session and sub-agents pick up where they left off.
Context window management is arguably the hardest problem in long-running agent sessions. Claude Code attacks it with four distinct mechanisms, applied in sequence on every loop iteration:
Snip compaction is the lightest touch. It removes old, low-value messages from the middle of the conversation, and runs first because it's cheap (no API call) and might free enough tokens to skip heavier compaction.

```typescript
// src/query.ts:401-410
if (feature('HISTORY_SNIP')) {
  const snipResult = snipModule!.snipCompactIfNeeded(messagesForQuery)
  messagesForQuery = snipResult.messages
  snipTokensFreed = snipResult.tokensFreed
}
```

Microcompact compresses individual tool results that are too large, replacing them with summaries. This runs before auto-compact and can use cached results (the CACHED_MICROCOMPACT feature gate).
Context collapse is a projection-based system that archives groups of messages into summaries. Unlike auto-compact, collapses are reversible -- they're stored in a separate "collapse store" and replayed via projectView() on each iteration. The comment in query.ts explains: "the collapsed view is a read-time projection over the REPL's full history."

Auto-compact is the nuclear option. When token usage exceeds the threshold (context window minus a ~13k buffer), the entire conversation is summarized into a compact representation:
```typescript
// src/services/compact/autoCompact.ts:72-91
export function getAutoCompactThreshold(model: string): number {
  const effectiveContextWindow = getEffectiveContextWindowSize(model)
  const autocompactThreshold =
    effectiveContextWindow - AUTOCOMPACT_BUFFER_TOKENS
  return autocompactThreshold
}
```

The compaction itself runs as a forked agent (it calls the Claude API to generate the summary), which is why querySource === 'compact' gets special treatment throughout the codebase -- you don't want the compact agent to trigger its own compaction check.
Auto-compact tracking includes a circuit breaker: after MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES (3) consecutive failures, it stops trying. This prevents runaway API spend from sessions that are irrecoverably stuck.
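A minimal sketch of such a breaker, assuming only the failure limit of 3 from the text (the class and method names are invented):

```typescript
// Circuit breaker over consecutive auto-compact failures: any success
// resets the streak; an unbroken run of failures trips the breaker and
// stops further compaction attempts (and their API spend).
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3

class CompactCircuitBreaker {
  private consecutiveFailures = 0

  shouldAttempt(): boolean {
    return this.consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES
  }

  record(success: boolean): void {
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1
  }
}
```

The important property is "consecutive": a session that occasionally fails but recovers keeps its compaction, while one that is irrecoverably stuck stops burning tokens after three tries.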
The queryTracking field on ToolUseContext tracks the chain of API calls within a single user turn:
```typescript
// src/query.ts:347-355
const queryTracking = toolUseContext.queryTracking
  ? {
      chainId: toolUseContext.queryTracking.chainId,
      depth: toolUseContext.queryTracking.depth + 1,
    }
  : {
      chainId: deps.uuid(),
      depth: 0,
    }
```

Every API call in a single turn shares the same chainId but gets an incrementing depth. This is used for analytics (understanding how many round-trips a typical task takes) and for debugging (correlating log entries across a multi-step tool-use chain).
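The derivation is small enough to restate as a pure function (deriveTracking is an invented name; the logic mirrors the ternary above):

```typescript
// The first call in a turn mints a fresh chain id at depth 0;
// each follow-up call inherits the chain id at depth + 1.
type QueryTracking = { chainId: string; depth: number }

function deriveTracking(
  parent: QueryTracking | undefined,
  uuid: () => string,
): QueryTracking {
  return parent
    ? { chainId: parent.chainId, depth: parent.depth + 1 }
    : { chainId: uuid(), depth: 0 }
}
```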
The maxTurns parameter caps how many iterations the loop can run:
```typescript
// src/query.ts:1704-1712
if (maxTurns && nextTurnCount > maxTurns) {
  yield createAttachmentMessage({
    type: 'max_turns_reached',
    maxTurns,
    turnCount: nextTurnCount,
  })
  return { reason: 'max_turns', turnCount: nextTurnCount }
}
```

Sub-agents typically have lower maxTurns than the main loop, preventing runaway agent spawns from consuming unbounded resources.
Between iterations, the loop also handles:
- Tool refreshing: MCP servers that connected mid-query become available via refreshTools()
- Command queue draining: Slash commands and task notifications queued during tool execution are injected as attachments
- Memory prefetch consumption: Relevant CLAUDE.md files prefetched at turn start are injected if the prefetch has settled
- Skill discovery injection: Prefetched skill matches are attached for the next iteration
- Task summary generation: For the claude ps command, periodic summaries are generated in the background
Cost tracking is split across two modules: src/cost-tracker.ts for the accounting logic and src/bootstrap/state.ts for the global state.
The core model is per-model usage tracking:
```typescript
// src/cost-tracker.ts:71-80
type StoredCostState = {
  totalCostUSD: number
  totalAPIDuration: number
  totalAPIDurationWithoutRetries: number
  totalToolDuration: number
  totalLinesAdded: number
  totalLinesRemoved: number
  lastDuration: number | undefined
  modelUsage: { [modelName: string]: ModelUsage } | undefined
}
```

Usage accumulates via accumulateUsage() in src/services/api/claude.ts, which is called after every API response. The QueryEngine (src/QueryEngine.ts) maintains a totalUsage field that aggregates across all turns in a session:
```typescript
// src/QueryEngine.ts:189
private totalUsage: NonNullableUsage
```

Cost state persists to the project config file and is keyed by session ID, so resuming a session picks up the previous cost totals. The getStoredSessionCosts() function checks that the session ID matches before returning stored costs -- a different session's costs are discarded.
The taskBudget feature adds a hard spending cap. The remaining budget is tracked across compaction boundaries -- when a compaction fires, the system captures the pre-compact context window size and subtracts it from the remaining budget, because the server can no longer see the summarized-away history:
```typescript
// src/query.ts:508-515
if (params.taskBudget) {
  const preCompactContext =
    finalContextTokensFromLastResponse(messagesForQuery)
  taskBudgetRemaining = Math.max(
    0,
    (taskBudgetRemaining ?? params.taskBudget.total) - preCompactContext,
  )
}
```