Skip to content

bug(gateguard): race condition in session state read-modify-write path #1441

@shenchangmin

Description

@shenchangmin

Follow-up to #1406, #1419, #1439. Splitting the state-layer concurrency concern into its own issue so it can be designed and reviewed independently of the hot-path performance fix that already landed via #1439.

Problem

scripts/hooks/gateguard-fact-force.js persists session state (checked array + last_active) via a classic load-modify-save pattern with no mutual exclusion:

function markChecked(key) {
  const state = loadState();                  // 1. read
  if (!state.checked.includes(key)) {
    state.checked.push(key);                  // 2. modify
    saveState(state);                         // 3. atomic file rename
  }
}

saveState() is atomic at the filesystem level (temp file + fs.renameSync), so we never see a corrupted state file. But the read-modify-write window itself is unprotected, and the hook runs as a fresh Node process per tool invocation — so there is no shared memory between concurrent hook processes to coordinate on.

Concurrent scenario

Two tool calls that share a CLAUDE_SESSION_ID fire nearly simultaneously (parallel tool use, or rapid-fire edits on the same session):

  1. Process A: loadState(){checked: [a], last_active: t0}
  2. Process B: loadState(){checked: [a], last_active: t0} (same snapshot)
  3. Process A: markChecked(b) → writes {checked: [a, b], last_active: t1}
  4. Process B: markChecked(c) → writes {checked: [a, c], last_active: t2} (b is gone)

User-visible effect: the file b that process A just gated appears un-checked to the next invocation, so Claude gets re-prompted with the Fact-Forcing gate for a file it already investigated. Not data loss in the strict sense, but a UX regression proportional to how often parallel tool calls land on the same session.

The READ_HEARTBEAT_MS throttle landed in #1439 narrows the read-path window (~99% of invocations are now pure reads), but the markChecked() path still exhibits the full race on every new gate, and the throttled heartbeat write in isChecked() still reproduces it when it does fire.

Candidate fixes

Three options, roughly ordered by engineering cost / platform portability trade-off:

1. flock-style advisory lock around read-modify-write

Wrap the entire loadState → mutate → saveState sequence in an exclusive lock on STATE_FILE (or a sibling .lock file). Portable options:

  • proper-lockfile — cross-platform, already handles stale locks, small footprint.
  • Raw fs.openSync(path, 'wx') sentinel file with retry — zero dependency, but requires reinventing stale-lock handling.
  • Shell out to flock on POSIX + powershell Wait-File on Windows — brittle, not recommended.

Pros: closes the race fully. Cons: new dependency or stale-lock logic; slight latency on every mutation.

2. Merge-style saves

Before fs.renameSync, re-read the current on-disk state one last time and merge it with the in-memory state: union of checked, max(last_active). Roughly:

function saveState(state) {
  const disk = loadStateRaw();                // non-throwing
  state.checked = Array.from(new Set([...(disk.checked || []), ...state.checked]));
  state.last_active = Math.max(disk.last_active || 0, state.last_active);
  // ... existing prune + atomic write
}

Pros: zero new dependencies, no lock contention, matches the "additive-only state" semantics this code already has. Cons: still not strictly serializable (two merges can race), but entries are never lost — just occasionally duplicated before the next prune, which is fine.

3. In-memory coordination via short-lived lease

Maintain a per-session lease file (e.g. state-<sid>.lease) that hook processes claim for, say, 100ms before mutating. Simpler than flock but less robust than merge-style saves on crash/ctrl-C.

Pros: cheap. Cons: needs correct cleanup; behaves worst-case the same as option 2 under crashes.

Recommendation

Option 2 (merge-style saves) feels like the right default for ECC's profile: no new dependency, tolerant of crashes, preserves the "additive entries + bounded pruning" semantics already in place. If/when GateGuard needs stronger invariants (e.g. session-level transactions across multiple files), revisit with option 1.

Severity

P2 — UX regression only, no data loss. Easy to reproduce under parallel tool calls but rare in typical interactive sessions.

Links

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions