Follow-up to #1406, #1419, #1439. Splitting the state-layer concurrency concern into its own issue so it can be designed and reviewed independently of the hot-path performance fix that already landed via #1439.
Problem
scripts/hooks/gateguard-fact-force.js persists session state (checked array + last_active) via a classic load-modify-save pattern with no mutual exclusion:
function markChecked(key) {
const state = loadState(); // 1. read
if (!state.checked.includes(key)) {
state.checked.push(key); // 2. modify
saveState(state); // 3. atomic file rename
}
}
saveState() is atomic at the filesystem level (temp file + fs.renameSync), so we never see a corrupted state file. But the read-modify-write window itself is unprotected, and the hook runs as a fresh Node process per tool invocation — so there is no shared memory between concurrent hook processes to coordinate on.
Concurrent scenario
Two tool calls that share a CLAUDE_SESSION_ID fire nearly simultaneously (parallel tool use, or rapid-fire edits on the same session):
- Process A:
loadState() → {checked: [a], last_active: t0}
- Process B:
loadState() → {checked: [a], last_active: t0} (same snapshot)
- Process A:
markChecked(b) → writes {checked: [a, b], last_active: t1}
- Process B:
markChecked(c) → writes {checked: [a, c], last_active: t2} (b is gone)
User-visible effect: the file b that process A just gated appears un-checked to the next invocation, so Claude gets re-prompted with the Fact-Forcing gate for a file it already investigated. Not data loss in the strict sense, but a UX regression proportional to how often parallel tool calls land on the same session.
The READ_HEARTBEAT_MS throttle landed in #1439 narrows the read-path window (~99% of invocations are now pure reads), but the markChecked() path still exhibits the full race on every new gate, and the throttled heartbeat write in isChecked() still reproduces it when it does fire.
Candidate fixes
Three options, roughly ordered by engineering cost / platform portability trade-off:
1. flock-style advisory lock around read-modify-write
Wrap the entire loadState → mutate → saveState sequence in an exclusive lock on STATE_FILE (or a sibling .lock file). Portable options:
proper-lockfile — cross-platform, already handles stale locks, small footprint.
- Raw
fs.openSync(path, 'wx') sentinel file with retry — zero dependency, but requires reinventing stale-lock handling.
- Shell out to
flock on POSIX + powershell Wait-File on Windows — brittle, not recommended.
Pros: closes the race fully. Cons: new dependency or stale-lock logic; slight latency on every mutation.
2. Merge-style saves
Before fs.renameSync, re-read the current on-disk state one last time and merge it with the in-memory state: union of checked, max(last_active). Roughly:
function saveState(state) {
const disk = loadStateRaw(); // non-throwing
state.checked = Array.from(new Set([...(disk.checked || []), ...state.checked]));
state.last_active = Math.max(disk.last_active || 0, state.last_active);
// ... existing prune + atomic write
}
Pros: zero new dependencies, no lock contention, matches the "additive-only state" semantics this code already has. Cons: still not strictly serializable (two merges can race), but entries are never lost — just occasionally duplicated before the next prune, which is fine.
3. In-memory coordination via short-lived lease
Maintain a per-session lease file (e.g. state-<sid>.lease) that hook processes claim for, say, 100ms before mutating. Simpler than flock but less robust than merge-style saves on crash/ctrl-C.
Pros: cheap. Cons: needs correct cleanup; behaves worst-case the same as option 2 under crashes.
Recommendation
Option 2 (merge-style saves) feels like the right default for ECC's profile: no new dependency, tolerant of crashes, preserves the "additive entries + bounded pruning" semantics already in place. If/when GateGuard needs stronger invariants (e.g. session-level transactions across multiple files), revisit with option 1.
Severity
P2 — UX regression only, no data loss. Easy to reproduce under parallel tool calls but rare in typical interactive sessions.
Links
Follow-up to #1406, #1419, #1439. Splitting the state-layer concurrency concern into its own issue so it can be designed and reviewed independently of the hot-path performance fix that already landed via #1439.
Problem
scripts/hooks/gateguard-fact-force.jspersists session state (checkedarray +last_active) via a classic load-modify-save pattern with no mutual exclusion:saveState()is atomic at the filesystem level (temp file +fs.renameSync), so we never see a corrupted state file. But the read-modify-write window itself is unprotected, and the hook runs as a fresh Node process per tool invocation — so there is no shared memory between concurrent hook processes to coordinate on.Concurrent scenario
Two tool calls that share a
CLAUDE_SESSION_IDfire nearly simultaneously (parallel tool use, or rapid-fire edits on the same session):loadState()→{checked: [a], last_active: t0}loadState()→{checked: [a], last_active: t0}(same snapshot)markChecked(b)→ writes{checked: [a, b], last_active: t1}markChecked(c)→ writes{checked: [a, c], last_active: t2}(b is gone)User-visible effect: the file
bthat process A just gated appears un-checked to the next invocation, so Claude gets re-prompted with the Fact-Forcing gate for a file it already investigated. Not data loss in the strict sense, but a UX regression proportional to how often parallel tool calls land on the same session.The
READ_HEARTBEAT_MSthrottle landed in #1439 narrows the read-path window (~99% of invocations are now pure reads), but themarkChecked()path still exhibits the full race on every new gate, and the throttled heartbeat write inisChecked()still reproduces it when it does fire.Candidate fixes
Three options, roughly ordered by engineering cost / platform portability trade-off:
1.
flock-style advisory lock around read-modify-writeWrap the entire
loadState→ mutate →saveStatesequence in an exclusive lock onSTATE_FILE(or a sibling.lockfile). Portable options:proper-lockfile— cross-platform, already handles stale locks, small footprint.fs.openSync(path, 'wx')sentinel file with retry — zero dependency, but requires reinventing stale-lock handling.flockon POSIX +powershell Wait-Fileon Windows — brittle, not recommended.Pros: closes the race fully. Cons: new dependency or stale-lock logic; slight latency on every mutation.
2. Merge-style saves
Before
fs.renameSync, re-read the current on-disk state one last time and merge it with the in-memory state: union ofchecked,max(last_active). Roughly:Pros: zero new dependencies, no lock contention, matches the "additive-only state" semantics this code already has. Cons: still not strictly serializable (two merges can race), but entries are never lost — just occasionally duplicated before the next prune, which is fine.
3. In-memory coordination via short-lived lease
Maintain a per-session lease file (e.g.
state-<sid>.lease) that hook processes claim for, say, 100ms before mutating. Simpler than flock but less robust than merge-style saves on crash/ctrl-C.Pros: cheap. Cons: needs correct cleanup; behaves worst-case the same as option 2 under crashes.
Recommendation
Option 2 (merge-style saves) feels like the right default for ECC's profile: no new dependency, tolerant of crashes, preserves the "additive entries + bounded pruning" semantics already in place. If/when GateGuard needs stronger invariants (e.g. session-level transactions across multiple files), revisit with option 1.
Severity
P2 — UX regression only, no data loss. Easy to reproduce under parallel tool calls but rare in typical interactive sessions.
Links