Skip to content

HistoryPruner: self-bootstrap in TryPruneHistory instead of relying on SyncServer#11254

Open
AnkushinDaniil wants to merge 3 commits intomasterfrom
fix/history-pruner-bootstrap-without-sync-server
Open

HistoryPruner: self-bootstrap in TryPruneHistory instead of relying on SyncServer#11254
AnkushinDaniil wants to merge 3 commits intomasterfrom
fix/history-pruner-bootstrap-without-sync-server

Conversation

@AnkushinDaniil
Copy link
Copy Markdown
Contributor

@AnkushinDaniil AnkushinDaniil commented Apr 20, 2026

Changes

  • HistoryPruner.TryPruneHistory: remove the redundant pre-lock !_hasLoadedDeletePointer gate so the pruner self-bootstraps on first entry instead of depending on SyncServer.OnNewRange to call OldestBlockHeader at a %32 block boundary. The inner path at HistoryPruner.cs:309 already calls TryLoadDeletePointer() and handles failure — the outer check only served to suppress bootstrap until an external caller primed it.

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)

Testing

Requires testing

  • Yes

If yes, did you write tests?

  • Yes

Documentation

Requires documentation update

  • No

Requires explanation in Release Notes

  • No

Remarks

Root cause

On a node whose head never crosses a %32 block boundary after startup (frozen snapshot, peerless, no CL attached), history pruning never runs even when --History.Pruning=Rolling is configured. _hasLoadedDeletePointer only flips when TryLoadDeletePointer() runs, and in production that only happens via the OldestBlockHeader getter, which is only called from SyncServer.OnNewRange (SyncServer.cs:472) gated on block.Number % 32 == 0 (SyncServer.cs:469). Nothing inside HistoryPruner observes or controls that condition, so the pruner waits on an external event that may never fire.

Why safe

  • Concurrency unchanged: _currentlyPruning atomic still prevents concurrent pruning tasks; _pruneLock still serializes actual work inside TryPruneHistory.
  • Cost: one extra Monitor.TryEnter/Monitor.Exit per trigger during the brief startup window before bootstrap — one-time, not steady state. After first load, TryLoadDeletePointer() early-returns at line 531 with no DB read.
  • Pre-lock ShouldPruneHistory call remains valid before bootstrap: reads _deletePointer (default 0) and _lastPrunedTimestamp (default null); both branches return sensibly and let the flow fall into the lock path where bootstrap + prune happen atomically.

Relationship to #9677

The pre-lock gate was introduced in #9677 as a fast-path optimization. This PR removes only that one line and preserves every other improvement from #9677 (atomic _currentlyPruning, exclusive scheduling, TryScheduleTask rename)

@claude
Copy link
Copy Markdown

claude bot commented Apr 20, 2026

Claude finished @AnkushinDaniil's task in 2m 15s —— View job


Code Review

  • Gather context and read relevant files
  • Load applicable coding/robustness rules
  • Analyze the one-line fix in HistoryPruner.cs
  • Review the regression test in HistoryPrunerTests.cs
  • Check concurrency correctness
  • Post findings

Summary

This is a clean, minimal, well-reasoned fix. The root cause diagnosis is accurate and the solution correctly addresses it. No critical, high, or medium issues found.

No findings that block merge.


Fix analysis (HistoryPruner.cs:295)

The removed line:

!_hasLoadedDeletePointer ||

created a chicken-and-egg deadlock: _hasLoadedDeletePointer was only set inside TryLoadDeletePointer(), which was only reachable via:

  1. OldestBlockHeader getter → gated on SyncServer.OnNewRange at block.Number % 32 == 0, or
  2. The inner path in TryPruneHistory itself (line 308) — but gated by the outer !_hasLoadedDeletePointer check that was just removed.

After the fix, TryPruneHistory reaches the lock, calls TryLoadDeletePointer() (line 308), and self-bootstraps. The inner TryLoadDeletePointer() already handles the failure path gracefully and short-circuits on repeated calls (if (_hasLoadedDeletePointer) return true at line 530), so steady-state cost is negligible.

Concurrency correctness: All invariants preserved:

  • _pruneLock still serialises mutation of _deletePointer, _deletePointerHeader, _hasLoadedDeletePointer, and _lastPrunedTimestamp.
  • _currentlyPruning atomic still prevents concurrent task scheduling.
  • Monitor.TryEnter with 100ms timeout is still used — no deadlock risk introduced.

Pre-lock ShouldPruneHistory with unbootstrapped _deletePointer = 1: For Rolling mode this uses the head timestamp (not _deletePointer), so no false negative. For UseAncientBarriers mode, 1 < ancientBarrier is true on first entry, admitting the lock path where the inner ShouldPruneHistory re-validates after real bootstrap. This is a transient one-time permissive pass, not a correctness issue.


Test analysis (HistoryPrunerTests.cs:97–150)

The test correctly and precisely reproduces the failure: it omits the CheckOldestAndCutoff / OldestBlockHeader call that previously masked the bug by bootstrapping _hasLoadedDeletePointer as a side effect. The comment block at lines 99–103 clearly explains the regression scenario. Assertions cover pruned blocks, preserved blocks, genesis, and head — matching the existing test patterns exactly.


Low (non-blocking)

_hasLoadedDeletePointer is not volatile — the fast-path read in OldestBlockHeader:215 is outside any lock. A stale false read causes one extra Monitor.TryEnter/Monitor.Exit overhead per call, not incorrect behaviour (correctness is guaranteed by the re-check inside the lock). This is a pre-existing condition unchanged by this PR.


Overall: The fix is correct, safe, and appropriately tested. Recommend merge after the usual review from @LukaszRozmej given context ownership.

On a node whose head never crosses a %32 block boundary after startup
(e.g. frozen snapshot, peerless, no CL), history pruning never runs.

Root cause: the pre-lock gate in TryPruneHistory short-circuits on
!_hasLoadedDeletePointer, and the only production caller that flips
that flag is SyncServer.OnNewRange -> HistoryPruner.OldestBlockHeader,
which is gated on block.Number % 32 == 0. Nothing inside HistoryPruner
observes or controls that condition, so the pruner relies on an
external trigger that may never fire.

The inner lock path already calls TryLoadDeletePointer() and exits
cleanly on failure, so the outer gate is redundant. Removing it lets
TryPruneHistory self-bootstrap on the first successful entry (triggered
by the one-shot FireProcessingQueueEmpty at BlockchainProcessor startup).

The redundant gate was introduced in #9677 as a fast-path optimization;
this removes only that one line and preserves every other improvement
from that PR (atomic _currentlyPruning scheduling, exclusive pruning,
etc.). Added regression test that reproduces the frozen-head condition.
@AnkushinDaniil AnkushinDaniil force-pushed the fix/history-pruner-bootstrap-without-sync-server branch from 4e338d6 to d585c50 Compare April 20, 2026 13:43
Copy link
Copy Markdown
Member

@LukaszRozmej LukaszRozmej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude can HistoryPrunerTests be deduplicated with TestCase/TestCaseSource?

@claude
Copy link
Copy Markdown

claude bot commented Apr 20, 2026

Claude finished @LukaszRozmej's task in 4m 44s —— View job


Deduplicating HistoryPrunerTests with TestCase/TestCaseSource

  • Read the test file to identify duplication patterns
  • Load applicable coding/robustness rules
  • Implement TestCase/TestCaseSource refactor
  • Push changes

Yes, done — committed in 40f7d00. -194 / +57 lines across the test file.

What changed

Tests 1–4 → one [TestCaseSource] method (Prunes_history)

The four pruning tests (Can_prune_blocks_older_than_specified_epochs, Can_prune_without_prior_oldest_block_read, Can_prune_to_ancient_barriers, Prunes_up_to_sync_pivot) shared identical structure. They collapse into PruningCases() + Prunes_history(...) parameterized on:

Parameter Purpose
historyConfig pruning mode + retention config
syncPivot where SyncPivot is set
primeWithOldestRead whether to call CheckOldestAndCutoff before pruning (the false case is the regression for this PR)
expectedPruneBelow threshold for pruned vs preserved block assertions
finalCutoff expected CutoffBlockNumber in the final state check

SetName(...) on each TestCaseData preserves the original test names for discovery and filtering.

Tests 7–8 → one [TestCase] method (Validates_config)

Can_accept_valid_config used Substitute.For<ISpecProvider>() (which returns MinHistoryRetentionEpochs = 0 by default) while Can_reject_invalid_config used a real TestSpecProvider. Both collapse with explicit minHistoryRetentionEpochs parameters; TestSpecProvider is used in both cases with the same effect.

Kept separateCan_find_oldest_block (tests binary search via SetDeletePointerToOldestBlock, no per-block assertions) and Does_not_prune_when_disabled (only 10 blocks, no pruning, no SyncPivot set).
· Branch

github-actions bot and others added 2 commits April 20, 2026 15:46
Co-authored-by: Lukasz Rozmej <LukaszRozmej@users.noreply.github.com>
HistoryConfig.RetentionEpochs is uint; test parameter was int, causing CS0266 build failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants