refactor: extract stateless-db crate and split pipeline/validator modules by flyq · Pull Request #122 · megaeth-labs/stateless-validator

flyq · 2026-04-21T14:38:40Z

Summary

Extract a new stateless-db crate from stateless-common, moving redb table definitions, read/write helpers, serialization, and ContractCache into a dedicated persistence layer.
Split the monolithic crates/stateless-core/src/pipeline.rs into a module (config, traits, fetcher, divergence, advancer, worker, tests) for clearer responsibility boundaries.
Split bin/stateless-validator/src/main.rs into main.rs (entry), app.rs (CLI/startup wiring), workers.rs (pipeline + signals + reporter), and lib.rs; integration tests move to tests/integration.rs.
Refresh AGENTS.md to describe the new layout and file map.
Bump PipelineConfig::fetcher_max_in_flight default from workers to 2 * workers, since fetchers are I/O-bound and should exceed the CPU-bound worker count to keep workers fed (bounded by --data-max-concurrent-requests / --witness-max-concurrent-requests).
Auto-create the --data-dir on startup in both stateless-validator and debug-trace-server, so operators no longer need to pre-create the directory.
Log the validated block range (start, end, count) at validator shutdown by snapshotting the canonical tip before the pipeline runs.
Enable the ansi feature of tracing-subscriber for colored terminal output, and promote [Worker] Successfully validated block from debug! to info!.
Add the rex4Time hardfork activation to test_data/mainnet/genesis.json.
Drop unused dependencies (serde, serde_json, eyre, thiserror, tempfile) from several crate manifests.

Test plan

cargo fmt --all --check
cargo clippy --workspace --all-targets --all-features
cargo sort --check --workspace --grouped
cargo test --workspace (unit + integration, including the updated test_pipeline_config_default_uses_cpu_count)
Run stateless-validator against a fresh (non-existent) --data-dir and confirm it is auto-created
Confirm [Main] Validated blocks this session log appears at shutdown with the expected start/end/count
Verify debug-trace-server startup with a non-existent --data-dir in local-cache mode

claude · 2026-04-21T14:39:54Z

The PR currently has no labels. Based on the diff, the enhancement label appears to be missing.

While the PR is primarily a refactor (extracting stateless-db, splitting pipeline.rs and main.rs), it also introduces several user-facing improvements that go beyond pure code reorganization:

Auto-create --data-dir on startup (operators no longer need to pre-create the directory)
Bump fetcher_max_in_flight default from workers to 2 * workers
Log validated block range at shutdown
Enable ANSI colored terminal output and promote block validation log to info!
Add rex4Time hardfork activation to genesis config

There is no dedicated refactor label in this repository, so enhancement is the closest match for this PR's scope.

codecov · 2026-04-21T14:41:14Z

Codecov Report

❌ Patch coverage is 85.21898% with 81 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.6%. Comparing base (11a81b3) to head (5fd690a).

Files with missing lines	Patch %	Lines
crates/stateless-core/src/pipeline/fetcher.rs	86.2%	33 Missing ⚠️
crates/stateless-core/src/pipeline/mod.rs	65.5%	31 Missing ⚠️
crates/stateless-core/src/pipeline/advancer.rs	85.5%	11 Missing ⚠️
crates/stateless-core/src/pipeline/divergence.rs	76.9%	3 Missing ⚠️
crates/stateless-core/src/pipeline/worker.rs	94.1%	2 Missing ⚠️
crates/stateless-core/src/evm_database.rs	80.0%	1 Missing ⚠️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

claude · 2026-04-21T14:43:41Z

Review

Clean, well-scoped refactoring. The three structural changes (new stateless-db crate, pipeline module split, main.rs → app.rs/workers.rs) all improve separation of concerns, and the logic is identical to before. Tests are faithfully migrated and extended.

A few notes:

Draft / unchecked test plan: One item is unchecked (Verify debug-trace-server startup with a non-existent --data-dir in local-cache mode). Mark ready and tick it before merge.
stateless-test-utils missing from AGENTS.md workspace table — see inline.
StoreError design — see inline for a design note on the stateless-core/redb coupling. Not a blocker.
fetcher_max_in_flight: 2 * workers change is sound — I/O-bound fetchers benefit from higher concurrency than CPU-bound workers, and the per-endpoint concurrency caps guard against provider overload.

claude · 2026-04-22T07:29:03Z

Review

Solid refactor. The crate extraction, module splits, and opaque Backend box in StoreError (which avoids leaking redb into the trait layer) are all the right calls. Operator improvements (auto-create data-dir, session block log, bounded startup retry) are good.

One issue to verify before merging:

bincode was moved from [dev-dependencies] to [dependencies] in crates/stateless-core/Cargo.toml. If it is only used under #[cfg(test)] (executor tests, the new pipeline/tests.rs), this is a spurious production dependency. Please confirm bincode is needed in non-test code, or revert it to dev-deps.

Pre-existing unresolved thread (AGENTS.md): stateless-test-utils is still absent from the workspace table despite still being a workspace member — please add it back.

claude · 2026-04-22T08:19:50Z

Good PR overall. The extraction of stateless-db, the pipeline module split, and the validator binary refactoring are all well-structured. The new integration tests add meaningful coverage. The FetcherState invariant documentation is particularly clear.

One inline comment below on a logging regression introduced by the prefix removal.

claude · 2026-04-22T08:25:57Z

Clean refactor. The stateless-db extraction and pipeline module split are well-structured, and the new unit test coverage for the pipeline stages is solid.

Three issues to address, one nit:

divergence.rs:58 — .expect() panics inside a function that returns Result<_, DivergenceError>. An empty chain is theoretically impossible at that call site, but panicking in library code is still wrong. The LocalChainCorrupt variant already exists; use it. (See inline comment for a one-liner fix.)

workers.rs:87-93 — Trailing periods on both signal-handler log messages ("shutting down." -> "shutting down"). New code should follow the logging guidelines from the start.

tests.rs:345-347 — test_find_divergence_catastrophic matches on result.unwrap_err().to_string().contains(...) even though DivergenceError::CatastrophicReorg is a matchable typed variant. Use matches!.

See inline comments for the exact suggestions.

claude · 2026-04-22T08:26:46Z

Good refactor overall — the stateless-db extraction is clean, the pipeline module split reads well, and the stateless-validator binary split into app/workers/chain_sync improves navigability significantly.

Two issues to address before merging:

Duplicate log (unresolved thread) — info!("Starting validation reporter") fires twice on startup: once at spawn site (workers.rs:55) and once inside validation_reporter (workers.rs:137). Drop the outer one.
ContractCache has no tests — new write-through cache logic with a three-tier read path and a NoopContractStore stateless mode; see inline comment.

One nit: the two retry loops in init_validator_db are copy-pasted; extracting a helper would cut ~15 duplicate lines.

Test plan has one unchecked item (debug-trace-server startup with non-existent data-dir); the create_dir_all code is in place, but manual verification before merge is worth completing.

claude · 2026-04-22T09:14:23Z

Clean refactor. All prior review comments have been addressed. A few small notes below, but overall this improves the codebase significantly — extracting stateless-db, splitting the pipeline module, and the ContractCache three-tier design are all sound.

claude · 2026-04-22T12:30:25Z

Large, well-organized refactoring. The crate split, pipeline module decomposition, StoreError typed-error boundary, and unbounded-retry RPC model are all sound. The new integration tests and ContractCache unit tests are solid. A few things:

Nit: on_rpc_complete(method, true, ...) — the bool parameter in RpcMetrics::on_rpc_complete is now permanently dead. With unbounded retry, the only terminal state for a logical call is success, so success=false is never passed. The on_rpc_retry path handles every failure. The trait signature now has a misleading parameter; consider dropping the bool in a follow-up.

Nit: The new debug-trace-server test plan checkbox is unchecked. The --data-dir auto-create for debug-trace-server in local-cache mode is worth confirming before merge — it's a one-line fs::create_dir_all path but it's the only item marked as not verified.

See inline comment on integration.rs.

claude · 2026-04-22T13:58:10Z

Review

Strong refactoring. The structural decompositions — stateless-db crate, pipeline module split, validator binary split — all reduce coupling cleanly and the new cache tests cover the important invariants. A few specific observations:

Infallible RPC client — the shift from Result to T throughout RpcClient and the centralized round_robin_with_backoff is a clean design. The round-level backoff semantics (escalate debug → warn at round 3, jitter, unbounded retry) and primary-failover for witnesses are well-documented and correctly tested. The metric recording change (record_request(true, ...) unconditionally) is correct given the new invariant.

ContractCache / Arc — the three-tier lookup chain (memory → disk → RPC with hash verification) is a meaningful correctness improvement over the previous approach. Having get_codes fail the whole batch on a single hash mismatch is stricter and right.

DB error classification in get_block_data_by_hash — distinguishing StoreError::MissingData (silent fallthrough to RPC) from other errors (warn + fallthrough) is a correctness improvement over the old let Ok(data) = ... that silently swallowed real backend errors.

Unresolved: std::mem::forget(temp_dir) in tests/integration.rs — replied to the existing thread. This is the only item to fix before merging.

claude · 2026-04-22T14:13:23Z

Good refactor overall — the crate split is clean, the pipeline module decomposition is clear, and previous review feedback has been addressed.

Two issues remain:

1. Unresolved: std::mem::forget(temp_dir) in integration tests — replied to the existing thread with a concrete fix.

2. Block-range log wrong on fresh start — see inline comment on workers.rs. When the validator has never run before, initial_tip = None (canonical tip is not set until the first block is advanced), so the (Some(before), Some(after)) pattern never matches and the log always prints 'No blocks validated this session' regardless of how many blocks were processed.

claude · 2026-04-22T16:27:58Z

+        let (a, b) = (hits_a.load(Ordering::Relaxed), hits_b.load(Ordering::Relaxed));
+        assert!(a >= 5, "primary (index 0) must be tried every call: got {a}");
+        // Backup should only be hit as fallback.
+        assert!(b <= a, "backup must not outpace primary: a={a}, b={b}");


nit: The doc comment says "the secondary is untouched while primary is up", but the test never has A healthy. After the first RPC-level failure, A returns "v0:AAAA" which also fails at decode, so both A and B fail on every round and both get hit equally (a ≈ b). b <= a is only trivially satisfied.

The missing case is: when A succeeds (returns valid data), B should receive zero hits. A focused test would use a real witness payload (or a mock that returns success) for A and assert b == 0 across multiple calls.

The test does verify that A is always tried first, which is the critical scheduling property — but the "untouched secondary" invariant is not actually exercised.

claude · 2026-04-22T16:28:21Z

Large, well-structured refactor. The crate extraction, pipeline split, and retry unification are all clean. Two items to address before merge:

PR description inaccuracy: The description says "promote [Worker] Successfully validated block from debug! to info!" but the code change in chain_sync.rs only removes the [Worker] prefix — the level stays debug!. Per the logging guidelines (validated blocks are high-frequency, not infrequent lifecycle events), debug! is actually the right level. Update the PR description to remove the incorrect promotion claim so the squash commit message is accurate.

Resolved threads: The mem::forget(temp_dir) leak is fixed in the new setup_test_db (returns TempDir). The initial_tip concern from the previous round is not a real bug — write_reset_to_anchor writes the anchor into CANONICAL_CHAIN, so get_canonical_tip() is always Some when run_with_signals is called; and without --start-block, app.rs exits early if canonical chain is empty. Both threads resolved.

flyq added 2 commits April 21, 2026 22:15

refactor4

27ae7b5

opt

f7ed7db

flyq requested review from Troublor, abelmega and vincent-k2026 as code owners April 21, 2026 14:38

flyq marked this pull request as draft April 21, 2026 14:39

claude Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread AGENTS.md

claude Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread crates/stateless-core/src/db.rs Outdated

claude Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread crates/stateless-core/src/pipeline/advancer.rs Outdated

opt

964670f