PR Review: Pool — support multiple DBs
This is a significant architectural change that moves the freelist from per-DB metadata into the pool layer, enabling multiple DBs to share a pool's SEQ chunks. The approach is sound — the pool becomes the true allocator and DBs become tenants. Here are my findings:
Issues
1. free_seq_chunk_count() traverses freelist without lock (potential infinite loop)
storage_pool.cpp:1034-1049 — free_seq_chunk_count() walks the freelist without acquiring the mutex. The docstring says "Reads without lock — safe for diagnostics", but if a concurrent free_seq_chunk() modifies next[] mid-traversal, you could follow a stale pointer into a cycle and loop forever. Consider either:
- Adding a maximum iteration bound (count < total_chunks), or
- Acquiring the lock for the traversal (the function is already used for diagnostics only)
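A minimal sketch of the bounded-traversal option, assuming the freelist is a singly linked list stored in a next[] array. The names (next, head, NIL, bounded_free_count) are illustrative, not the real storage_pool API:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: count free chunks with an iteration bound so a
// concurrently mutated freelist cannot cause an infinite loop.
constexpr uint32_t NIL = UINT32_MAX; // end-of-list sentinel

uint32_t bounded_free_count(std::vector<uint32_t> const &next,
                            uint32_t head, uint32_t total_chunks)
{
    uint32_t count = 0;
    for (uint32_t cur = head; cur != NIL && count < total_chunks;
         cur = next[cur]) {
        ++count;
    }
    // If count == total_chunks the list was longer than physically
    // possible: a concurrent writer created a cycle. The caller treats
    // the result as a best-effort diagnostic either way.
    return count;
}
```

The bound turns a potential hang into a slightly stale count, which is acceptable for a diagnostics-only reader.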
2. freelist_next memory layout overlaps with chunk_bytes_used
storage_pool.hpp:191-211 — Both chunk_bytes_used() and freelist_next() compute their position relative to this (the metadata footer). chunk_bytes_used occupies [this - N*4, this) and freelist_next occupies [this - 2*N*4, this - N*4). The freelist_header_t (mutex + head) sits below that. However, PER_CHUNK_META_BYTES is 2 * sizeof(uint32_t) but total_size() also adds sizeof(freelist_header_t). The chunks() formula divides by chunk_capacity + PER_CHUNK_META_BYTES but does not account for the freelist_header_t overhead in the divisor. For large pools this is negligible (header is ~44 bytes vs multi-MB chunks), but for very small test pools the freelist header could theoretically encroach into the last chunk's data region. Worth a comment or a static assertion.
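A compile-time check along these lines could document the layout assumption. The sizes below are made up for illustration (the real freelist_header_t is larger); the point is that subtracting the fixed header before dividing guarantees the last chunk's data region can never overlap it:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative accounting only -- not the real storage_pool constants.
constexpr size_t PER_CHUNK_META_BYTES = 2 * sizeof(uint32_t);

// Chunks that fit once the fixed freelist header is carved out first.
constexpr size_t chunks_fitting(size_t device_bytes, size_t chunk_capacity,
                                size_t header_bytes)
{
    return (device_bytes - header_bytes) /
           (chunk_capacity + PER_CHUNK_META_BYTES);
}

// A tiny test pool where the naive division (no header subtraction)
// would claim 2 chunks fit, but the header would encroach on the
// second chunk's data region; the corrected formula yields 1.
static_assert(chunks_fitting(8220, 4096, 44) == 1);
static_assert(8220 / (4096 + PER_CHUNK_META_BYTES) == 2); // naive answer
```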
3. init_freelist uses device-local chunk IDs starting from cnv, but freelist stores device-local IDs
storage_pool.cpp:1073 — for (size_t i = cnv; i + 1 < total; ++i) chains device-local IDs [cnv, total) into the freelist. This is correct because seq chunks start after cnv chunks within a device. But the freelist_next array is indexed [0, total) — the first cnv entries are unused/wasted. Not a bug, but worth a comment explaining the indexing.
4. get_lower_bound_free_space is now O(N) in free chunks
trie.hpp:564-568 — get_lower_bound_free_space() now calls free_seq_chunk_count() which traverses the entire freelist. The old implementation was O(1) (reading a cached counter). If this is called frequently (e.g., every block during compaction decisions), this could be a performance regression. Consider caching the count or using an atomic counter.
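The suggested O(1) alternative could look like the following sketch; the struct and method names are hypothetical, and relaxed ordering is sufficient because the count only feeds a heuristic:

```cpp
#include <atomic>
#include <cstdint>

// Sketch: keep a relaxed atomic count alongside the freelist and
// adjust it inside alloc/free instead of re-walking the list.
struct seq_freelist_counter
{
    std::atomic<uint32_t> free_count{0};

    void on_free() noexcept
    {
        free_count.fetch_add(1, std::memory_order_relaxed);
    }

    void on_alloc() noexcept
    {
        free_count.fetch_sub(1, std::memory_order_relaxed);
    }

    // Replaces the O(N) traversal in free_seq_chunk_count(): a single
    // relaxed load, good enough for get_lower_bound_free_space().
    uint32_t count() const noexcept
    {
        return free_count.load(std::memory_order_relaxed);
    }
};
```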
5. Missing bounds check in free_seq_chunk
storage_pool.cpp:1019 — free_seq_chunk(uint32_t global_seq_id) accesses chunks_[seq][global_seq_id] without validating that global_seq_id < chunks_[seq].size(). A corrupted or double-freed chunk ID could cause an out-of-bounds access. Consider adding MONAD_ASSERT(global_seq_id < chunks_[seq].size()).
6. disk_usage() division by zero when max_seq_chunks == 0 and io->chunk_count() == 0
trie.hpp:517-523 and update_aux.cpp — If max_seq_chunks is 0 and io->chunk_count() somehow returns 0, this divides by zero. Unlikely in practice but the old code had the same risk; just noting it.
7. Comma operator trick in test fixture initializer lists
test_fixtures_base.hpp:673-680 and node_writer_test.cpp:183-189 — Using the comma operator to call pool.register_db_slot(...) inside the member initializer is clever but fragile. If someone reorders members or the initializer, the register_db_slot call could execute at the wrong time. A factory function or an explicit init step would be clearer, though I understand this is test code.
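One possible shape for the factory-function alternative, with Pool and register_db_slot reduced to stand-ins for the real types:

```cpp
// Hypothetical alternative to the comma-operator trick: a factory that
// registers the DB slot before the fixture ever sees the pool, so the
// call runs at a well-defined point regardless of member order.
struct Pool
{
    int registered = -1; // stand-in for real pool state
    void register_db_slot(int slot) { registered = slot; }
};

Pool make_registered_pool(int slot)
{
    Pool p;
    p.register_db_slot(slot); // runs exactly once, before construction of
                              // any fixture member that uses the pool
    return p;
}
```

A fixture member initialized as `pool_{make_registered_pool(0)}` then carries no hidden ordering dependency.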
Observations (non-blocking)
- Format version bump (MND0→MND1, MONAD007→MONAD008): Migration path is clear — old pools are rejected with an actionable error message rather than silently corrupted. Good.
- pthread_mutex in mmap'd region: The robust mutex with EOWNERDEAD handling is the right choice for a process-shared mutex in persistent storage. The comment about the bounded leak (one chunk per crash) is a reasonable tradeoff.
- chunk_list::free returns 0 from num_chunks: This is a reasonable deprecation path — chunk_list::free no longer has meaning at the DB level since the freelist moved to the pool. However, callers that still reference chunk_list::free will silently get 0, which could mask bugs. Consider removing the free enum value entirely or making it a compile error, rather than returning 0.
- Test coverage: The new pool_catalog_test.cpp is thorough — covers registration, reopening, validation (out-of-range, duplicate, overlap), format rejection, and the migration simulation. Well done.
- CLI --num-dbs validation: Uses std::stoul, which accepts negative numbers (they wrap around). Consider using a stricter parser or checking v >= 1 after parsing. Currently the v < 1 check catches this since the wrap-around produces a very large number, but it's fragile.
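A stricter parse could use std::from_chars, which rejects a leading '-' for unsigned targets instead of wrapping. The --num-dbs flag is from the PR; the helper itself is a hypothetical sketch:

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Sketch: parse --num-dbs strictly. std::from_chars on uint32_t
// rejects "-1" outright, and we require the whole argument to be
// consumed so trailing garbage like "3x" also fails.
std::optional<uint32_t> parse_num_dbs(std::string_view arg)
{
    uint32_t v = 0;
    auto [ptr, ec] = std::from_chars(arg.data(), arg.data() + arg.size(), v);
    if (ec != std::errc{} || ptr != arg.data() + arg.size() || v < 1) {
        return std::nullopt; // "-1", "0", "", "3x" all rejected here
    }
    return v;
}
```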
Each seq chunk has a 2-byte state in pool metadata: {db_id, status}
where status is FREE/RESERVED/OWNED. Allocation and free follow a
two-phase protocol through RESERVED, enabling deterministic crash
recovery:
allocate: FREE → RESERVED → (append to DB list) → OWNED
free: OWNED → RESERVED → (remove from DB list, destroy) → FREE
On startup, set_io reconciles any RESERVED chunk by checking whether
it appears in the DB's fast/slow list (promote to OWNED if found,
reclaim to FREE if not). Fresh DB init calls reclaim_all_chunks_for_db
to handle orphans from interrupted initialization.
Lock-free allocation via linear scan over chunk_states (~14KB for 7000
chunks). No mutex, no linked list.
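The protocol above can be modeled in a few lines. The packed {db_id, status} encoding and the CAS-based linear scan are a sketch of the description, not the actual pool code; names are illustrative:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

enum status : uint8_t { FREE = 0, RESERVED = 1, OWNED = 2 };

// 2-byte per-chunk state, packed into one atomic word.
struct chunk_state
{
    uint8_t db_id;
    uint8_t st;
};

constexpr uint16_t pack(chunk_state s)
{
    return uint16_t(uint16_t(s.db_id) << 8 | s.st);
}

template <size_t N>
struct seq_allocator
{
    std::array<std::atomic<uint16_t>, N> states{}; // all {0, FREE}

    // allocate phase 1: FREE -> RESERVED. The caller then appends the
    // chunk to its DB list and calls commit() (RESERVED -> OWNED).
    // Lock-free: one CAS per candidate slot in the linear scan.
    std::optional<size_t> reserve(uint8_t db_id)
    {
        for (size_t i = 0; i < N; ++i) {
            uint16_t expected = pack({0, FREE});
            if (states[i].compare_exchange_strong(
                    expected, pack({db_id, RESERVED})))
            {
                return i;
            }
        }
        return std::nullopt; // pool exhausted
    }

    // allocate phase 2: the chunk now appears in the DB list, so a
    // crash after this point is recovered as OWNED.
    void commit(size_t i, uint8_t db_id)
    {
        states[i].store(pack({db_id, OWNED}));
    }
};
```

On recovery, any chunk still RESERVED is resolved exactly as the description says: promoted if it made it into the DB's list, reclaimed to FREE otherwise.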
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from f8a3cdb to bb72159.
On main, a single database takes ownership of all the CNV and SEQ chunks in a pool. The pool object on main is more of a shim than an allocator. This branch expands the responsibility of the pool by moving the freelist into the pool. This supports multiple DBs as tenants, each of which can request chunks and recycle them back into a shared pool.

Core Changes
- A db_id -> CNV layout entry in the pool footer. This allows for dynamic resolution of CNV chunks to support multiple DB instances.
- A max_seq_chunks limit in each DB's db_metadata.
- monad-mpt: Support for multi-DB pools and per-DB chunk limits

Migration Support
The core changes are here to support migrations. Consider two DBs, DB-primary and DB-migration. DB-migration may be allocated a smaller number of SEQ chunks (during the migration phase). Once complete, DB-primary could be deleted, returning all of its chunks to the freelist, and DB-migration can have its usable chunks expanded to cover the whole device. In other words, the shared freelist allows for dynamic up- and downsizing of DB partitions.

Compatibility
CLI examples