Pool: support multiple DBs#2179

Open
kkuehlz wants to merge 1 commit into main from kkuehler/pool-freelist

Conversation

Contributor

@kkuehlz kkuehlz commented Apr 7, 2026

On main, a single database takes ownership of all the CNV and SEQ chunks in a pool, and the pool object is more of a shim than an allocator. This branch expands the pool's responsibility by moving the freelist into it: multiple DBs can now act as tenants, each requesting chunks and recycling them back into the shared pool.

Core Changes

  • Catalog: Each DB gets a db_id -> CNV layout entry in the pool footer. This allows for dynamic resolution of CNV chunks to support multiple DB instances.
  • The pool owns a freelist of SEQ chunks.
  • Per-DB SEQ limits: Each DB can optionally store a max_seq_chunks limit in its db_metadata.
  • monad-mpt: Support for multi-DB pools and per-DB chunk limits
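
As a rough illustration of the catalog idea in the first bullet, the footer could map each db_id to its CNV layout (a minimal sketch; cnv_layout_entry, pool_catalog, and the 8-slot capacity are illustrative names, not the real footer format):

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical catalog entry: where a DB's CNV chunks live in the pool.
struct cnv_layout_entry
{
    uint32_t db_id{UINT32_MAX}; // UINT32_MAX marks an empty slot
    uint32_t cnv_chunk_begin{0};
    uint32_t cnv_chunk_count{0};
};

// Fixed-size catalog stored in the pool footer (illustrative capacity).
struct pool_catalog
{
    std::array<cnv_layout_entry, 8> slots{};

    // Dynamic resolution of a DB's CNV chunks by id.
    std::optional<cnv_layout_entry> find(uint32_t db_id) const
    {
        for (auto const &e : slots) {
            if (e.db_id == db_id) {
                return e;
            }
        }
        return std::nullopt;
    }
};
```

The fixed-size array keeps the footer a plain POD region that can live in mmap'd storage.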

Migration Support

The core changes are here to support migrations. Consider two DBs, DB-primary and DB-migration. DB-migration may be allocated fewer SEQ chunks during the migration phase. Once the migration completes, DB-primary can be deleted, returning all of its chunks to the freelist, and DB-migration can have its usable chunks expanded to cover the whole device. In other words, the shared freelist allows dynamic up- and downsizing of DB partitions.

  • compaction: When a DB's chunk limit is reduced during live execution, compaction begins targeting the new disk size. Note that this is not immediate: the new size becomes the target of a P-FF equation that runs every block.
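
The migration flow above can be modeled with a toy shared freelist (names like toy_pool and delete_db are illustrative, not the real API): chunks move between the free pool and per-DB ownership, and deleting a DB returns everything it owned:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Toy model of the shared freelist described above. Chunk IDs move
// between a shared free pool and per-DB ownership lists.
struct toy_pool
{
    std::vector<uint32_t> freelist;
    std::map<uint32_t, std::vector<uint32_t>> owned; // db_id -> chunks

    explicit toy_pool(uint32_t total)
    {
        for (uint32_t i = 0; i < total; ++i) {
            freelist.push_back(i);
        }
    }

    // A tenant DB requests one chunk from the shared pool.
    bool allocate(uint32_t db_id)
    {
        if (freelist.empty()) {
            return false;
        }
        owned[db_id].push_back(freelist.back());
        freelist.pop_back();
        return true;
    }

    // Deleting a DB returns every chunk it owned to the shared freelist,
    // which another DB can then claim to grow its partition.
    void delete_db(uint32_t db_id)
    {
        for (uint32_t c : owned[db_id]) {
            freelist.push_back(c);
        }
        owned.erase(db_id);
    }
};
```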

Compatibility

  • Pool footer format is now MND1; old MND0 pools are rejected under create_if_needed instead of being silently reformatted.
  • db_metadata magic is now MONAD008.
  • Runtime open is now catalog-backed; pools must be initialized with monad-mpt --create so DB slots exist.
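
The version gate might look roughly like this (a sketch only; the magic strings come from the bullets above, but the function name and error text are illustrative):

```cpp
#include <cstring>
#include <string>

// Sketch of the compatibility check: an MND0 footer is rejected with an
// actionable error instead of being silently reformatted.
bool footer_is_supported(char const *magic, std::string &error)
{
    if (std::memcmp(magic, "MND1", 4) == 0) {
        return true;
    }
    if (std::memcmp(magic, "MND0", 4) == 0) {
        error = "MND0 pool detected: recreate with monad-mpt --create";
        return false;
    }
    error = "unrecognized pool footer magic";
    return false;
}
```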

CLI examples

# Create a pool with two DBs
monad-mpt --storage /dev/triedb --create \
  --num-dbs 2 \
  --root-offsets-chunk-count-db1 2 \
  --root-offsets-chunk-count-db2 2

# At creation, limit both DBs to 32GB
monad-mpt --storage /dev/triedb --create \
  --num-dbs 2 \
  --max-seq-chunks-db1 128 \
  --max-seq-chunks-db2 128

# Increase DB1 to 64GB
monad-mpt --storage /dev/triedb --set-max-seq-chunks 256

Copilot AI review requested due to automatic review settings April 7, 2026 20:35

@github-actions github-actions Bot left a comment


PR Review: Pool — support multiple DBs

This is a significant architectural change that moves the freelist from per-DB metadata into the pool layer, enabling multiple DBs to share a pool's SEQ chunks. The approach is sound — the pool becomes the true allocator and DBs become tenants. Here are my findings:


Issues

1. free_seq_chunk_count() traverses freelist without lock (potential infinite loop)
storage_pool.cpp:1034-1049 — free_seq_chunk_count() walks the freelist without acquiring the mutex. The docstring says "Reads without lock — safe for diagnostics", but if a concurrent free_seq_chunk() modifies next[] mid-traversal, you could follow a stale pointer into a cycle and loop forever. Consider either:

  • Adding a maximum iteration bound (count < total_chunks), or
  • Acquiring the lock for the traversal (it's already used for diagnostics only)
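
The first suggestion could be sketched as follows (illustrative standalone function, not the member's real signature; UINT32_MAX terminates the list):

```cpp
#include <cstdint>
#include <vector>

// Bound the lockless traversal so a stale pointer forming a cycle cannot
// loop forever: the walk stops after total_chunks hops regardless.
uint32_t bounded_free_count(
    std::vector<uint32_t> const &next, uint32_t head, uint32_t total_chunks)
{
    uint32_t count = 0;
    for (uint32_t cur = head; cur != UINT32_MAX && count < total_chunks;
         cur = next[cur]) {
        ++count;
    }
    return count; // at most total_chunks, even if the list is corrupted
}
```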

2. freelist_next memory layout overlaps with chunk_bytes_used
storage_pool.hpp:191-211 — Both chunk_bytes_used() and freelist_next() compute their position relative to this (the metadata footer). chunk_bytes_used occupies [this - N*4, this) and freelist_next occupies [this - 2*N*4, this - N*4). The freelist_header_t (mutex + head) sits below that. However, PER_CHUNK_META_BYTES is 2 * sizeof(uint32_t) but total_size() also adds sizeof(freelist_header_t). The chunks() formula divides by chunk_capacity + PER_CHUNK_META_BYTES but does not account for the freelist_header_t overhead in the divisor. For large pools this is negligible (header is ~44 bytes vs multi-MB chunks), but for very small test pools the freelist header could theoretically encroach into the last chunk's data region. Worth a comment or a static assertion.
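
One way to encode that as a static assertion, under the simplifying assumption that the header is subtracted from the device size before the division (all constants and the freelist_header_t stand-in here are illustrative, not the real layout):

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the real header (robust mutex + freelist head).
struct freelist_header_t
{
    uint64_t mutex_storage[5]; // placeholder for pthread_mutex_t
    uint32_t head;
};

constexpr size_t PER_CHUNK_META_BYTES = 2 * sizeof(uint32_t);

// Sketch of a chunks() formula that reserves the fixed header overhead
// up front, so the last chunk's data region can never be encroached.
constexpr size_t chunks(size_t device_bytes, size_t chunk_capacity)
{
    return (device_bytes - sizeof(freelist_header_t)) /
           (chunk_capacity + PER_CHUNK_META_BYTES);
}

// Compile-time guard against pathologically small test pools.
static_assert(chunks(1 << 20, 1 << 16) > 0, "pool too small for one chunk");
```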

3. init_freelist chains device-local chunk IDs starting from cnv, leaving the first cnv slots of freelist_next unused
storage_pool.cpp:1073 — for (size_t i = cnv; i + 1 < total; ++i) chains device-local IDs [cnv, total) into the freelist. This is correct because SEQ chunks start after CNV chunks within a device. But the freelist_next array is indexed [0, total) — the first cnv entries are unused/wasted. Not a bug, but worth a comment explaining the indexing.
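
With the requested comment, the chaining would read roughly like this (standalone sketch of the loop quoted above, not the actual member function):

```cpp
#include <cstdint>
#include <vector>

// freelist_next is indexed over ALL device-local chunk IDs [0, total),
// but only the SEQ range [cnv, total) participates in the freelist; the
// first `cnv` slots are deliberately unused padding.
std::vector<uint32_t> init_freelist(uint32_t cnv, uint32_t total)
{
    std::vector<uint32_t> next(total, UINT32_MAX);
    for (uint32_t i = cnv; i + 1 < total; ++i) {
        next[i] = i + 1; // chain i -> i+1; last SEQ chunk stays UINT32_MAX
    }
    return next;
}
```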

4. get_lower_bound_free_space is now O(N) in free chunks
trie.hpp:564-568 — get_lower_bound_free_space() now calls free_seq_chunk_count(), which traverses the entire freelist. The old implementation was O(1) (reading a cached counter). If this is called frequently (e.g., every block during compaction decisions), this could be a performance regression. Consider caching the count or using an atomic counter.
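
The atomic-counter alternative could look like this (illustrative names; the counter is maintained alongside the list so the read path stays O(1)):

```cpp
#include <atomic>
#include <cstdint>

// Keep the free count as an atomic updated on every allocate/free, so
// reads never traverse the list.
struct counted_freelist
{
    std::atomic<uint32_t> free_count{0};

    void on_free() { free_count.fetch_add(1, std::memory_order_relaxed); }

    void on_allocate() { free_count.fetch_sub(1, std::memory_order_relaxed); }

    // O(1) lower bound: may lag concurrent operations but never loops.
    uint32_t lower_bound() const
    {
        return free_count.load(std::memory_order_relaxed);
    }
};
```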

5. Missing bounds check in free_seq_chunk
storage_pool.cpp:1019 — free_seq_chunk(uint32_t global_seq_id) accesses chunks_[seq][global_seq_id] without validating that global_seq_id < chunks_[seq].size(). A corrupted or double-freed chunk ID could cause an out-of-bounds access. Consider adding MONAD_ASSERT(global_seq_id < chunks_[seq].size()).
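
Modeled with a throw instead of MONAD_ASSERT (a standalone sketch; the real code would assert before indexing the chunk table):

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Reject out-of-range (possibly corrupted or double-freed) chunk IDs
// before they are used to index into the per-device chunk table.
void check_seq_id(std::vector<int> const &seq_chunks, uint32_t global_seq_id)
{
    if (global_seq_id >= seq_chunks.size()) {
        throw std::out_of_range("free_seq_chunk: seq id out of range");
    }
}
```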

6. disk_usage() division by zero when max_seq_chunks == 0 and io->chunk_count() == 0
trie.hpp:517-523 and update_aux.cpp — If max_seq_chunks is 0 and io->chunk_count() somehow returns 0, this divides by zero. Unlikely in practice but the old code had the same risk; just noting it.
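
A defensive guard for that corner case could be as simple as (illustrative function; the choice of 0.0 for an empty pool is an assumption):

```cpp
#include <cstdint>

// Fall back to a defined answer when the capacity is zero, instead of
// dividing by zero.
double disk_usage_ratio(uint64_t used_chunks, uint64_t capacity_chunks)
{
    if (capacity_chunks == 0) {
        return 0.0; // empty pool: report 0% usage rather than crash
    }
    return static_cast<double>(used_chunks) /
           static_cast<double>(capacity_chunks);
}
```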

7. Comma operator trick in test fixture initializer lists
test_fixtures_base.hpp:673-680 and node_writer_test.cpp:183-189 — Using the comma operator to call pool.register_db_slot(...) inside the member initializer is clever but fragile. If someone reorders members or the initializer, the register_db_slot call could execute at the wrong time. A factory function or an explicit init step would be clearer, though I understand this is test code.
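
The factory-function alternative, sketched with stand-in types (fake_pool and make_registered_pool are hypothetical, not the fixture's real names):

```cpp
#include <cstdint>

// Stand-in for the real pool type used by the fixtures.
struct fake_pool
{
    uint32_t registered = 0;
    void register_db_slot(uint32_t) { ++registered; }
};

// Instead of a comma-operator trick inside a member initializer, a
// factory makes the ordering explicit: the slot is registered before the
// pool reference is handed to the dependent member.
fake_pool &make_registered_pool(fake_pool &pool, uint32_t db_id)
{
    pool.register_db_slot(db_id);
    return pool;
}
```

Reordering members can no longer silently move the register_db_slot call relative to the construction that depends on it.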


Observations (non-blocking)

  • Format version bump (MND0→MND1, MONAD007→MONAD008): Migration path is clear — old pools are rejected with an actionable error message rather than silently corrupted. Good.

  • pthread_mutex in mmap'd region: The robust mutex with EOWNERDEAD handling is the right choice for a process-shared mutex in persistent storage. The comment about bounded leak (one chunk per crash) is a reasonable tradeoff.

  • chunk_list::free returns 0 from num_chunks: This is a reasonable deprecation path — chunk_list::free no longer has meaning at the DB level since the freelist moved to the pool. However, callers that still reference chunk_list::free will silently get 0, which could mask bugs. Consider removing the free enum value entirely or making it a compile error, rather than returning 0.

  • Test coverage: The new pool_catalog_test.cpp is thorough — covers registration, reopening, validation (out-of-range, duplicate, overlap), format rejection, and the migration simulation. Well done.

  • CLI --num-dbs validation: Uses std::stoul which accepts negative numbers (they wrap around). Consider using a stricter parser or checking v >= 1 after parsing. Currently the v < 1 check catches this since the wrap-around produces a very large number, but it's fragile.
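
A stricter parser along the lines suggested (illustrative; digit-only, no sign or whitespace, range-checked against a caller-supplied maximum):

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Reject signs, non-digits, and out-of-range values outright, instead of
// relying on std::stoul's wrap-around being caught downstream.
std::optional<uint32_t> parse_num_dbs(std::string const &s, uint32_t max)
{
    if (s.empty() || s.size() > 9) {
        return std::nullopt; // 9 decimal digits cannot overflow uint32_t
    }
    uint32_t v = 0;
    for (char c : s) {
        if (c < '0' || c > '9') {
            return std::nullopt; // no '-', '+', or whitespace accepted
        }
        v = v * 10 + static_cast<uint32_t>(c - '0');
    }
    if (v < 1 || v > max) {
        return std::nullopt;
    }
    return v;
}
```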


Each seq chunk has a 2-byte state in pool metadata: {db_id, status}
where status is FREE/RESERVED/OWNED. Allocation and free follow a
two-phase protocol through RESERVED, enabling deterministic crash
recovery:

  allocate: FREE → RESERVED → (append to DB list) → OWNED
  free:     OWNED → RESERVED → (remove from DB list, destroy) → FREE

On startup, set_io reconciles any RESERVED chunk by checking whether
it appears in the DB's fast/slow list (promote to OWNED if found,
reclaim to FREE if not). Fresh DB init calls reclaim_all_chunks_for_db
to handle orphans from interrupted initialization.

Lock-free allocation via linear scan over chunk_states (~14KB for 7000
chunks). No mutex, no linked list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
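
The allocate path of the two-phase protocol described in the commit message can be sketched as a CAS over the per-chunk state words (a minimal illustration; chunk_state, reserve_chunk, and the enum values are stand-ins for the real code):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

enum : uint16_t { FREE = 0, RESERVED = 1, OWNED = 2 };

// One small state word per SEQ chunk, kept in pool metadata.
struct chunk_state
{
    std::atomic<uint16_t> status{FREE};
    std::atomic<uint16_t> db_id{0};
};

// Lock-free linear scan: CAS the first FREE chunk to RESERVED. The
// caller then appends it to the DB's list and marks it OWNED; a crash
// between those steps leaves a RESERVED chunk for startup reconciliation.
uint32_t reserve_chunk(std::vector<chunk_state> &states, uint16_t db)
{
    for (uint32_t i = 0; i < states.size(); ++i) {
        uint16_t expected = FREE;
        if (states[i].status.compare_exchange_strong(expected, RESERVED)) {
            states[i].db_id.store(db);
            return i;
        }
    }
    return UINT32_MAX; // no free chunk
}
```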
@kkuehlz kkuehlz force-pushed the kkuehler/pool-freelist branch from f8a3cdb to bb72159 Compare April 8, 2026 14:04