Summary
Databend currently mixes several similar-but-different concepts in the codebase:
- field vs column (especially when nested/compound types exist)
- stable identifier (ColumnId) vs ordinal/index/position (usize)
- external/subsystem identifiers (Parquet/Iceberg metadata, Tantivy field id, index-file ordinals) vs Databend’s internal ids
This makes code harder to read and maintain, and it increases the risk of subtle bugs such as treating an ordinal as a stable id (e.g. idx as ColumnId), or passing usize through long call chains where the meaning changes over time.
This tracking issue documents the motivation, proposed terminology, and a staged roadmap to refactor field/column indexing toward stronger typing and explicit boundaries. The plan will be adjusted pragmatically based on real code constraints and review feedback.
Motivation / Why
- Prevent semantic bugs: ColumnId is a stable identifier; many usize values are ordinals that are not stable across projection/reordering/schema evolution.
- Nested types amplify ambiguity: a logical field may be a compound type, while storage/index layers often operate on leaf/physical columns.
- Cross-boundary safety: Parquet/Iceberg/Tantivy/index-files have their own ids/ordinals. Assuming they are “the same as ColumnId” without explicit mapping is fragile.
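The first point can be illustrated with a toy model. This is only a sketch: Field, initial_schema, drop_field, id_by_cast and id_by_lookup are hypothetical names that do not exist in Databend, and ColumnId is modelled as a plain integer alias purely for the demonstration. It shows an ordinal and a stable id agreeing by coincidence on a fresh schema and then diverging once a column is dropped.

```rust
// Toy model: each field carries a stable id that is assigned once and never reused.
// `ColumnId` is a plain alias here (an assumption for this sketch), which is
// exactly what lets the bad cast below compile without complaint.
type ColumnId = u32;

#[derive(Debug, Clone)]
struct Field {
    name: &'static str,
    column_id: ColumnId,
}

/// Builds a three-field schema whose ids happen to equal the ordinals.
fn initial_schema() -> Vec<Field> {
    vec![
        Field { name: "a", column_id: 0 },
        Field { name: "b", column_id: 1 },
        Field { name: "c", column_id: 2 },
    ]
}

/// Simulates dropping a column: the field disappears, the remaining
/// fields keep their ids but shift to new positions.
fn drop_field(schema: &[Field], name: &str) -> Vec<Field> {
    schema.iter().filter(|f| f.name != name).cloned().collect()
}

/// The anti-pattern: treating a position as if it were the stable id.
fn id_by_cast(idx: usize) -> ColumnId {
    idx as ColumnId
}

/// The safe lookup: read the id stored on the field at that position.
fn id_by_lookup(schema: &[Field], idx: usize) -> Option<ColumnId> {
    schema.get(idx).map(|f| f.column_id)
}
```

On the initial schema both functions agree for every position, which is why this kind of bug tends to survive tests that only use freshly created tables. After drop_field(&schema, "a"), position 0 holds the field with id 1, while the cast still reports 0.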
Proposed terminology
- Field: an entry in TableSchema.fields (TableField). May represent a nested/compound logical column.
- Leaf column: a physical/leaf column used in storage formats and block representations (often derived from fields via expansion).
- Stable ID: an identifier that remains stable across schema evolution (e.g. ColumnId(u32)).
- Ordinal / index / position: a usize-like index into a specific vector/view (TableSchema.fields, DataSchema.fields, a bloom-index column list, etc.). Not stable.
Proposed strong types (examples)
The goal is to use newtypes to make intent explicit and prevent accidental mixing:
- ColumnId(u32): stable id (keep existing semantics/wire format)
- TableFieldIndex(usize): index into TableSchema.fields
- DataFieldIndex(usize): index into DataSchema.fields (often equals DataBlock column position)
- LeafColumnIndex(usize): index into a leaf-column view (explicit “leaf” meaning)
- ParquetFieldId(u32): field id stored in parquet/arrow metadata ("PARQUET:field_id")
- InvertedIndexFieldId(u32): Tantivy schema field id (subsystem id)
- BloomIndexColumnOrdinal(usize): bloom index column ordinal (subsystem ordinal)
Key rule: external/subsystem ids must be mapped explicitly (no implicit equivalence).
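A minimal sketch of what such newtypes could look like is given here. The derive lists, the as_usize accessor and the field_name_at helper are assumptions made for illustration, not the proposed final API. Wrapping ColumnId in a struct is also an assumption of this sketch, since the non-goals below rule out forcing that change across the repo in one shot.

```rust
/// Stable identifier. Wrapping it in a struct is an assumption of this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct ColumnId(pub u32);

/// Position inside `TableSchema.fields`; not stable across schema changes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct TableFieldIndex(pub usize);

/// Field id read from parquet/arrow metadata; belongs to the file, not to Databend.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct ParquetFieldId(pub u32);

impl TableFieldIndex {
    /// The only place where the raw `usize` is exposed (hypothetical helper).
    pub fn as_usize(self) -> usize {
        self.0
    }
}

/// Hypothetical helper that accepts only a typed position.
pub fn field_name_at<'a>(names: &[&'a str], idx: TableFieldIndex) -> Option<&'a str> {
    names.get(idx.as_usize()).copied()
}

// The following calls would be rejected by the compiler, which is the point:
//
//     field_name_at(&names, ColumnId(1));       // expected TableFieldIndex, found ColumnId
//     field_name_at(&names, 1usize);            // expected TableFieldIndex, found usize
//     let id: ColumnId = ParquetFieldId(1);     // mismatched types
```

With plain integers all three commented-out lines compile and silently do the wrong thing; with newtypes the mistake is reported where it is made.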
Goals
- Make ColumnId vs ordinals unambiguous at the type level.
- Replace “semantic usize” plumbing with typed indices and typed schema helper APIs.
- Make Parquet/Iceberg/Tantivy/index-file boundaries explicit via conversion helpers.
- Keep changes incremental and reviewable (small PRs), avoiding a “big bang” refactor.
- Maintain compatibility where needed (e.g. for serialized plans), using serde(rename = "...") when renaming fields for clarity.
Non-goals
- No global redefinition of all ids (e.g. no forced newtype for ColumnId across the entire repo in one shot).
- No meta/proto/wire format changes.
- No intentional behavior changes unless an existing bug is discovered and covered by tests.
Refactoring principles
- Eliminate anti-patterns:
  - idx as ColumnId
  - variables named field_id that actually carry an ordinal
  - u32/usize casts that cross subsystem boundaries without explicit meaning
- Prefer typed APIs (incrementally): e.g. field_at(TableFieldIndex), index_of_field(...) -> TableFieldIndex, project_typed(...).
- Put subsystem types near the subsystem; keep broadly reusable types in shared crates.
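The typed APIs named above could take roughly the following shape. This is a sketch under stated assumptions: TableSchema and TableField here are simplified stand-ins, not the real Databend types, and the exact signatures (Option return values, the name-based lookup, the sample_schema fixture) are guesses made only to keep the example self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TableFieldIndex(pub usize);

/// Simplified stand-in for a schema field (assumption for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableField {
    pub name: String,
    pub column_id: u32,
}

/// Simplified stand-in for a table schema (assumption for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableSchema {
    pub fields: Vec<TableField>,
}

impl TableSchema {
    /// Typed positional access: the argument can only be a field position.
    pub fn field_at(&self, idx: TableFieldIndex) -> Option<&TableField> {
        self.fields.get(idx.0)
    }

    /// Name lookup that returns a typed position instead of a bare `usize`.
    pub fn index_of_field(&self, name: &str) -> Option<TableFieldIndex> {
        self.fields
            .iter()
            .position(|f| f.name == name)
            .map(TableFieldIndex)
    }

    /// Projection driven by typed positions; `None` if any position is out of range.
    pub fn project_typed(&self, indices: &[TableFieldIndex]) -> Option<TableSchema> {
        let fields = indices
            .iter()
            .map(|&i| self.field_at(i).cloned())
            .collect::<Option<Vec<_>>>()?;
        Some(TableSchema { fields })
    }
}

/// Fixture whose ids deliberately differ from the positions.
pub fn sample_schema() -> TableSchema {
    let mk = |name: &str, column_id: u32| TableField { name: name.to_string(), column_id };
    TableSchema { fields: vec![mk("a", 10), mk("b", 11), mk("c", 12)] }
}
```

Because such helpers can sit next to the existing usize-based methods, callers can migrate one call site at a time, which matches the incremental approach described above.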
Roadmap (staged PR series)
Exact slicing may change based on dependencies and review feedback.
- Foundation
  - Introduce core newtypes + conversion helpers.
  - Add typed schema helper APIs while keeping existing APIs.
- Indexing subsystems
  - Bloom index: replace “ordinal stored in ColumnId” patterns with an explicit ordinal type.
  - Inverted index: wrap Tantivy field id in a dedicated type and stop leaking raw u32/usize.
- Storage format boundaries
  - Parquet/Iceberg: treat "PARQUET:field_id" as ParquetFieldId, and require explicit mapping to/from ColumnId.
- Planner/executor cleanup
  - Rename misleading variables/maps (field_id → table_field_index where appropriate).
  - Adopt typed indices in hot paths (e.g. mutation pipeline) with compatibility preserved via serde rename if needed.
- Follow-ups (optional)
  - Doc: a short developer note describing “field vs leaf column” and “stable id vs ordinal”, plus common pitfalls.
  - Additional sweeps across remaining modules.
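For the storage format boundary stage, an explicit mapping could be sketched as below. ParquetFieldMapping and parse_parquet_field_id are hypothetical names, and the sketch assumes the "PARQUET:field_id" metadata value arrives as a decimal string; it does not describe how Databend currently reads that key.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ColumnId(pub u32);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ParquetFieldId(pub u32);

/// Hypothetical two-way table between file-level field ids and stable column ids.
#[derive(Debug, Default)]
pub struct ParquetFieldMapping {
    to_column: HashMap<ParquetFieldId, ColumnId>,
    to_parquet: HashMap<ColumnId, ParquetFieldId>,
}

impl ParquetFieldMapping {
    /// Records one correspondence in both directions.
    pub fn insert(&mut self, parquet: ParquetFieldId, column: ColumnId) {
        self.to_column.insert(parquet, column);
        self.to_parquet.insert(column, parquet);
    }

    /// File id -> stable id. `None` means "unknown", never "same number".
    pub fn column_id(&self, parquet: ParquetFieldId) -> Option<ColumnId> {
        self.to_column.get(&parquet).copied()
    }

    /// Stable id -> file id.
    pub fn parquet_field_id(&self, column: ColumnId) -> Option<ParquetFieldId> {
        self.to_parquet.get(&column).copied()
    }
}

/// Parses the raw metadata value, assuming it is a decimal string.
pub fn parse_parquet_field_id(raw: &str) -> Option<ParquetFieldId> {
    raw.trim().parse::<u32>().ok().map(ParquetFieldId)
}
```

The design choice worth noting is that a missing entry yields None rather than falling back to the numerically equal id, so an unmapped field surfaces as an error at the boundary instead of propagating as a plausible-looking ColumnId.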
Tracking checklist
Open questions
- Do we need a dedicated type to distinguish DataSchema index vs DataBlock column position (when they can diverge)?
- For nested fields, should we introduce a typed ColumnPath (instead of passing Vec<String> / name:subname strings around)?
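To make the second question concrete, one possible shape for a typed ColumnPath is sketched here. Everything in it is hypothetical: the validation rules (non-empty path, non-empty segments), the method names, and the choice of ':' as the separator, which is taken only from the name:subname notation mentioned above.

```rust
/// Hypothetical typed path to a (possibly nested) field, root segment first.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ColumnPath(Vec<String>);

impl ColumnPath {
    /// Builds a path, rejecting an empty path and empty segments.
    pub fn new<I, S>(segments: I) -> Option<Self>
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        let segs: Vec<String> = segments.into_iter().map(Into::into).collect();
        if segs.is_empty() || segs.iter().any(|s| s.is_empty()) {
            None
        } else {
            Some(ColumnPath(segs))
        }
    }

    /// Parses the `name:subname` string form once, at the boundary.
    pub fn parse_colon_separated(s: &str) -> Option<Self> {
        Self::new(s.split(':'))
    }

    /// Name of the top-level field; always present because `new` rejects empty paths.
    pub fn root(&self) -> &str {
        &self.0[0]
    }

    /// All segments, root first.
    pub fn segments(&self) -> &[String] {
        &self.0
    }

    /// True when the path points below a top-level field.
    pub fn is_nested(&self) -> bool {
        self.0.len() > 1
    }
}
```

A type like this would confine string parsing to a single constructor, so downstream code would receive an already validated path instead of re-splitting strings.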