fix: force partition fields nullable in partitionValues_parsed schema#2420
fix: force partition fields nullable in partitionValues_parsed schema#2420DrakeLin wants to merge 2 commits intodelta-io:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2420 +/- ##
==========================================
- Coverage 88.35% 88.35% -0.01%
==========================================
Files 171 171
Lines 57025 57024 -1
Branches 57025 57024 -1
==========================================
- Hits 50383 50382 -1
Misses 4714 4714
Partials 1928 1928 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
scottsand-db
left a comment
There was a problem hiding this comment.
LGTM! Minor comments PTAL
dengsh12
left a comment
There was a problem hiding this comment.
Looks good! Just several questions
| /// and field types are the actual partition column data types. All fields are forced nullable | ||
| /// because partition values are derived from map lookups (`MAP_TO_STRUCT` over the | ||
| /// string-valued `partitionValues` map), which can return null for any missing key. |
There was a problem hiding this comment.
Seems not accurate? IIRC map keys are not missing. All fields are forced nullable here because we use the schema to transform all rows in checkpoint, and the not-add rows will not have partition values
| field.physical_name(column_mapping_mode).to_owned(), | ||
| field.data_type().clone(), | ||
| field.is_nullable(), | ||
| true, // Always nullable: values come from map lookups |
There was a problem hiding this comment.
This seems not the ideal fix? E.g. if somehow data is corrupted, and there is an add action with a null partition value in a non-null partition field, we will not error. Feel like the idea fix is to extra partition_value_parsed for add action only, with the nullability retained from table schema? Perhaps a less concern -- I'm okay with this as a short term fix if the idea fix is not easy to implement
There was a problem hiding this comment.
Agreed the schema can't enforce this on its own since it's shared across all row types. Filed #2439 to force the entire checkpoint output schema nullable as a potential followup
There was a problem hiding this comment.
I'm not pretty sure if we want all fields in checkpoint schema to be nullable -- seems loosing validations? e.g. if we make add.path nullable, there may be an add action without path(but not in this PR's scope, we can discuss later under #2439)
For current fix, I'm still wondering if we can retain the nullability from the table schema, and only run map_to_struct for add actions? Seems that will also fix the problem and defending data corruption. If that's complex I'm good with current fix
0fe95dd to
11f7cb7
Compare
|
closing in favor of github issue as long-term fix. |
What changes are proposed in this pull request?
Forces all partition fields to be nullable in
build_partition_values_parsed_schema. The checkpoint parquet format is sparse: each row is one action type, and all other action types' columns are null. Non-add rows have no partition values, sopartitionValues_parsedfields must be nullable regardless of the table schema.Without this fix, tables with non-nullable partition columns and
writeStatsAsStruct=truefail checkpoint creation with Arrow(InvalidArgumentError("Found unmasked nulls for non-nullable StructArray field ..."))`.The longer-term fix is to force the entire checkpoint output schema nullable (#2439).
How was this change tested?
New round trip integration test with non-nullable partition column