Feat #70: Add transform_columns for column adjustments during diff #71
sarath-mec wants to merge 13 commits into erezsh:master
Btw you can run the tests locally. It should be pretty easy to set up with Docker. (It only tests the basic DBs, but that's usually enough.)
Sure, working on it.
Please ignore the commit. I am testing JoinDiffer locally.
Okay, ignoring.
I don't understand, why is
@erezsh This is because `is_distinct_from` is not available in `NormalizeAsString`: `AttributeError: 'NormalizeAsString' object has no attribute 'is_distinct_from'`
https://github.com/erezsh/reladiff/actions/runs/14164994160/job/39677702946
Note: If you think `is_distinct_from` can be added to the sqeleton library directly, we could do that. We can work on a PR for that as well.
@erezsh Can this be merged? Do you have anything else in mind?
Something in the implementation doesn't seem right. I'll try to find time soon to look deeper into it.
…oth Hash Diff and JoinDiffer In HashDiff, added logic to honor transform_rules for Key Columns as well (Tested Well in my Hashdiff usecase) JoinDiff already considers transform_rules for key columns
…iff into param-transform_columns
@erezsh I accidentally reopened the PR; I had made some more changes. There is room to optimize, and the code may be confusing, but Hash Diff is well tested.
I have validated with the following case, where and the query was considering both cases
`attr in ("database", "key_columns", "key_types", "relevant_columns", "_schema", "transform_columns")`
Motivation:
Comparing data between different database systems (like Oracle and PostgreSQL) or even within the same system often requires minor, database-specific transformations to make column values truly comparable. Examples include adjusting timezones, applying string functions, or rounding numeric values.
Previously, achieving this required creating temporary views or pre-transforming data, which adds complexity and is impossible in restricted environments.
Solution:
This PR introduces a new `transform_columns` parameter to the `TableSegment` class. This parameter accepts a dictionary (`Dict[str, str]`) that maps a column of the `TableSegment` to the SQL string used in its place.
This allows users to specify minor adjustments directly within the `reladiff` configuration, eliminating the need for external setup.
Examples of `transform_columns` Usage:
Here's how you might define `transform_columns` for different scenarios:
When creating `TableSegment` objects:
Implementation Details:
TableSegment:
- `transform_columns: Dict[str, str]` attribute to store the transformation rules provided by the user.

HashDiffer:
- The change applies to `_relevant_columns_repr` and to final value fetching (`get_values`).
- For a column listed in `transform_columns`, the provided SQL string is embedded using `sqeleton.queries.Code()` instead of the original column reference (`this[col]`).
- `NormalizeAsString` is applied to the result (either the original column or the transformed `Code`) to ensure consistent string representation for hashing and comparison.

JoinDiffer:
- The change is in the `_create_outer_join` method.
- In the `OUTER JOIN` query, for each compared column, it now checks if a transformation exists in `transform_columns`.
- If one exists, `sqeleton.queries.Code(transformation_string)` is used; otherwise, the original column reference (`a[c1]` / `b[c2]`) is used.
- The comparison (`is_distinct_from`) and the selected output columns (`a_cols`, `b_cols`) now operate on the result of this potentially transformed expression, wrapped in `NormalizeAsString` using the original column's schema type for correct formatting.

EmptyTableSegment:
- Updated `EmptyTableSegment.__getattr__` by adding "transform_columns" to the allowed attributes in the assert statement. This allows JoinDiffer to correctly access the (empty) `transform_columns` dictionary from the underlying `TableSegment` when dealing with an empty table, preventing the assertion failure.
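The per-column choice described above (use the user's SQL string if one is registered, otherwise the plain column) can be illustrated with a small standalone sketch. This is not the code from this PR and does not import reladiff or sqeleton: plain strings stand in for `Code(...)` and `this[col]`, `NormalizeAsString` is omitted, and the helper names `column_expression` and `select_list`, as well as the column names and SQL snippets, are hypothetical.

```python
from typing import Dict, List

# Hypothetical rules: column names and SQL snippets are illustrative only.
transform_columns: Dict[str, str] = {
    "name": "UPPER(TRIM(name))",
    "amount": "ROUND(amount, 2)",
}


def column_expression(column: str, transforms: Dict[str, str]) -> str:
    """Return the registered SQL snippet for a column, else the column itself."""
    return transforms.get(column, column)


def select_list(columns: List[str], transforms: Dict[str, str]) -> str:
    """Build a SELECT list in which transformed columns keep their original name."""
    return ", ".join(
        f"{column_expression(c, transforms)} AS {c}" for c in columns
    )


print(select_list(["id", "name", "amount"], transform_columns))
# prints: id AS id, UPPER(TRIM(name)) AS name, ROUND(amount, 2) AS amount
```

Columns without a rule pass through unchanged, so an empty dictionary reproduces the untransformed column list, which is consistent with `transform_columns` defaulting to empty for tables that need no adjustment.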