Feat #70: Add `transform_columns` for column adjustments during diff by sarath-mec · Pull Request #71 · erezsh/reladiff

sarath-mec · 2025-03-31T05:51:38Z

Motivation:

Comparing data between different database systems (like Oracle and PostgreSQL) or even within the same system often requires minor, database-specific transformations to make column values truly comparable. Examples include adjusting timezones, applying string functions, or rounding numeric values.

Previously, this required creating temporary views or pre-transforming the data, which adds complexity and is impossible in restricted environments.

Solution:

This PR introduces a new transform_columns parameter to the TableSegment class. This parameter accepts a dictionary where:

Keys are the original column names in the table segment.
Values are raw SQL string expressions representing the transformation to be applied to that column during the diff process. The SQL syntax must be valid for the database associated with the specific TableSegment.

This allows users to specify minor adjustments directly within the reladiff configuration, eliminating the need for external setup.

Note: As transformations are typically specific to either the source or target database, this parameter is not overridden directly in diff_tables.

Examples of `transform_columns` Usage:

Here's how you might define transform_columns for different scenarios:

# Example for an Oracle TableSegment
oracle_transforms = {
    "AMOUNT": "ROUND(AMOUNT, 2)",
    "LEGACY_ID": "TO_CHAR(LEGACY_ID)",
    "LOB_DATA": "LENGTH(LOB_DATA)",
    "DESCRIPTION": "SUBSTR(DESCRIPTION, 1, 10)",
    "EVENT_TIMESTAMP": "CAST(EVENT_TIMESTAMP AT TIME ZONE 'UTC' AS TIMESTAMP)",
    "USER_NAME": "TRIM(USER_NAME)",
    "ACTIVE_FLAG": "CASE WHEN ACTIVE_FLAG = 'Y' THEN 1 ELSE 0 END"
}

# Example for a PostgreSQL TableSegment
postgres_transforms = {
    "amount": "ROUND(amount, 2)",
    "legacy_id": "CAST(legacy_id AS TEXT)", # or "legacy_id::TEXT",
    "lob_data": "LENGTH(lob_data)",
    "description": "SUBSTRING(description FROM 1 FOR 10)",
    "event_timestamp": "event_timestamp AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC'",
    "user_name": "TRIM(user_name)",
    "active_flag": "CAST(active_flag AS INTEGER)"
}

When creating TableSegment objects:

src_segment = TableSegment(..., transform_columns=oracle_transforms)
tgt_segment = TableSegment(..., transform_columns=postgres_transforms)

Implementation Details:

TableSegment:
- Added the transform_columns: Dict[str, str] attribute to store the transformation rules provided by the user.
HashDiffer:
- Modified the generation of expressions used for checksum calculation (_relevant_columns_repr) and final value fetching (get_values).
- If a column name exists in transform_columns, the provided SQL string is embedded using sqeleton.queries.Code() instead of the original column reference (this[col]).
- NormalizeAsString is applied to the result (either the original column or the transformed Code) to ensure consistent string representation for hashing and comparison.
JoinDiffer:
- Modified the _create_outer_join method.
- Within the OUTER JOIN query, for each compared column, it now checks if a transformation exists in transform_columns.
- If a transformation exists, sqeleton.queries.Code(transformation_string) is used; otherwise, the original column reference (a[c1]/b[c2]) is used.
- The comparison logic (is_distinct_from) and the selected output columns (a_cols, b_cols) now operate on the result of this potentially transformed expression, wrapped in NormalizeAsString using the original column's schema type for correct formatting.
EmptyTableSegment:
- Fixed AssertionError: Modified EmptyTableSegment.__getattr__ by adding "transform_columns" to the allowed attributes in the assert statement. This allows JoinDiffer to correctly access the (empty) transform_columns dictionary from the underlying TableSegment when dealing with an empty table, preventing the assertion failure.

Unified Output: Both HashDiffer and JoinDiffer will now output the transformed values for differing rows, providing a consistent representation of the data as it was compared.
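
The per-column selection rule described above can be sketched in plain Python. This is only an illustration of the lookup logic, not the code in this PR: the helper names `column_expressions` and `checksum_input` are hypothetical, and the `CAST(... AS TEXT)` wrapper is a simplified stand-in for what NormalizeAsString does.

```python
from typing import Dict, List


def column_expressions(columns: List[str], transform_columns: Dict[str, str]) -> List[str]:
    """Return one SQL expression per column: the user's transformation if one
    is registered for that column, otherwise the bare column name."""
    return [transform_columns.get(col, col) for col in columns]


def checksum_input(columns: List[str], transform_columns: Dict[str, str]) -> str:
    """Wrap every expression in a string cast, roughly mirroring the idea that
    normalization is applied to the transformed expression, not the raw column.
    (Hypothetical stand-in; the real normalization is type-aware.)"""
    exprs = column_expressions(columns, transform_columns)
    return ", ".join(f"CAST({expr} AS TEXT)" for expr in exprs)


postgres_transforms = {"amount": "ROUND(amount, 2)", "user_name": "TRIM(user_name)"}
print(column_expressions(["id", "amount", "user_name"], postgres_transforms))
print(checksum_input(["id", "amount"], postgres_transforms))
```

Columns without an entry in the dictionary pass through unchanged, so an empty dictionary leaves the existing behaviour untouched.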

erezsh · 2025-03-31T07:03:15Z

Btw you can run the tests locally. It should be pretty easy to set up with docker. (it only tests the basic dbs, but that's usually enough)

sarath-mec · 2025-04-01T12:56:03Z

Sure, working on it.

sarath-mec · 2025-04-02T18:04:16Z

Please ignore the commit. I am testing JoinDiffer locally.

erezsh · 2025-04-02T18:32:41Z

Okay, ignoring

sarath-mec · 2025-04-02T23:08:26Z

@erezsh The test cases had issues with JoinDiffer as I was only using HashDiffer.

The last two commits fix the tests, and I was able to run them locally.

Please check the changes and let me know if anything else is needed

erezsh · 2025-04-03T06:18:36Z

I don't understand, why is OverrideNormalizeAsString needed?

sarath-mec · 2025-04-04T03:30:23Z

I don't understand, why is OverrideNormalizeAsString needed?

@erezsh This is because is_distinct_from is not available in NormalizeAsString.

AttributeError: 'NormalizeAsString' object has no attribute 'is_distinct_from'

Introduced OverrideNormalizeAsString: this class was added so that values can be compared with .is_distinct_from() after normalization and any transformation. It inherits from LazyOps in sqeleton, which enables comparison methods on the normalized string representation. The comparison logic and the selected output columns (a_cols, b_cols) now use this class.

https://github.com/erezsh/reladiff/actions/runs/14164994160/job/39677702946

Note: If you think is_distinct_from can be added to the sqeleton library directly, we could do that. We can work on a PR for that as well.
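
To make the AttributeError and the fix concrete, here is a minimal sketch using stand-in classes. None of these classes are sqeleton's: `Normalized`, `ComparisonOps`, and `NormalizedWithOps` are hypothetical names that only mimic the pattern of gaining `is_distinct_from` by inheriting from a mixin.

```python
class Normalized:
    """Stand-in for an expression node that has no comparison helpers."""

    def __init__(self, sql: str) -> None:
        self.sql = sql


class ComparisonOps:
    """Stand-in for a mixin that supplies comparison methods to expression nodes."""

    def is_distinct_from(self, other: "Normalized") -> str:
        # Render a NULL-safe inequality between the two expressions.
        return f"({self.sql} IS DISTINCT FROM {other.sql})"


class NormalizedWithOps(ComparisonOps, Normalized):
    """Same node as Normalized, but with the comparison methods mixed in."""


plain = Normalized("a.amount")
try:
    plain.is_distinct_from(Normalized("b.amount"))
except AttributeError as exc:
    print(f"AttributeError: {exc}")

left = NormalizedWithOps("a.amount")
right = NormalizedWithOps("b.amount")
print(left.is_distinct_from(right))
```

Whether the method belongs in a subclass here or upstream in sqeleton is the design question raised in the note above; the sketch does not settle it.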

sarath-mec · 2025-04-24T02:03:41Z

@erezsh Can this be merged? Do you have anything else in mind?

erezsh · 2025-04-24T09:35:56Z

Something in the implementation doesn't seem right. I'll try to find time soon to look deeper into it.

sarath-mec · 2025-06-11T21:40:58Z

@erezsh I accidentally reopened the PR. I had made some more changes.

There is room to optimize, and the changes may be confusing, but HashDiff is well tested.

Created a new TableSegment method _get_transform_columns to support both Hash Diff and JoinDiffer
In HashDiff, added logic to honor transform_rules for Key Columns as well (Tested Well in my Hashdiff usecase)
JoinDiff already considers transform_rules for key columns

I validated this with the following case, and the generated query accounted for both transformations:

src_transform_columns = {}
tgt_transform_columns = {
    "employee_nbr": "CASE WHEN employee_nbr = 62230 THEN -1 ELSE employee_nbr END",
    "email_address": "CASE WHEN email_address = 'Ken.Brown@gfs.com' THEN 'sarath@gmail.com' ELSE email_address END"
}
src_tbl_segment = TableSegment(
    database=engine,
    table_path=src_table_path,
    key_columns=tuple(["employee_nbr"]),
    extra_columns=tuple(extra_cols),
    transform_columns=src_transform_columns,
    case_sensitive=True
    ).with_schema(refine=False, allow_empty_table=True)

tgt_tbl_segment = TableSegment(
    database=engine,
    table_path=tgt_table_path,
    key_columns=tuple(["employee_nbr"]),
    extra_columns=tuple(extra_cols),
    transform_columns=tgt_transform_columns,
    case_sensitive=True
    ).with_schema(refine=False, allow_empty_table=True)
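
The effect of transforming both a key column and a non-key column can be reproduced without reladiff. The sketch below uses Python's built-in sqlite3 module with made-up rows, and the helper `select_sql` is hypothetical; it only shows what such transforms do to a comparison, not how HashDiffer or JoinDiffer build their queries.

```python
import sqlite3

tgt_transform_columns = {
    "employee_nbr": "CASE WHEN employee_nbr = 62230 THEN -1 ELSE employee_nbr END",
    "email_address": "CASE WHEN email_address = 'Ken.Brown@gfs.com' "
                     "THEN 'sarath@gmail.com' ELSE email_address END",
}


def select_sql(table, columns, transforms):
    """Build a SELECT that substitutes the transform expression where one exists."""
    exprs = [f"{transforms.get(col, col)} AS {col}" for col in columns]
    return f"SELECT {', '.join(exprs)} FROM {table}"


conn = sqlite3.connect(":memory:")
# Illustrative rows only; both tables start out identical.
rows = [(62230, "a@example.com"), (10001, "Ken.Brown@gfs.com"), (10002, "c@example.com")]
for table in ("src", "tgt"):
    conn.execute(f"CREATE TABLE {table} (employee_nbr INTEGER, email_address TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)

columns = ["employee_nbr", "email_address"]
src_rows = set(conn.execute(select_sql("src", columns, {})))
tgt_rows = set(conn.execute(select_sql("tgt", columns, tgt_transform_columns)))

only_in_src = sorted(src_rows - tgt_rows)
only_in_tgt = sorted(tgt_rows - src_rows)
print("only in src:", only_in_src)
print("only in tgt:", only_in_tgt)
```

Because the key transform rewrites 62230 to -1, that row surfaces as a missing key on one side and an extra key on the other, while the email transform surfaces as a value change on an otherwise matching key.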

sarath-mec added 4 commits March 30, 2025 14:50

Fix erezsh#70 Add transform_columns

3a348e9

Fix erezsh#70 Updated transform_columns documentation

8888012

Fix erezsh#70 Added transform_columns logic to Join Differ

44d22ec

Fix erezsh#70 Removed unnecessary function

2039968

sarath-mec mentioned this pull request Mar 31, 2025

Option to add a column transform_rules in restricted environments to TableSegment #70

Open

sarath-mec changed the title ~~Feat #70: Add transform_columns for on-the-fly value adjustments during diff~~ Feat #70: Add transform_columns for column adjustments during diff Mar 31, 2025

Overriding NormalizeAsString to have is_distinct_from

db0b10b

Adding transform_columns to EmptyEmptyTableSegment.__getattr__

67e1fd2

erezsh reviewed Apr 4, 2025

Comment thread reladiff/joindiff_tables.py

Optimizing Code by removing

28031be

Created a new TableSegment method _get_transform_columns to support b…

61040e6

…oth Hash Diff and JoinDiffer In HashDiff, added logic to honor transform_rules for Key Columns as well (Tested Well in my Hashdiff usecase) JoinDiff already considers transform_rules for key columns

sarath-mec marked this pull request as draft June 11, 2025 21:30

sarath-mec added 2 commits June 11, 2025 16:30

Merge branch 'master' into param-transform_columns

66c6284

Merge branch 'param-transform_columns' of github.com:sarath-mec/relad…

bddd204

…iff into param-transform_columns

sarath-mec added 3 commits June 11, 2025 17:13

Adding IsDistinctFrom

0e563ee

Adding transform_cols in EmptyTableSegment to avoid assert

3960970

`attr in ("database", "key_columns", "key_types", "relevant_columns", "_schema", "transform_columns")`

_get_column_transforms added to avoid attr error

37bb41a

sarath-mec wants to merge 13 commits into erezsh:master from sarath-mec:param-transform_columns

3 participants
