Graph size multi shard table identifier #3697

Merged
VardhanThigle merged 2 commits into GoogleCloudPlatform:main from VardhanThigle:graph-size-multi-shard-table-identifier
Apr 17, 2026

@VardhanThigle VardhanThigle commented Apr 15, 2026

Propagating DataSourceID as a part of Table Identifier for Multi-Shard Reading

This is the third child of #3684.

Design Decision

This PR adds dataSourceId to the TableIdentifier.
As part of Milestone-1, a Range now carries the TableIdentifier, which moves table-level details from PTransform construction to the data layer. As discussed in that milestone, the TableIdentifier was designed to be extensible so that it can contain the DataSourceID. This lets the reader select the data source that a given Range is associated with, and it eliminates the need to embed (and shuffle) the entire dataSource as part of the range. The DataSourceProvider checked in as part of #3685 will be provided to the Reader at construction time; that map, coupled with the ID that the range carries as of this PR, lets the Reader connect to the right source.
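
A minimal sketch of the lookup described above, under stated assumptions: `TableId`, `Range`, and `resolveJdbcUrl` are hypothetical stand-ins rather than the PR's actual classes, and a plain `Map` from ID to JDBC URL plays the role of the DataSourceProvider. It only illustrates that the range carries a small ID while the reader holds the map.

```java
import java.util.Map;

/** Hypothetical stand-ins; the PR's real TableIdentifier, Range and DataSourceProvider differ. */
final class ShardLookupSketch {

  /** A table name qualified by the ID of the data source (shard) it lives on. */
  static final class TableId {
    final String dataSourceId;
    final String tableName;

    TableId(String dataSourceId, String tableName) {
      this.dataSourceId = dataSourceId;
      this.tableName = tableName;
    }
  }

  /** A slice of a table. It carries only the small identifier, not the data source itself. */
  static final class Range {
    final TableId tableId;
    final long start;
    final long end;

    Range(TableId tableId, long start, long end) {
      this.tableId = tableId;
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Resolves the connection target for a range. The map stands in for the provider that the
   * reader receives at construction time; values are JDBC URLs here purely for brevity.
   */
  static String resolveJdbcUrl(Range range, Map<String, String> urlsByDataSourceId) {
    String url = urlsByDataSourceId.get(range.tableId.dataSourceId);
    if (url == null) {
      throw new IllegalStateException(
          "No data source registered for ID: " + range.tableId.dataSourceId);
    }
    return url;
  }
}
```

Because only the ID string travels with each range, the shuffled payload stays small regardless of how much configuration a data source holds.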

Key Changes:

  • TableIdentifier Enhancement: Added a dataSourceId() field to correctly distinguish tables with the same name that exist on different physical shards.
  • JdbcSchemaReference Integration: Propagated the dataSourceId into the schema reference layer.
  • Unique ID Generation: Implemented a robust ID generation mechanism in JdbcIOWrapperConfig to ensure every shard configuration has a unique identifier throughout the pipeline lifecycle.

Rationale:

In a multi-shard environment, a table name (e.g., "Users") is no longer unique. We must qualify every table with its source database instance to ensure that Dataflow's grouping and splitting logic routes data to the correct destination.
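
The effect of qualifying the key can be sketched with plain Java collections standing in for Dataflow's grouping. This is an illustration only: `QualifiedKeySketch` and its methods are hypothetical, each range is reduced to a `{shardId, tableName}` pair, and no Beam API is involved.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical illustration; each String[] is a {shardId, tableName} pair. */
final class QualifiedKeySketch {

  /** Groups by table name alone, so same-named tables from different shards collapse together. */
  static Map<String, List<String[]>> groupByTableName(List<String[]> ranges) {
    return ranges.stream().collect(Collectors.groupingBy(r -> r[1]));
  }

  /** Groups by the qualified (shardId, tableName) key, so each shard keeps its own group. */
  static Map<List<String>, List<String[]>> groupByQualifiedKey(List<String[]> ranges) {
    return ranges.stream().collect(Collectors.groupingBy(r -> List.of(r[0], r[1])));
  }
}
```

With two "Users" ranges from `shard-a` and `shard-b`, the first method yields one group and the second yields two.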


Why it's Safe (Identity Collision Prevention)

  • Global Uniqueness: By combining shardID and tableName into a single identifier, we prevent collisions that would previously have caused Dataflow to aggregate data from different physical sources into a single logical partition.
  • Deterministic Hashing: Updated equals() and hashCode() in TableIdentifier to include the dataSourceId. This is critical for Beam's GroupByKey operations, ensuring that ranges from different shards are never incorrectly merged.
  • Nullable Robustness: Implemented strict null-checks and added @Nullable annotations where appropriate to ensure the system handles legacy configurations that may not have an explicit dataSourceId.
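
A hand-written sketch of the equality contract these points describe, assuming a nullable `dataSourceId` for legacy configurations. `TableIdentifierSketch` is a hypothetical stand-in; the PR's actual TableIdentifier may be generated or structured differently.

```java
import java.util.Objects;

/** Hypothetical stand-in for an identifier whose identity includes the data source ID. */
final class TableIdentifierSketch {
  private final String dataSourceId; // may be null for legacy, single-source configurations
  private final String tableName;

  TableIdentifierSketch(String dataSourceId, String tableName) {
    this.dataSourceId = dataSourceId;
    this.tableName = Objects.requireNonNull(tableName, "tableName");
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof TableIdentifierSketch)) {
      return false;
    }
    TableIdentifierSketch other = (TableIdentifierSketch) o;
    // Objects.equals is null-safe, so two legacy identifiers without an ID still compare equal.
    return Objects.equals(dataSourceId, other.dataSourceId) && tableName.equals(other.tableName);
  }

  @Override
  public int hashCode() {
    // Same fields as equals(), keeping the equals/hashCode contract intact.
    return Objects.hash(dataSourceId, tableName);
  }
}
```

Two identifiers for "Users" on different shards compare unequal here, while two identifiers with the same shard ID and table name compare equal and share a hash code.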

Tests

The added tests verify that:

  • Tables with the same name but different shard IDs are treated as distinct entities.
  • Identifiers for same-named tables on different shards do not collide under equals() and hashCode().
  • Propagation through the configuration layer is consistent.

--

A quick note on PR size

This PR is divided into 2 commits: the first one just adds the identifier.
The second one ensures that the builder is correctly initialized in existing UTs. Because of the spotless formatting requirement, it touches many lines, though the substantive change is small.

@VardhanThigle VardhanThigle changed the title Graph size multi shard table identifier [Draft] Graph size multi shard table identifier Apr 15, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a dataSourceId to the TableIdentifier to support multi-shard reading in the SourceDB to Spanner pipeline. By qualifying table identifiers with a data source ID, the system can now correctly distinguish between tables with the same name residing on different physical shards, preventing data aggregation errors and ensuring correct routing in Dataflow.

Highlights

  • TableIdentifier Enhancement: Added a dataSourceId field to TableIdentifier to uniquely identify tables across different physical shards.
  • Configuration Propagation: Updated JdbcIoWrapper and JdbcSchemaReference to propagate the dataSourceId, ensuring consistent identification throughout the pipeline.
  • Collision Prevention: Updated equals and hashCode methods in TableIdentifier to include dataSourceId, preventing identity collisions in multi-shard environments.

@codecov

codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.67%. Comparing base (4999e32) to head (ae8cef8).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3697      +/-   ##
============================================
+ Coverage     57.66%   58.67%   +1.00%     
+ Complexity     2570     2124     -446     
============================================
  Files           533      505      -28     
  Lines         30172    29134    -1038     
  Branches       3305     3192     -113     
============================================
- Hits          17400    17095     -305     
+ Misses        11774    11061     -713     
+ Partials        998      978      -20     
| Components | Coverage Δ |
|---|---|
| spanner-templates | 73.77% <100.00%> (+<0.01%) ⬆️ |
| spanner-import-export | ∅ <ø> (∅) |
| spanner-live-forward-migration | 80.77% <ø> (ø) |
| spanner-live-reverse-replication | 77.46% <ø> (ø) |
| spanner-bulk-migration | 89.33% <100.00%> (+<0.01%) ⬆️ |
| gcs-spanner-dv | 86.67% <ø> (ø) |

| Files with missing lines | Coverage Δ |
|---|---|
| ...source/reader/io/jdbc/iowrapper/JdbcIoWrapper.java | 97.89% <100.00%> (+0.03%) ⬆️ |
| ...e/reader/io/jdbc/iowrapper/config/TableConfig.java | 100.00% <ø> (ø) |
| ...io/jdbc/uniformsplitter/range/TableIdentifier.java | 100.00% <100.00%> (ø) |

... and 28 files with indirect coverage changes


@VardhanThigle VardhanThigle marked this pull request as ready for review April 15, 2026 13:15
@VardhanThigle VardhanThigle requested a review from a team as a code owner April 15, 2026 13:15
@VardhanThigle VardhanThigle changed the title [Draft] Graph size multi shard table identifier Graph size multi shard table identifier Apr 15, 2026

@aasthabharill aasthabharill left a comment

LGTM

@VardhanThigle VardhanThigle merged commit f3c3064 into GoogleCloudPlatform:main Apr 17, 2026
17 of 18 checks passed