Graph size multi shard table identifier by VardhanThigle · Pull Request #3697 · GoogleCloudPlatform/DataflowTemplates

VardhanThigle · 2026-04-15T12:22:57Z

Propagating DataSourceID as a part of Table Identifier for Multi-Shard Reading

This is the third child of #3684.

Design Decision

This PR adds dataSourceId to the TableIdentifier.
As part of Milestone-1, a Range now carries the TableIdentifier, which moves table-level details from PTransform construction to the data layer. As discussed in that milestone, the TableIdentifier was designed to be extensible so that it can contain the DataSourceID. This lets the reader select the data source that a given Range is associated with, and eliminates the need to embed (and shuffle) the entire dataSource as part of the range. The DataSourceProvider checked in as part of #3685 will be provided to the Reader at construction time; that map, coupled with the ID carried by the range as part of this PR, lets the Reader connect to the right source.
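The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `RangeSketch` and `ReaderSketch` are hypothetical stand-ins for the real Range, TableIdentifier, DataSourceProvider and Reader, and plain JDBC URL strings stand in for real data sources.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Range: it carries only the small id, not the data source itself.
final class RangeSketch {
  final String dataSourceId;
  final String tableName;

  RangeSketch(String dataSourceId, String tableName) {
    this.dataSourceId = dataSourceId;
    this.tableName = tableName;
  }
}

// Hypothetical stand-in for the Reader: the id-to-source map is handed over once, at construction.
final class ReaderSketch {
  private final Map<String, String> dataSourcesById;

  ReaderSketch(Map<String, String> dataSourcesById) {
    this.dataSourcesById = new HashMap<>(dataSourcesById);
  }

  // Resolves the source for a range using only the id that travels with the range.
  String resolve(RangeSketch range) {
    String source = dataSourcesById.get(range.dataSourceId);
    if (source == null) {
      throw new IllegalStateException("No data source registered for id " + range.dataSourceId);
    }
    return source;
  }
}
```

Only the short id is shuffled with each range in this sketch; the heavier data source objects stay with the reader.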

Key Changes:

TableIdentifier Enhancement: Added a dataSourceId() field to correctly distinguish tables with the same name that exist on different physical shards.
JdbcSchemaReference Integration: Propagated the dataSourceId into the schema reference layer.
Unique ID Generation: Implemented a robust ID generation mechanism in JdbcIOWrapperConfig to ensure every shard configuration has a unique identifier throughout the pipeline lifecycle.
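The PR text does not say how the unique ID is generated. One possible approach, shown purely as an assumption, is to keep an explicitly configured ID when there is one and fall back to a random UUID otherwise. `DataSourceIdSketch` and `resolveId` are hypothetical names, not part of JdbcIOWrapperConfig.

```java
import java.util.UUID;

// Hypothetical helper; the PR's actual generation mechanism may differ.
final class DataSourceIdSketch {
  private DataSourceIdSketch() {}

  // Keeps an explicitly configured id; otherwise falls back to a random UUID string.
  static String resolveId(String explicitId) {
    if (explicitId != null && !explicitId.isEmpty()) {
      return explicitId;
    }
    return UUID.randomUUID().toString();
  }
}
```

With a random fallback like this, the ID would have to be generated once when the configuration is built and then carried along with it, since regenerating it later would yield a different value.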

Rationale:

In a multi-shard environment, a table name (e.g., "Users") is no longer unique. We must qualify every table with its source database instance to ensure that Dataflow's grouping and splitting logic routes data to the correct destination.

Why it's Safe (Identity Collision Prevention)

Global Uniqueness: By combining shardID and tableName into a single identifier, we prevent collisions that would previously have caused Dataflow to aggregate data from different physical sources into a single logical partition.
Deterministic Hashing: Updated equals() and hashCode() in TableIdentifier to include the dataSourceId. This is critical for Beam's GroupByKey operations, ensuring that ranges from different shards are never incorrectly merged.
Nullable Robustness: Implemented strict null-checks and added @Nullable annotations where appropriate to ensure the system handles legacy configurations that may not have an explicit dataSourceId.
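As a rough illustration of the points above, a null-tolerant identifier whose equality covers both fields might look like the following. `TableIdSketch` is a hypothetical simplification; the real TableIdentifier may be built differently (for example with a builder) and has its own fields.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;

// Hypothetical simplification of a table identifier qualified by its data source.
final class TableIdSketch {
  private final String dataSourceId; // may be null for legacy configurations without an explicit id
  private final String tableName;

  TableIdSketch(String dataSourceId, String tableName) {
    this.dataSourceId = dataSourceId;
    this.tableName = Objects.requireNonNull(tableName, "tableName");
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof TableIdSketch)) {
      return false;
    }
    TableIdSketch other = (TableIdSketch) o;
    // Objects.equals treats two nulls as equal, so legacy identifiers still match each other.
    return Objects.equals(dataSourceId, other.dataSourceId) && tableName.equals(other.tableName);
  }

  @Override
  public int hashCode() {
    return Objects.hash(dataSourceId, tableName);
  }

  // Counts distinct identifiers, as a hash-based grouping would see them.
  static int countDistinct(List<TableIdSketch> ids) {
    return new HashSet<>(ids).size();
  }
}
```

In this sketch, "Users" on two different shards counts as two identifiers, while two legacy identifiers with a null dataSourceId and the same table name still compare equal. In Beam, grouping also depends on the key coder's encoded form, so the coder for such a key would need to encode the dataSourceId as well.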

Tests

The added tests verify that:

Tables with the same name but different shard IDs are treated as distinct entities.
Hash collisions are avoided.
Propagation through the configuration layer is consistent.

--

A quick note on PR size

This PR is divided into 2 commits. The first one just adds the identifier.
The second one ensures that the builder is correctly initialized in existing UTs. Because of the spotless formatting requirement, it touches many lines, though the substantive change is small.

gemini-code-assist · 2026-04-15T12:23:15Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a dataSourceId to the TableIdentifier to support multi-shard reading in the SourceDB to Spanner pipeline. By qualifying table identifiers with a data source ID, the system can now correctly distinguish between tables with the same name residing on different physical shards, preventing data aggregation errors and ensuring correct routing in Dataflow.

Highlights

TableIdentifier Enhancement: Added a dataSourceId field to TableIdentifier to uniquely identify tables across different physical shards.
Configuration Propagation: Updated JdbcIoWrapper and JdbcSchemaReference to propagate the dataSourceId, ensuring consistent identification throughout the pipeline.
Collision Prevention: Updated equals and hashCode methods in TableIdentifier to include dataSourceId, preventing identity collisions in multi-shard environments.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

codecov · 2026-04-15T12:35:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.67%. Comparing base (4999e32) to head (ae8cef8).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3697      +/-   ##
============================================
+ Coverage     57.66%   58.67%   +1.00%     
+ Complexity     2570     2124     -446     
============================================
  Files           533      505      -28     
  Lines         30172    29134    -1038     
  Branches       3305     3192     -113     
============================================
- Hits          17400    17095     -305     
+ Misses        11774    11061     -713     
+ Partials        998      978      -20

Components	Coverage Δ
spanner-templates	`73.77% <100.00%> (+<0.01%)`	⬆️
spanner-import-export	`∅ <ø> (∅)`
spanner-live-forward-migration	`80.77% <ø> (ø)`
spanner-live-reverse-replication	`77.46% <ø> (ø)`
spanner-bulk-migration	`89.33% <100.00%> (+<0.01%)`	⬆️
gcs-spanner-dv	`86.67% <ø> (ø)`

Files with missing lines	Coverage Δ
...source/reader/io/jdbc/iowrapper/JdbcIoWrapper.java	`97.89% <100.00%> (+0.03%)`	⬆️
...e/reader/io/jdbc/iowrapper/config/TableConfig.java	`100.00% <ø> (ø)`
...io/jdbc/uniformsplitter/range/TableIdentifier.java	`100.00% <100.00%> (ø)`

... and 28 files with indirect coverage changes


aasthabharill

LGTM

VardhanThigle added 2 commits April 15, 2026 12:11

Generating unique id and propagating to tableIdentifier

dc585ed

Initializing tableIdentifier in existing UTs

ae8cef8

VardhanThigle requested review from aasthabharill, bharadwaj-aditya, rohitwali and sm745052 April 15, 2026 12:22

pull-request-size bot added the size/XXL label Apr 15, 2026

VardhanThigle changed the title ~~Graph size multi shard table identifier~~ [Draft] Graph size multi shard table identifier Apr 15, 2026

VardhanThigle added the ignore-for-release label Apr 15, 2026

VardhanThigle marked this pull request as ready for review April 15, 2026 13:15

VardhanThigle requested a review from a team as a code owner April 15, 2026 13:15

VardhanThigle changed the title ~~[Draft] Graph size multi shard table identifier~~ Graph size multi shard table identifier Apr 15, 2026

VardhanThigle mentioned this pull request Apr 16, 2026

[Not-for-review] Graph Size Optimization Milestone-2 multi shard read. #3684

Draft

VardhanThigle requested a review from shreyakhajanchi April 16, 2026 10:40

aasthabharill approved these changes Apr 16, 2026

VardhanThigle merged commit f3c3064 into GoogleCloudPlatform:main Apr 17, 2026
17 of 18 checks passed

VardhanThigle merged 2 commits into GoogleCloudPlatform:main from VardhanThigle:graph-size-multi-shard-table-identifier
