Graph size multi shard table identifier#3697
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a dataSourceId to the TableIdentifier to support multi-shard reading in the SourceDB to Spanner pipeline. By qualifying table identifiers with a data source ID, the system can now correctly distinguish between tables with the same name residing on different physical shards, preventing data aggregation errors and ensuring correct routing in Dataflow. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3697 +/- ##
============================================
+ Coverage 57.66% 58.67% +1.00%
+ Complexity 2570 2124 -446
============================================
Files 533 505 -28
Lines 30172 29134 -1038
Branches 3305 3192 -113
============================================
- Hits 17400 17095 -305
+ Misses 11774 11061 -713
+ Partials 998 978 -20
🚀 New features to boost your workflow:
|
f3c3064
into
GoogleCloudPlatform:main
Propogating DataSourceID as a part of Table Identifier for Multi-Shard Reading
This is the third child of #3684 .
Design Decision
This PR adds
dataSourceIdto the tableIdentifer.As a part of Milestone-1, a
Rangenow has theTableIdentifierwhich helps move the table level details fromPtransformconstruction to the data layer. As was discussed in the same Milestone, the tableIdentifier was design to be extensible by allowing the DataSourceID to be contained in it. This helps in allowing the reader to correctly select the datasource that the given Range is associated and also eliminates the need to actually embed (and shuffle) the entire dataSource as a part of the range. The DataSourceProvider checked in as a part of #3685 will be provided to the Reader at construction time and that map coupled with the ID provided by the range as a part of this PR will help the Reader connect to the right source.Key Changes:
TableIdentifierEnhancement: Added adataSourceId()field to correctly distinguish tables with the same name that exist on different physical shards.JdbcSchemaReferenceIntegration: Propagated thedataSourceIdinto the schema reference layer.JdbcIOWrapperConfigto ensure every shard configuration has a unique identifier throughout the pipeline lifecycle.Rationale:
In a multi-shard environment, a table name (e.g., "Users") is no longer unique. We must qualify every table with its source database instance to ensure that Dataflow's grouping and splitting logic routes data to the correct destination.
Why it's Safe (Identity Collision Prevention)
shardIDandtableNameinto a single identifier, we prevent collisions that would previously have caused Dataflow to aggregate data from different physical sources into a single logical partition.equals()andhashCode()inTableIdentifierto include thedataSourceId. This is critical for Beam'sGroupByKeyoperations, ensuring that ranges from different shards are never incorrectly merged.@Nullableannotations where appropriate to ensure the system handles legacy configurations that may not have an explicitdataSourceId.Tests
The added tests verify that:
--
A quick note on PR size
This PR is divided into 2 commits - the first one just adds the identifier.
The second one ensures that the builder is correctly initialized in existing UTs - given the spotless requirement, this appears to have gotten into many lines (though in essentiality it's not)