feat(bigtable): add protobuf decoding to Bigtable Change Streams to BigQuery (#3572)
MattiasMTS wants to merge 7 commits into GoogleCloudPlatform:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the Bigtable Change Streams to BigQuery template by enabling the decoding of protobuf-encoded data directly within the pipeline. By providing a schema descriptor file, users can automatically transform binary Bigtable cell values into readable JSON in BigQuery. The implementation is backwards compatible and includes error handling for decode failures and for columns that do not require transformation.
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Add optional proto decoding support to the Bigtable Change Streams to BigQuery template. When configured, matching cell values are decoded from proto binary to JSON and written as STRING to the BigQuery value column. New parameters: protoSchemaPath, fullProtoMessageName, protoColumnFamily, protoColumn, preserveProtoFieldNames. All optional, backwards compatible. Uses DynamicMessage + JsonFormat.Printer from the protobuf-java library, with lazy synchronized init on workers for the non-serializable Descriptor.
Add a generic columnTransforms parameter for mapping column_family:column pairs to value transformers. Proto decode and column transforms share a unified TRANSFORMED_VALUE output path in BigQuery.

Supported transforms:
- BIG_ENDIAN_UINT64_TIMESTAMP_MS: 8-byte big-endian int64 → timestamp

Architecture: ValueTransformer interface + ValueTransformerRegistry with a two-level family→column map for O(1) lookup without per-element string concatenation. Transform priority: proto decode first, then columnTransforms. Includes proto decode parameters, options, pipeline wiring, BigQueryUtils formatter integration, and tests for both features.
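The two-level family→column lookup described above can be sketched as follows. This is an illustrative stand-in, not the template's actual API: `Function<byte[], String>` substitutes for the PR's `ValueTransformer` interface, and the class name is made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a two-level family -> column -> transformer registry.
public class RegistrySketch {
  private final Map<String, Map<String, Function<byte[], String>>> byFamily = new HashMap<>();

  public void register(String family, String column, Function<byte[], String> transformer) {
    byFamily.computeIfAbsent(family, f -> new HashMap<>()).put(column, transformer);
  }

  /** Two map gets per element; no per-element "family:column" key concatenation. */
  public Function<byte[], String> lookup(String family, String column) {
    Map<String, Function<byte[], String>> columns = byFamily.get(family);
    return columns == null ? null : columns.get(column);
  }
}
```

The design point is that a flat map keyed on `family + ":" + column` would allocate a new string per streamed element, while the nested map does two hash lookups on existing strings.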
Force-pushed from 0bb6d5e to 024da4e
/gemini review
Code Review
This pull request introduces support for decoding protobuf-encoded cell values and applying custom transformations to Bigtable change stream data before it is written to BigQuery. Key additions include a ProtoDecoder for schema-based JSON conversion and a ValueTransformerRegistry for handling specific data types like big-endian uint64 timestamps. The feedback identifies a potential issue in ValueTransformerRegistry.java where parsing configuration strings with split(":") may fail if column qualifiers contain colons, and suggests a more robust parsing approach using first and last index lookups.
```java
String[] parts = trimmed.split(":");
if (parts.length != 3) {
  throw new IllegalArgumentException(
      "Invalid columnTransforms entry '"
          + trimmed
          + "'. Expected format: column_family:column:TRANSFORM_TYPE");
}
String family = parts[0];
String column = parts[1];
String type = parts[2];
```
Using split(":") to parse the configuration string is fragile because Bigtable column qualifiers can contain colons. Since the column_family name is restricted and cannot contain colons, and the TRANSFORM_TYPE is the last segment, it is safer to find the first and last colons to extract the components. This ensures that column qualifiers containing colons are handled correctly.
Suggested change:

```java
int firstColon = trimmed.indexOf(':');
int lastColon = trimmed.lastIndexOf(':');
if (firstColon == -1 || firstColon == lastColon) {
  throw new IllegalArgumentException(
      "Invalid columnTransforms entry '"
          + trimmed
          + "'. Expected format: column_family:column:TRANSFORM_TYPE");
}
String family = trimmed.substring(0, firstColon);
String column = trimmed.substring(firstColon + 1, lastColon);
String type = trimmed.substring(lastColon + 1);
```
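A quick illustration of why the suggestion matters (the entry value and qualifier are made up):

```java
public class ColonParsingDemo {
  public static void main(String[] args) {
    // Hypothetical entry whose column qualifier itself contains a colon.
    String entry = "cf:metrics:v2:BIG_ENDIAN_UINT64_TIMESTAMP_MS";

    // split(":") yields 4 parts, so the parts.length != 3 check rejects it.
    System.out.println(entry.split(":").length); // 4

    // First/last-colon parsing recovers the qualifier intact.
    int first = entry.indexOf(':');
    int last = entry.lastIndexOf(':');
    System.out.println(entry.substring(0, first));        // cf
    System.out.println(entry.substring(first + 1, last)); // metrics:v2
    System.out.println(entry.substring(last + 1));        // BIG_ENDIAN_UINT64_TIMESTAMP_MS
  }
}
```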
Please also add a new test to BigtableChangeStreamsToBigQueryIT.
Codecov Report

❌ Patch coverage is 0.00%. Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@             Coverage Diff             @@
##               main    #3572     +/-  ##
============================================
- Coverage     52.12%   52.08%   -0.05%
- Complexity     5644     5646       +2
============================================
  Files          1040     1043       +3
  Lines         63118    63181      +63
  Branches       6922     6934      +12
============================================
+ Hits          32903    32905       +2
- Misses        27981    28044      +63
+ Partials       2234     2232       -2
```
stankiewicz left a comment:
Great contribution! A few comments added.
```java
@Default.String("")
String getFullProtoMessageName();

void setFullProtoMessageName(String value);
```
thanks for being consistent with src/main/java/com/google/cloud/teleport/v2/templates/PubsubProtoToBigQuery.java
| + "(8-byte big-endian unsigned 64-bit integer as Unix epoch milliseconds, " | ||
| + "converted to a timestamp string).") | ||
| @Default.String("") | ||
| String getColumnTransforms(); |
There is writeNumericTimestamps, which converts int64 to timestamp for a limited set of columns. In your change, you are not only converting to some format (String in this case), but you are also changing endianness. Why not convert to a timestamp type? Why not little-endian uint? This looks like a UDF to me with very limited scope that would require releasing a new template whenever you want to add a new transform.
Would a JS UDF be a good alternative?
```java
  return null;
}
try {
  long millis = ByteBuffer.wrap(bytes).getLong();
```
Add .order(ByteOrder.BIG_ENDIAN) to make it explicit and make the code self-documenting.
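The decode the reviewer is describing can be sketched end-to-end like this (the class and method names are illustrative; only the ByteBuffer/Instant calls reflect the actual transform):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

public class TimestampDecodeSketch {
  // Decode 8 bytes as a big-endian epoch-millis long; the explicit order()
  // call documents intent even though BIG_ENDIAN is ByteBuffer's default.
  static String decode(byte[] bytes) {
    long millis = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getLong();
    return Instant.ofEpochMilli(millis).toString();
  }

  public static void main(String[] args) {
    byte[] epoch = ByteBuffer.allocate(8).putLong(0L).array();
    System.out.println(decode(epoch)); // 1970-01-01T00:00:00Z
  }
}
```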
```java
/**
 * Parses a comma-separated transform configuration string.
 *
 * @param config format: "family:column:TYPE,family2:column2:TYPE2"
```
TYPE is confusing; TRANSFORM_TYPE is used in other parts of the javadoc.
```java
}

Map<String, Map<String, ValueTransformer>> transformersByFamily = new HashMap<>();
for (String entry : config.split(",")) {
```
Column names can be any value and can contain commas. Maybe state this limitation somewhere.
```java
private static ValueTransformer createTransformer(String type) {
  switch (type) {
    case "BIG_ENDIAN_UINT64_TIMESTAMP_MS":
      return new BigEndianTimestampTransformer();
    default:
      throw new IllegalArgumentException(
          "Unknown transform type '"
              + type
              + "'. Supported types: BIG_ENDIAN_UINT64_TIMESTAMP_MS");
  }
}
```
While the transformers look small, it's cleaner and more efficient to reuse them. Transformers are stateless and thread-safe (maybe worth writing this somewhere), so you can implement a memoization strategy:

```java
private static final Map<String, ValueTransformer> TRANSFORMER_CACHE = new ConcurrentHashMap<>();

private static ValueTransformer createTransformer(String type) {
  return TRANSFORMER_CACHE.computeIfAbsent(type, t -> {
    switch (t) {
      case "BIG_ENDIAN_UINT64_TIMESTAMP_MS":
        return new BigEndianTimestampTransformer();
      default:
        throw new IllegalArgumentException("Unknown transform type '" + t + "'");
    }
  });
}
```

(A ConcurrentHashMap is used here because a plain HashMap's computeIfAbsent is not safe under concurrent access.)
```java
boolean hasProtoColumn = !StringUtils.isBlank(options.getProtoColumn());

if (hasProtoSchema || hasProtoMessage || hasProtoColumnFamily || hasProtoColumn) {
  if (!hasProtoSchema || !hasProtoMessage || !hasProtoColumnFamily || !hasProtoColumn) {
```
Difficult to read; maybe:

```java
boolean anyProtoSet = hasProtoSchema || hasProtoMessage || hasProtoColumnFamily || hasProtoColumn;
boolean allProtoSet = hasProtoSchema && hasProtoMessage && hasProtoColumnFamily && hasProtoColumn;
if (anyProtoSet && !allProtoSet) { throw new IllegalArgumentException( [..]
```
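A complete, self-contained sketch of the validation being suggested (the class and method names are made up; the exception message mirrors the template's existing wording):

```java
public class ProtoOptionValidationSketch {
  // All four proto options must be set together, or none of them.
  static void validate(
      boolean hasProtoSchema,
      boolean hasProtoMessage,
      boolean hasProtoColumnFamily,
      boolean hasProtoColumn) {
    boolean anyProtoSet =
        hasProtoSchema || hasProtoMessage || hasProtoColumnFamily || hasProtoColumn;
    boolean allProtoSet =
        hasProtoSchema && hasProtoMessage && hasProtoColumnFamily && hasProtoColumn;
    if (anyProtoSet && !allProtoSet) {
      throw new IllegalArgumentException(
          "When using protobuf decoding, all of protoSchemaPath, fullProtoMessageName, "
              + "protoColumnFamily, and protoColumn must be specified.");
    }
  }
}
```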
| "When using protobuf decoding, all of protoSchemaPath, fullProtoMessageName, " | ||
| + "protoColumnFamily, and protoColumn must be specified."); | ||
| } | ||
| protoDecoder = |
```java
ProtoDecoder protoDecoder = null;
if (allProtoSet) { protoDecoder = new ProtoDecoder( [..]
```
```java
@TemplateParameter.Text(
    order = 14,
    optional = true,
    description = "Column qualifier containing proto values",
```
Grouping multiple pieces of data into a single proto column is a best practice, but it's not always the case; what if you have a second proto in another column family or another column?

Pub/Sub Proto to BQ has a single proto, so it's fine that its config covers only a single proto message. In Bigtable there is a limit of 10 schema bundles per table, which suggests that multiple proto types may be used. Eventually you may end up creating another transform_type.
Address review comments:
- Multi-proto support: PROTO_DECODE(package.MessageName) transform type supports multiple proto columns with different message types
- Fix colon parsing with indexOf/lastIndexOf (column qualifiers can contain colons)
- Add transformer memoization for shared instances
- Rename TYPE to TRANSFORM_TYPE consistently
- Explicit ByteOrder.BIG_ENDIAN in timestamp transformer
- Simplify validation with anyProtoSet/allProtoSet pattern
- Legacy proto options auto-translate to columnTransforms entries
- Document comma limitation and JS UDF alternative
… decode
- testBigtableChangeStreamsToBigQueryColumnTransform: verifies BIG_ENDIAN_TIMESTAMP transform converts 8-byte uint64 to timestamp
- testBigtableChangeStreamsToBigQueryProtoDecodeViaTransform: verifies PROTO_DECODE() transform type with programmatically built descriptor
- Delete dead ProtoDecoder.java (replaced by ProtoDecodeTransformer)
- Use double-checked locking in ensureInitialized() to avoid synchronized on every record in the hot path
- Use two-level map (family -> column -> transformer) in ValueTransformerRegistry to avoid string concat per record
- Deduplicate Mod constructors (3-arg delegates to 4-arg)
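The double-checked locking mentioned above can be sketched as follows; the class name and the plain `Object` field are illustrative stand-ins for the non-serializable Descriptor/Printer state built lazily on workers:

```java
public class LazyInitSketch {
  // Stand-in for the non-serializable decoder state; volatile is required
  // so the second (unlocked) read observes a fully constructed object.
  private transient volatile Object decoder;

  Object ensureInitialized() {
    Object local = decoder;
    if (local == null) {          // first check: no lock on the hot path
      synchronized (this) {
        local = decoder;
        if (local == null) {      // second check: only one thread builds
          local = buildDecoder();
          decoder = local;
        }
      }
    }
    return local;
  }

  private Object buildDecoder() {
    return new Object(); // placeholder for the expensive Descriptor/Printer build
  }
}
```

After the first call, every subsequent call takes the lock-free fast path, which is the point of the change.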
Let me know when it's ready.
Yes, absolutely, I'll let you know, boss! It's not ready yet. Sorry, been swamped with other stuff 😿 I'll try to focus on getting this in shortly with high quality.
…ement overflow
Bigtable cell values can reach 100 MB (vs Pub/Sub's 10 MB source cap),
so proto→JSON decoding can inflate a single row past Dataflow Windmill's
80 MB per-element commit limit and freeze the partition. Bound the decoded
output and route overflows to the existing severe DLQ with metadata only.
- Add maxDecodedValueBytes template option (default 10_000_000 bytes to
match BigQuery Storage Write API row-size limit).
- New Utf8BoundedAppendable + OversizedJsonException that track exact UTF-8
byte counts per code point and abort mid-serialization on overflow.
- ProtoDecodeTransformer.transformBounded: cheap raw-size pre-check plus
bounded JsonFormat.appendTo; returns a TransformResult { SUCCESS,
DECODE_ERROR, OVERSIZED, NO_TRANSFORMER } value type. The existing
unbounded transform(byte[]) entry point is preserved.
- ValueTransformerRegistry.transformBounded dispatches to the bounded
proto path and falls back to the unbounded path with a best-effort
size check for non-proto transformers.
- Mod carries a transient TransformResult the pipeline inspects; new
5-arg constructor threads maxDecodedValueBytes through (old 4-arg
constructor delegates with Long.MAX_VALUE for backward compatibility).
- ChangeStreamMutationToTableRowFn now emits MAIN_OUT (TableRow) and
OVERSIZED_DLQ_OUT (String) tagged outputs, increments an
oversizedDecodes counter plus a decodedValueBytes distribution, and
writes a compact per-row metadata record to the existing severe DLQ
sink.
Unit tests cover UTF-8 byte counting (ASCII, 2/3-byte chars, surrogate
pairs, boundary), the cheap-path and mid-decode oversized paths,
DECODE_ERROR for malformed bytes, NO_TRANSFORMER dispatch, and the
non-proto size bound. An IT case exercises the pipeline end-to-end with
a 10 KB cap, asserting the small cell reaches BigQuery and the oversized
cell's metadata reaches the DLQ instead.
Interface:
- Move transformBounded onto ValueTransformer as a default method; override
in ProtoDecodeTransformer.
- Drop instanceof branch from ValueTransformerRegistry.transformBounded;
it now just dispatches via the interface.
- Remove the transient TransformResult field / getter from Mod. A new
Mod.buildSetCell(...) static factory returns SetCellBuildResult
(Mod + TransformResult). Mod's existing 4-arg constructor is preserved
for backward compatibility (still no bound).
Reuse:
- Replace the bespoke oversized-DLQ JSON code with a
DeadLetterQueueSanitizer pattern: new OversizedValue data class as the
side-output element type and new OversizedValueDlqSanitizer that the
pipeline pipes through a MapElements before DLQWriteTransform.
- Use ChangelogColumn.getBqColumnName() for the canonical JSON keys
(row_key, column_family, column, source_instance/cluster/table,
commit_timestamp). Named constants for the non-column fields and the
reason literal.
- Drop all Jackson plumbing from BigtableChangeStreamsToBigQuery;
ThreadLocal<Gson> in the sanitizer matches the neighbouring
BigQueryDeadLetterQueueSanitizer.
Efficiency:
- Only call setCell.getValue().toByteArray() when a transformer is
registered for the column; 100 MB cells aren't copied when nobody
consumes them.
- Utf8BoundedAppendable now uses com.google.common.base.Utf8.encodedLength
per segment instead of per-codepoint arithmetic, preserving
abort-mid-decode semantics.
- Pre-size the StringBuilder to a capped maxBytes (64 .. 1 MiB).
- Simplify commit timestamp formatting to Instant#toString (ISO-8601
UTC, nanosecond precision).
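A minimal stdlib-only sketch of the bounded-Appendable idea described above. The PR itself uses Guava's Utf8.encodedLength per segment; this version substitutes String.getBytes for byte counting, ignores unpaired-surrogate edge cases, and all names are illustrative:

```java
import java.nio.charset.StandardCharsets;

public class BoundedAppendableSketch {
  /** Unchecked signal used to abort serialization mid-stream on overflow. */
  static class OversizedException extends RuntimeException {}

  /** Appendable that tracks UTF-8 output size and throws once a byte budget is exceeded. */
  static class Utf8BoundedAppendable implements Appendable {
    private final StringBuilder sb = new StringBuilder();
    private final long maxBytes;
    private long bytes;

    Utf8BoundedAppendable(long maxBytes) {
      this.maxBytes = maxBytes;
    }

    @Override
    public Appendable append(CharSequence csq) {
      // Count the exact UTF-8 size of this segment before accepting it.
      bytes += csq.toString().getBytes(StandardCharsets.UTF_8).length;
      if (bytes > maxBytes) {
        throw new OversizedException();
      }
      sb.append(csq);
      return this;
    }

    @Override
    public Appendable append(CharSequence csq, int start, int end) {
      return append(csq.subSequence(start, end));
    }

    @Override
    public Appendable append(char c) {
      return append(String.valueOf(c));
    }

    @Override
    public String toString() {
      return sb.toString();
    }
  }
}
```

Handing such an Appendable to JsonFormat's appendTo means serialization stops as soon as the budget is blown, instead of materializing the full oversized JSON first.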
Comments:
- OversizedJsonException#fillInStackTrace short-circuits (unwind-only
control flow).
- Tighten Mod / ProtoDecodeTransformer comments to WHY, drop the Jackson
LinkedHashMap note, drop the dangling transformResult javadoc.
Tests:
- Updated ValueTransformerRegistryTest for the interface change and added
a case exercising the default transformBounded on a non-proto
transformer.
- New OversizedValueDlqSanitizerTest covers the JSON schema and the
null-timestamp case.
- Updated BigtableChangeStreamsToBigQueryIT to parse the sanitizer
envelope ({message, error_message}).
Summary
Adds optional protobuf decoding support to the Bigtable Change Streams to BigQuery template. When configured, matching cell values are decoded from proto binary to JSON and written as a STRING to the BigQuery value column.

This fills a gap: the Pub/Sub Proto to BigQuery template already supports proto decoding, but the Bigtable CDC template does not — despite Bigtable being a common store for protobuf-encoded data (as evidenced by the Query protobuf data docs).
New parameters (all optional, backwards compatible)
- protoSchemaPath — path to a FileDescriptorSet (.pb) file
- fullProtoMessageName — fully qualified message name (package.MessageName)
- protoColumnFamily — column family containing proto values
- protoColumn — column qualifier containing proto values
- preserveProtoFieldNames — keep original proto field names in the JSON output

When all four required params are set, SetCell entries matching the configured column family and qualifier are decoded via DynamicMessage.parseFrom() + JsonFormat.printer(). Non-matching cells and decode failures fall back to the existing raw value behavior.

Reuses existing infrastructure
- SchemaUtils.getProtoDomain() / SchemaUtils.createBigQuerySchema() from v2/common
- DynamicMessage + JsonFormat.Printer (same pattern as PubsubProtoToBigQuery)
- ProtoDecoder is Serializable with lazy init of transient Descriptor/Printer for worker-side initialization

Files changed
- BigtableChangeStreamToBigQueryOptions.java — 5 new pipeline parameters
- BigtableChangeStreamsToBigQuery.java — validation, ProtoDecoder wiring, updated ChangeStreamMutationToTableRowFn
- Mod.java — proto-aware SetCell constructor
- BigQueryUtils.java — VALUE_STRING formatter checks for decoded proto JSON
- ProtoDecoder.java — new class, ~120 lines
- BigQueryUtilTest.java — 5 new tests covering decode, field name preservation, fallback, and backwards compat

Test plan
- mvn test -Dtest=BigQueryUtilTest
- mvn spotless:check passes
- mvn compile succeeds