test(dm): add MariaDB source smoke test and next-gen integration test#12599

Open
joechenrh wants to merge 17 commits into pingcap:master from joechenrh:mariadb-source-smoke-dm

Conversation

@joechenrh
Contributor

@joechenrh joechenrh commented Apr 9, 2026

What problem does this PR solve?

Issue Number: ref #12615

What is changed and how it works?

Enable DM integration tests to run on next-gen TiDB (Cloud Storage Engine edition) alongside classic TiDB. All 13 test groups (G00–G11 + TLS_GROUP) pass on next-gen CI.

  • Add MariaDB source smoke integration test case (mariadb_source)
  • Add full next-gen cluster startup (MinIO + PD + TiKV + tikv-worker + SYSTEM TiDB + user TiDB)
  • Adapt test scripts for next-gen compatibility (see table below)
  • Simplify cluster lifecycle with shared functions in cluster_lib.sh
  • Fix pre-existing flaky tests (cleanup_process etcd quorum hang, print_status infinite loop)

Infrastructure changes

  • cluster_lib.sh (new): centralizes cluster lifecycle ops — cleanup_tidb_server, cleanup_downstream_cluster, run_tidb_server, run_downstream_cluster, run_downstream_cluster_with_tls
  • run_downstream_cluster_nextgen (new): starts MinIO + PD + TiKV + tikv-worker + SYSTEM TiDB + user TiDB
  • run_downstream_cluster_with_tls_nextgen (new): restarts user TiDB with client-facing TLS
  • run_downstream_cluster_classic (renamed from run_downstream_cluster): classic PD + TiKV + TiDB
  • env_variables: centralized next-gen vars (PD_ADDR, TIKV_WORKER_ADDR, KEYSPACE_NAME, MINIO_*, etc.) under NEXT_GEN=1 guard
  • run_tidb_server: unified startup — unistore/tikv via PD_ADDR, next-gen keyspace config, TLS detection
  • cleanup_tidb_server: port-4000 targeted (preserves SYSTEM TiDB on 4001), removes temp-storage lock
  • cleanup_process: sequential dm-master kill (maintains etcd quorum), SIGKILL for workers
  • ha_cases_lib.sh: move print_debug_status from ha_cases2 (fix command-not-found in ha_cases3)
  • Don't set tidb_ddl_enable_fast_reorg=0 / tidb_enable_dist_task=0 on next-gen (breaks DXF-based DDL)

Test adaptations for next-gen (by group)

| Group | Test | Change |
|---|---|---|
| G02 | check_task | Replace GRANT ALL with specific privileges + CONFIG |
| G03 | dmctl_basic config diff | Session block normalization (next-gen omits tidb_txn_mode) |
| G05 | many_tables Phase 2 | import-into mode + existing MinIO instead of Lightning physical |
| G07 | shardddl1 DML merge | Threshold relaxed (>2 instead of >5) |
| G09 | openapi test_delete_task | cleanup_tidb_server (port-4000 targeted) |
| G10 | new_relay config export/import | Session normalization + patch before config import |
| G10 | new_relay / all_mode | cleanup_tidb_server instead of pkill tidb-server |
| G10 | import_into_mode | PID-targeted MinIO kill (preserve cluster MinIO) |
| G10 | print_status | Fix infinite loop (missing i++) + tolerate SIGKILL exit |
| G11 | sync_collation | Explicit COLLATE utf8_general_ci (next-gen defaults utf8 to utf8_bin) |
| G11 | sql_mode | Remove NO_AUTO_CREATE_USER (not in MySQL 8.0 / next-gen) |

Tests skipped on next-gen

| Group | Test | Reason |
|---|---|---|
| G09 | new_collation_off | Next-gen can't disable new collation framework |
| G09 | s3_dumpling_lightning | Lightning physical mode version gate (26.x > max 10.0.0) |
| G09 | openapi test_tls | Lightning assumes HTTPS on status port; needs cluster-ssl on PD/TiKV |
| (none) | mariadb_source | No MariaDB sidecar in next-gen CI pod (new test, not yet in any group) |

Check List

Tests

  • Integration test

Questions

Will it cause performance regression or break compatibility?

No. This only affects DM integration test infrastructure. No production code changes.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None

@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, area/dm, size/L, and release-note-none labels and removed the do-not-merge/needs-linked-issue and do-not-merge/release-note-label-needed labels on Apr 9, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces integration tests for MariaDB as a data source in DM. It includes environment variable updates, configuration files, test data, and a new test runner script. The main test entry point was also updated to conditionally manage MySQL and MariaDB services. Feedback focuses on improving test isolation and maintainability, specifically by disabling automatic master resets in MariaDB-only environments, ensuring consistent SQL modes and server variables for MariaDB, and using wildcard patterns for test case matching.

Comment thread dm/tests/mariadb_source/run.sh
Comment thread dm/tests/run.sh
Comment thread dm/tests/run.sh
Comment thread dm/tests/run.sh Outdated
@joechenrh
Contributor Author

/test ?

@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 10, 2026

@joechenrh: The following commands are available to trigger required jobs:

/test pull-build
/test pull-cdc-integration-kafka-test
/test pull-cdc-integration-mysql-test
/test pull-cdc-integration-pulsar-test
/test pull-cdc-integration-storage-test
/test pull-check
/test pull-dm-compatibility-test
/test pull-dm-integration-test
/test pull-error-log-review
/test pull-syncdiff-integration-test
/test pull-unit-test-cdc
/test pull-verify
/test wip-pull-unit-test-dm
/test wip-pull-unit-test-engine

The following commands are available to trigger optional jobs:

/test pull-dm-integration-test-next-gen

Use /test all to run the following jobs that were automatically triggered:

pingcap/tiflow/ghpr_verify
pingcap/tiflow/pull_dm_compatibility_test
pingcap/tiflow/pull_dm_integration_test
pingcap/tiflow/pull_dm_integration_test_next_gen
pingcap/tiflow/pull_syncdiff_integration_test
pull-build
pull-check
pull-error-log-review
pull-unit-test-cdc

@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh joechenrh marked this pull request as ready for review April 10, 2026 06:17
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress label on Apr 10, 2026
@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch 3 times, most recently from 222df5d to 1907227 Compare April 14, 2026 06:23
@codecov

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (master@9fbde6e).
⚠️ Report is 4 commits behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files
| Components | Coverage Δ |
|---|---|
| cdc | 57.3652% <ø> (?) |
| dm | 49.1598% <ø> (?) |
| engine | 50.7110% <ø> (?) |

| Flag | Coverage Δ |
|---|---|
| cdc | 57.3652% <ø> (?) |
| unit | 53.4109% <ø> (?) |

Flags with carried forward coverage won't be shown.

@@             Coverage Diff             @@
##             master     #12599   +/-   ##
===========================================
  Coverage          ?   53.4109%           
===========================================
  Files             ?       1011           
  Lines             ?     139975           
  Branches          ?          0           
===========================================
  Hits              ?      74762           
  Misses            ?      59592           
  Partials          ?       5621           

@ti-chi-bot ti-chi-bot bot added the size/XL label and removed the size/L label on Apr 14, 2026
@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch from 9314f56 to 4293341 Compare April 15, 2026 09:07
@ti-chi-bot ti-chi-bot bot removed the size/XL label on Apr 15, 2026
Add cleanup_downstream_cluster to test_prepare: handles next-gen
(port-4000 TiDB only) vs classic (tidb+tikv+pd + unistore data)
teardown in one function.

Replace all raw killall/pkill tidb-server patterns across 9 test
scripts with cleanup_tidb_server or cleanup_downstream_cluster.
This eliminates ~30 duplicated kill+wait+cleanup lines and ensures
next-gen SYSTEM TiDB is preserved consistently.

Files simplified: new_collation_off, tls, openapi, many_tables,
lightning_mode, s3_dumpling_lightning, import_into_mode, util.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch from 83fe810 to ef58291 Compare April 17, 2026 10:09
The .normalized files were written inside /tmp/configs/tasks/ which
config import scans for task configs. Move normalized copies to /tmp/
with distinct names so they don't interfere with config import.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
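
The fix described in this commit message (keep normalized copies out of the directory that config import scans) can be illustrated with a minimal sketch. The helper name `write_normalized_copy`, the sample file contents, and the temporary directories are all assumptions made for the example; the PR's own helper is `normalize_session_block`, whose implementation is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: write_normalized_copy and the sample file contents are
# hypothetical; the PR's real helper is normalize_session_block.
write_normalized_copy() {
	local src=$1 out_dir=$2
	local out
	out="$out_dir/$(basename "$src").normalized"
	# Drop the key that next-gen omits so both sides diff cleanly.
	grep -v 'tidb_txn_mode' "$src" >"$out" || true
	echo "$out"
}

task_dir=$(mktemp -d)    # stands in for the directory that config import scans
scratch_dir=$(mktemp -d) # normalized copies live here instead
printf 'name: test\ntidb_txn_mode: optimistic\n' >"$task_dir/test.yaml"

normalized=$(write_normalized_copy "$task_dir/test.yaml" "$scratch_dir")
cat "$normalized" # the copy without the tidb_txn_mode line
ls "$task_dir"    # still contains only the original task file
rm -rf "$task_dir" "$scratch_dir"
```

Because the copy is written to a separate directory, anything that globs the task directory sees only real task configs.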
@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

… port

Lightning's loader fetches TiDB settings via the HTTP status port
(10080). When TiDB has [security] ssl-* configured, Lightning assumes
the status port serves HTTPS. But ssl-* only enables TLS on the
mysql port — the status port needs cluster-ssl-* (which requires
TLS-enabled PD/TiKV) to serve HTTPS. Skip until the next-gen
cluster supports full TLS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch from 2e2261a to 1c87655 Compare April 17, 2026 15:21
In multi-master HA tests (3-node etcd cluster), sending SIGHUP to
all masters simultaneously causes etcd to lose quorum — each master
tries to transfer leadership but no peer can accept it. The leader
transfer blocks for 120s, failing the test.

Fix: kill dm-masters one at a time (SIGHUP + 30s wait per master),
so each graceful shutdown completes while quorum is maintained.
Escalate to SIGKILL after 30s for any stuck master. Workers and
syncers are still killed in parallel (no quorum dependency).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
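
The one-at-a-time shutdown described in this commit message can be sketched roughly as below. This is not the PR's `cleanup_process`; `stop_sequentially` is a hypothetical helper that takes plain PIDs, and the demo uses `sleep` processes and a 2-second grace period in place of dm-master and the 30-second wait.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the PR's cleanup_process: stops the given PIDs one
# at a time, sending SIGHUP first and SIGKILL if the process outlives the wait.
stop_sequentially() {
	local grace_seconds=$1
	shift
	local pid waited
	for pid in "$@"; do
		kill -HUP "$pid" 2>/dev/null || continue # already gone
		waited=0
		while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace_seconds" ]; do
			sleep 1
			waited=$((waited + 1))
		done
		if kill -0 "$pid" 2>/dev/null; then
			kill -KILL "$pid" 2>/dev/null || true # escalate for a stuck process
		fi
		wait "$pid" 2>/dev/null || true # reap it if it is our child
	done
}

# Demo with stand-in processes.
sleep 30 &
first=$!
sleep 30 &
second=$!
stop_sequentially 2 "$first" "$second"
echo "both processes stopped"
```

Handling each PID to completion before signalling the next is what keeps the remaining peers available while one member shuts down gracefully.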
@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch from 1c87655 to de9eb41 Compare April 17, 2026 15:23
@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

joechenrh and others added 2 commits April 17, 2026 12:02
…aster kill

Two fixes:

1. many_tables Phase 2: run_downstream_cluster already starts TiDB on
   port 4000 (classic). The extra run_tidb_server after it tried to
   start a second TiDB → fslock crash → Phase 2 failure → worker stuck.
   Now run_tidb_server only runs on next-gen (where run_downstream_cluster
   is not called).

2. cleanup_process: kill dm-masters one at a time (SIGHUP + 30s wait)
   to maintain etcd quorum during graceful shutdown. Previously all 3
   masters received SIGHUP simultaneously → etcd lost quorum → leader
   transfer blocked 120s. Workers use SIGKILL directly (can be stuck
   in long Lightning loads).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Missing i++ in the dm-worker exit log wait loop caused an infinite
   loop when the log message wasn't found (exposed on next-gen where
   cleanup_process uses SIGKILL — worker doesn't log exit message).

2. Make the timeout non-fatal since the exit log is just a flush
   indicator, not a required assertion. The actual status checks
   follow after.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
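
Both points in this commit message (a counter that actually advances, and a timeout that does not abort the test) can be shown in a small sketch. `wait_for_log_line` and the log text are hypothetical, chosen only for illustration, and are not the code in print_status.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait_for_log_line is not a function from the PR.
# Polls a log file for a pattern, giving up after max_tries attempts.
wait_for_log_line() {
	local file=$1 pattern=$2 max_tries=${3:-10} interval=${4:-1}
	local i=0
	while [ "$i" -lt "$max_tries" ]; do
		if grep -q -- "$pattern" "$file" 2>/dev/null; then
			return 0
		fi
		sleep "$interval"
		i=$((i + 1)) # without this increment the loop never terminates
	done
	return 1
}

# Non-fatal use: a timeout only prints a note and the script carries on.
log_file=$(mktemp)
echo "worker started" >"$log_file"
wait_for_log_line "$log_file" "worker exited" 3 0.1 ||
	echo "exit log not found, continuing with status checks"
rm -f "$log_file"
```

The `||` after the call keeps the non-zero return from terminating a script that runs under `set -e`.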
@joechenrh
Contributor Author

/retest

- Fix shfmt indentation: cluster_lib.sh (4-space→tab), run_group.sh
  (space-tab→tab on G10), run.sh (case statement extra tab)
- mariadb_source: set RESET_MASTER=false before sourcing test_prepare
- run.sh: add MariaDB SQL_MODE cleanup in stop_services
- run.sh: add set_default_variables for MariaDB in start_services
- run.sh: widen case pattern from mariadb_source) to mariadb_*)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
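
The last item in this commit message, widening the case pattern from `mariadb_source)` to `mariadb_*)`, behaves as in the sketch below. `upstream_for_case` is an illustrative function name and `mariadb_gtid` is a made-up test name; neither comes from run.sh.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher: upstream_for_case is an illustrative name, and
# mariadb_gtid is a made-up test name used only to exercise the glob.
upstream_for_case() {
	case "$1" in
	mariadb_*) echo "mariadb" ;; # glob: matches mariadb_source and any later mariadb_ test
	*) echo "mysql" ;;
	esac
}

for name in mariadb_source mariadb_gtid all_mode; do
	echo "$name -> $(upstream_for_case "$name")"
done
```

With the glob, a newly added test whose name starts with `mariadb_` is routed correctly without another edit to the case statement.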
joechenrh and others added 5 commits April 19, 2026 22:55
…rver

Classic TiDB (unistore) accepts -keyspace-name and -tidb-service-scope
as no-ops, verified locally. Remove the NEXT_GEN guard so both classic
and next-gen use the same config path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Classic TiDB rejects keyspace-name in config ("invalid config: keyspace
name or standby mode is not supported for classic TiDB"). Restore the
NEXT_GEN guard in run_tidb_server and tls/run.sh.

Also remove the diagnostic DROP DATABASE logging from run_sql that was
left over from debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mplify run.sh

- env_variables: add TIDB_EXTRA_ARGS under NEXT_GEN guard
- tls/run.sh: use TIDB_EXTRA_ARGS instead of inline NEXT_GEN check
- test_prepare: add shared normalize_session_block() function
- config.sh, new_relay/run.sh: use normalize_session_block()
- run.sh: move test_case parsing back to original position, remove
  redundant initial need_mariadb/need_mysql defaults

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
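
One way to picture the `TIDB_EXTRA_ARGS` guard from this commit message is the sketch below. Only the names `NEXT_GEN`, `KEYSPACE_NAME`, `TIDB_EXTRA_ARGS`, and the `-keyspace-name` flag come from the PR text; the function `build_tidb_extra_args`, the keyspace values, and the way the arguments are assembled are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: build_tidb_extra_args and the keyspace values are assumptions;
# the real env_variables file may set TIDB_EXTRA_ARGS differently.
build_tidb_extra_args() {
	if [ "${NEXT_GEN:-0}" = "1" ]; then
		echo "-keyspace-name ${KEYSPACE_NAME:-example_keyspace}"
	else
		echo "" # classic TiDB rejects keyspace settings, so pass nothing
	fi
}

TIDB_EXTRA_ARGS=$(NEXT_GEN=1 KEYSPACE_NAME=ks1 build_tidb_extra_args)
echo "next-gen: '${TIDB_EXTRA_ARGS}'"
TIDB_EXTRA_ARGS=$(NEXT_GEN=0 build_tidb_extra_args)
echo "classic:  '${TIDB_EXTRA_ARGS}'"
```

Computing the value in one place lets callers such as tls/run.sh append the variable unconditionally instead of repeating the `NEXT_GEN` check inline.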
Fixes pull-check failure: "mariadb_source is not added to any group"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MariaDB sidecar is not yet available in CI (PingCAP-QE/ci#4496).
Exclude mariadb_source from group assignment and the "check others"
validation until the CI change is merged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joechenrh
Contributor Author

/retest

@joechenrh
Contributor Author

/retest

((i++)) when i=0 returns exit code 1 (pre-increment value is 0),
which set -e treats as failure. Use i=$((i + 1)) instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
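
The behaviour described in this commit message can be reproduced in isolation. The sketch below assumes bash; the two function names are illustrative and do not appear in the PR. Each variant runs in a child shell so that the early exit does not take down the calling script.

```shell
#!/usr/bin/env bash
# Illustrative only: function names are hypothetical, not taken from the PR.

# `((i++))` evaluates to the value of i before the increment. When that value
# is 0, the arithmetic command returns status 1 and `set -e` exits the shell.
count_with_arith_command() {
	bash -c 'set -e; i=0; ((i++)); echo "reached i=$i"'
}

# An assignment with arithmetic expansion always returns status 0.
count_with_assignment() {
	bash -c 'set -e; i=0; i=$((i + 1)); echo "reached i=$i"'
}

if count_with_arith_command; then
	echo "arith command: survived"
else
	echo "arith command: shell exited early"
fi
count_with_assignment
```

The first variant never reaches its `echo`, while the second prints `reached i=1`.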
@joechenrh
Contributor Author

/retest

Labels

area/dm Issues or PRs related to DM. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.


1 participant