Commit de9eb41

joechenrh and claude committed
test(dm): kill dm-masters sequentially in cleanup_process
In multi-master HA tests (3-node etcd cluster), sending SIGHUP to all masters simultaneously causes etcd to lose quorum: each master tries to transfer leadership, but no peer can accept it. The leader transfer blocks for 120s, failing the test.

Fix: kill dm-masters one at a time (SIGHUP + 30s wait per master), so each graceful shutdown completes while quorum is maintained. Escalate to SIGKILL after 30s for any stuck master.

Workers and syncers are still killed in parallel (no quorum dependency).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
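
The quorum reasoning in the message can be made concrete with generic majority arithmetic. The sketch below is only an illustration: the helper names `quorum_size` and `failure_tolerance` are hypothetical and are not part of this commit or of etcd, and the sketch says nothing about how dm-master itself handles leader transfer.

```shell
#!/usr/bin/env bash
# Illustrative arithmetic only; these helper names are hypothetical and do not
# appear in the commit.

# Smallest majority of an n-member cluster.
quorum_size() {
	echo $(($1 / 2 + 1))
}

# Number of members that can be down while a majority is still reachable.
failure_tolerance() {
	echo $(($1 - ($1 / 2 + 1)))
}

for n in 1 3 5; do
	echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(failure_tolerance "$n")"
done
```

For the 3-member case used in these tests, a majority is 2 members, so only one member can be unavailable at a time without losing quorum.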
1 parent 2caf0e5 commit de9eb41

1 file changed (+17 -5 lines changed)

dm/tests/_utils/test_prepare

Lines changed: 17 additions & 5 deletions
@@ -18,15 +18,27 @@ function cleanup_data_upstream() {
 }
 
 function cleanup_process() {
-	dm_master_num=$(ps aux >temp && grep "dm-master.test" temp | wc -l && rm temp)
-	echo "$dm_master_num dm-master alive"
-	pkill -hup dm-master.test 2>/dev/null || true
+	# Kill dm-masters one at a time to maintain etcd quorum during graceful
+	# shutdown. Killing all 3 simultaneously causes etcd to lose quorum,
+	# blocking leader transfer indefinitely.
+	local pids
+	pids=$(pgrep -f dm-master.test || true)
+	echo "$(echo "$pids" | wc -w) dm-master alive"
+	for pid in $pids; do
+		kill -HUP $pid 2>/dev/null || true
+		for _ in $(seq 1 30); do
+			if ! kill -0 $pid 2>/dev/null; then break; fi
+			sleep 1
+		done
+		# Escalate if still alive after 30s
+		kill -9 $pid 2>/dev/null || true
+	done
 
-	dm_worker_num=$(ps aux >temp && grep "dm-worker.test" temp | wc -l && rm temp)
+	dm_worker_num=$(pgrep -c -f dm-worker.test || true)
 	echo "$dm_worker_num dm-worker alive"
 	pkill -hup dm-worker.test 2>/dev/null || true
 
-	dm_syncer_num=$(ps aux >temp && grep "dm-syncer.test" temp | wc -l && rm temp)
+	dm_syncer_num=$(pgrep -c -f dm-syncer.test || true)
 	echo "$dm_syncer_num dm-syncer alive"
 	pkill -hup dm-syncer.test 2>/dev/null || true
 
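
The per-master loop in this diff (SIGHUP, poll with `kill -0`, then SIGKILL) could also be factored into a small helper. The following is a minimal sketch, not the committed code: the function name `stop_gracefully`, its arguments, and its return codes are assumptions made for illustration. Unlike the loop in the diff, it sends SIGKILL only when the process is still alive after the timeout, and it reports whether escalation was needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the commit.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
# Returns 0 if the process exited after SIGHUP (or was already gone),
# and 1 if it had to be terminated with SIGKILL.
stop_gracefully() {
	local pid=$1 timeout=${2:-30} i

	# Ask for a graceful shutdown; if the signal cannot be delivered,
	# treat the process as already gone.
	kill -HUP "$pid" 2>/dev/null || return 0

	# Poll once per second until the process disappears or the timeout expires.
	for ((i = 0; i < timeout; i++)); do
		kill -0 "$pid" 2>/dev/null || return 0
		sleep 1
	done

	# Escalate only if the process is still alive after the timeout.
	if kill -0 "$pid" 2>/dev/null; then
		kill -KILL "$pid" 2>/dev/null || true
		return 1
	fi
	return 0
}
```

Under these assumptions, the master loop would read `for pid in $pids; do stop_gracefully "$pid" 30 || echo "dm-master $pid needed SIGKILL"; done`, which keeps the one-at-a-time ordering while making stuck shutdowns visible in the test log.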
