What happened + What you expected to happen
Bug
PlacementGroupCleaner._monitor_loop has a race condition where a placement group registered by the controller can be orphaned if the controller dies at the wrong moment.
This can happen during worker group restarts, where the controller replaces an old PG with a new one and registers it just before crashing.
Expected behavior
The cleaner should clean up the most recently registered PG.
Actual behavior
The cleaner exits with a stale curr_placement_group, leaving the new PG as an orphaned resource in the Ray cluster.
Versions / Dependencies
Ray v2.55.0
Reproduction script
Race window
- queue.get(timeout=check_interval_s) times out; curr_placement_group is stale
- Controller puts new_pg into the queue and dies
- is_actor_alive returns False
- Cleaner calls _cleanup_placement_group(curr_placement_group) with the stale value
- Cleaner exits; new_pg is never cleaned up
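
The race window above can be modelled without a Ray cluster. The following is a minimal sketch, assuming the monitor loop has roughly this shape; `monitor_loop`, `drain_latest`, `FakeController`, and `run_scenario` are hypothetical names for illustration and are not Ray's actual implementation. The fake controller registers `new_pg` and "dies" between the cleaner's queue timeout and its liveness check. Draining the queue once more after the controller is found dead is one possible fix, not necessarily the one Ray should adopt.

```python
import queue


def drain_latest(q, current):
    """Return the newest item in q, or `current` if q is empty."""
    while True:
        try:
            current = q.get_nowait()
        except queue.Empty:
            return current


def monitor_loop(q, is_controller_alive, check_interval_s=0.01, drain_on_exit=False):
    """Simplified stand-in for the monitor loop; returns the PG it would clean up."""
    curr_pg = None
    while True:
        try:
            curr_pg = q.get(timeout=check_interval_s)
        except queue.Empty:
            pass  # timed out: curr_pg keeps its previous value
        if not is_controller_alive():
            if drain_on_exit:
                # Possible fix: pick up anything registered after the last get().
                curr_pg = drain_latest(q, curr_pg)
            return curr_pg


class FakeController:
    """Hypothetical controller that registers new_pg and dies inside the race window."""

    def __init__(self, q):
        self.q = q
        self.checks = 0

    def is_alive(self):
        self.checks += 1
        if self.checks == 1:
            return True  # first check: controller healthy, old_pg registered
        self.q.put("new_pg")  # register the replacement PG ...
        return False  # ... and die before the cleaner polls the queue again


def run_scenario(drain_on_exit):
    q = queue.Queue()
    q.put("old_pg")  # PG registered before the restart
    controller = FakeController(q)
    return monitor_loop(q, controller.is_alive, drain_on_exit=drain_on_exit)
```

In this model, `run_scenario(drain_on_exit=False)` returns `"old_pg"` (the stale value, matching the actual behavior), while `run_scenario(drain_on_exit=True)` returns `"new_pg"` (the expected behavior).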
Issue Severity
None