Skip to content

[Train] Race condition in PlacementGroupCleaner: newly registered PG orphaned on controller death #62753

@SohamRajpure

Description

@SohamRajpure

What happened + What you expected to happen

Bug

PlacementGroupCleaner._monitor_loop has a race condition where a placement group registered by the controller can be orphaned if the controller dies at the wrong moment.

This can happen during worker group restarts, where the controller replaces an old PG with a new one and registers it just before crashing.

Expected behavior

The cleaner should clean up the most recently registered PG.

Actual behavior

The cleaner exits with a stale curr_placement_group, leaving the new PG as an orphaned resource in the Ray cluster.

Versions / Dependencies

Ray v2.55.0

Reproduction script

Race window

  1. queue.get(timeout=check_interval_s) times out — curr_placement_group is stale
  2. Controller puts new_pg into the queue and dies
  3. is_actor_alive returns False
  4. Cleaner calls _cleanup_placement_group(curr_placement_group) with the stale value
  5. Cleaner exits — new_pg is never cleaned up

Issue Severity

None

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething that is supposed to be working; but isn'tcommunity-backlogstabilitytrainRay Train Related IssuetriageNeeds triage (eg: priority, bug/not-bug, and owning component)

    Type

    No type

    Projects

    Status

    In progress

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions