[Train] Race condition in PlacementGroupCleaner: newly registered PG orphaned on controller death

### What happened + What you expected to happen

## Bug

`PlacementGroupCleaner._monitor_loop` has a race condition where a placement group registered by the controller can be orphaned if the controller dies at the wrong moment.

This can happen during worker group restarts, where the controller replaces an old PG with a new one and registers it just before crashing.

## Expected behavior
The cleaner should clean up the most recently registered PG.

## Actual behavior
The cleaner exits with a stale curr_placement_group, leaving the new PG as an orphaned resource in the Ray cluster.

### Versions / Dependencies

Ray v2.55.0

### Reproduction script

## Race window

1. queue.get(timeout=check_interval_s) times out — `curr_placement_group` is stale
2. Controller puts `new_pg` into the queue and dies
3. `is_actor_alive` returns False
4. Cleaner calls `_cleanup_placement_group(curr_placement_group)` with the stale value
5. Cleaner exits — `new_pg` is never cleaned up

### Issue Severity

None

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Train] Race condition in PlacementGroupCleaner: newly registered PG orphaned on controller death #62753

What happened + What you expected to happen

Bug

Expected behavior

Actual behavior

Versions / Dependencies

Reproduction script

Race window

Issue Severity

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Train] Race condition in PlacementGroupCleaner: newly registered PG orphaned on controller death #62753

Description

What happened + What you expected to happen

Bug

Expected behavior

Actual behavior

Versions / Dependencies

Reproduction script

Race window

Issue Severity

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions