FAILED: async_save_model in train.py #2

@cJamesSmith

Training crashes when it reaches this call:

await self.policy_model.async_save_model(self.tokenizer, 1)

The process crashes with the error below. The model is Qwen2.5-7B-Instruct; all other hyperparameters are left at their defaults:

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffe09b83ecb91ca7fc4fc74ef703000000 Worker ID: 8caa2562766155907ac59f688e0052f320e18e6686830a75587f56d9 Node ID: 191d4299b7f18aaa9b43048c0aa1f155667f7160fd306392d4fc73f6 Worker IP address: 172.24.29.188 Worker port: 10244 Worker PID: 264399 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
  File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/aiops/root/CURE/optimization/train.py", line 137, in <module>
    asyncio.run(exp.run())
  File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/aiops/root/CURE/optimization/train_utils/base_exp.py", line 337, in run
    await self.trainer.train()
  File "/home/aiops/root/CURE/optimization/train_utils/rl/trainer.py", line 86, in train
    await self.policy_model.async_save_model(self.tokenizer, 1)
  File "/home/aiops/root/CURE/optimization/train_utils/rl/actors.py", line 338, in async_save_model
    return await asyncio.gather(*save_tasks)
  File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
        class_name: PolicyRayActorBase
        actor_id: e09b83ecb91ca7fc4fc74ef703000000
        pid: 264399
        namespace: 173b4d32-7d28-4ae4-b2d7-f4661f15dba1
        ip: 172.24.29.188
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

However, when I use Qwen2.5-0.5B, the process runs correctly.

If this is an OOM kill, that is surprising, because the code is running on a node with 1000 GB of RAM and 8 × A100-80G GPUs.
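For a rough sense of scale, here is a back-of-the-envelope sketch of how much host RAM the weights alone would take during a save. Everything in it is an assumption for illustration: the parameter counts are approximate, the helper `checkpoint_host_gib` is hypothetical, and I have not confirmed that each of the 8 workers actually holds a full fp32 copy of the model when `async_save_model` runs.

```python
# Back-of-the-envelope host-RAM estimate for holding full weight copies.
# All figures are assumptions for illustration, not measured values.

BYTES_PER_DTYPE = {"fp32": 4, "bf16": 2}


def checkpoint_host_gib(num_params: float, dtype: str = "fp32", copies: int = 1) -> float:
    """Return the GiB of host RAM needed for `copies` full copies of the weights."""
    return num_params * BYTES_PER_DTYPE[dtype] * copies / 2**30


if __name__ == "__main__":
    # Approximate parameter counts (assumed, not read from the checkpoints).
    models = [("Qwen2.5-0.5B", 0.5e9), ("Qwen2.5-7B-Instruct", 7.6e9)]
    for name, num_params in models:
        # copies=8 models the hypothetical worst case of one full copy per GPU worker.
        for copies in (1, 8):
            gib = checkpoint_host_gib(num_params, "fp32", copies)
            print(f"{name}: {copies} x fp32 copy ~ {gib:.0f} GiB")
```

Under these assumptions, even eight full fp32 copies of the 7B weights stay well below 1000 GB, so if the OOM killer is involved, the weights alone would not explain it. This estimate ignores optimizer state, temporary buffers, and any memory limit lower than the node's physical RAM.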
