The process crashes when the model is Qwen2.5-7B-Instruct and all other hyperparameters are left at their defaults:
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffe09b83ecb91ca7fc4fc74ef703000000 Worker ID: 8caa2562766155907ac59f688e0052f320e18e6686830a75587f56d9 Node ID: 191d4299b7f18aaa9b43048c0aa1f155667f7160fd306392d4fc73f6 Worker IP address: 172.24.29.188 Worker port: 10244 Worker PID: 264399 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/aiops/root/CURE/optimization/train.py", line 137, in <module>
asyncio.run(exp.run())
File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/aiops/root/CURE/optimization/train_utils/base_exp.py", line 337, in run
await self.trainer.train()
File "/home/aiops/root/CURE/optimization/train_utils/rl/trainer.py", line 86, in train
await self.policy_model.async_save_model(self.tokenizer, 1)
File "/home/aiops/root/CURE/optimization/train_utils/rl/actors.py", line 338, in async_save_model
return await asyncio.gather(*save_tasks)
File "/home/aiops/root/miniconda3/envs/cu/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
return (yield from awaitable.__await__())
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: PolicyRayActorBase
actor_id: e09b83ecb91ca7fc4fc74ef703000000
pid: 264399
namespace: 173b4d32-7d28-4ae4-b2d7-f4661f15dba1
ip: 172.24.29.188
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
However, with Qwen2.5-0.5B the process runs correctly.
If this is an OOM problem, that would be surprising, because the code is running on a node with 1000 GB of RAM and 8×A100-80G GPUs.
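One thing that could be checked here is whether the job is subject to a memory limit lower than the node's physical RAM, for example when it runs inside a container or under a scheduler. The sketch below is only a diagnostic idea, assuming Linux with the standard cgroup v1/v2 file locations. The helper names (`parse_cgroup_limit`, `read_cgroup_limit`, `effective_limit`) are hypothetical and are not part of CURE or Ray.

```python
import os
from pathlib import Path
from typing import Optional

# Standard Linux locations of the cgroup memory limit (assumption: the job
# runs on Linux and these files are visible from inside the job).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def parse_cgroup_limit(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when the cgroup reports no limit."""
    raw = raw.strip()
    if raw == "max":  # cgroup v2 spelling of "unlimited"
        return None
    return int(raw)


def read_cgroup_limit() -> Optional[int]:
    """Read the first cgroup limit file that exists and parses cleanly."""
    for path in CGROUP_LIMIT_FILES:
        try:
            return parse_cgroup_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None


def physical_ram_bytes() -> int:
    """Total physical RAM of the host as reported by the OS."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def effective_limit(cgroup_limit: Optional[int], physical: int) -> int:
    """Memory the process can actually use: the smaller of the two bounds."""
    if cgroup_limit is None:
        return physical
    # cgroup v1 reports "unlimited" as a huge number, so min() covers it too.
    return min(cgroup_limit, physical)


if __name__ == "__main__":
    gib = 1024 ** 3
    physical = physical_ram_bytes()
    limit = effective_limit(read_cgroup_limit(), physical)
    print(f"physical RAM:    {physical / gib:.1f} GiB")
    print(f"effective limit: {limit / gib:.1f} GiB")
```

If the effective limit printed here is far below 1000 GB, the OOM killer explanation in the Ray message becomes more plausible even on a large node. If both numbers match, the SIGSEGV branch of the message is worth a closer look in the dead worker's logs.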
The crash occurs at CURE/optimization/train_utils/rl/trainer.py, line 86 (commit 6c2e0ac), which is the `await self.policy_model.async_save_model(self.tokenizer, 1)` call shown in the traceback.