Enhances ORCA memory troubleshooting and pipe task handling #874
Pull request overview
Improves ARC’s handling of memory-related ESS failures (especially ORCA and node memory caps) and enhances pipe-mode behavior for FAILED_ESS tasks by routing them into targeted ESS troubleshooting when parser details are available.
Changes:
- Refreshes and leverages scheduler-side job logs for improved OOM detection and preserves `max_total_job_memory` across ESS status parsing.
- Refines ORCA troubleshooting to reduce CPU cores when total memory is capped, and tightens generic “memory” error handling in the scheduler.
- Updates pipe ingestion so FAILED_ESS tasks with `parser_summary` are immediately troubleshooted (with original resources preserved), plus adds/updates related tests.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| arc/scheduler_test.py | Adds unit tests for ORCA troubleshooting behavior when memory is capped vs uncapped. |
| arc/scheduler.py | Tightens auto “triple memory” behavior to only apply to the singular memory keyword (case-insensitive). |
| arc/job/trsh_test.py | Updates/extends troubleshooting tests for ORCA memory/cpu edge cases and expected memory values. |
| arc/job/trsh.py | Implements capped-memory-aware ORCA troubleshooting by reducing CPU cores while preserving total memory. |
| arc/job/pipe/pipe_coordinator_test.py | Adds tests ensuring FAILED_ESS tasks use parser_summary for immediate troubleshooting and preserve resources on fallback rerun. |
| arc/job/pipe/pipe_coordinator.py | Adds result.json parser_summary ingestion and routes failed tasks either to troubleshoot_ess or run_job with preserved resources. |
| arc/job/adapter_test.py | Adds tests for tagging capped-memory jobs and preserving max_total_job_memory through ESS parsing; tests job-log-only OOM detection. |
| arc/job/adapter.py | Refreshes additional_job_info before ESS parsing and preserves max_total_job_memory when parsing indicates a Memory error. |

    existing_keywords = list(self.job_status[1].get('keywords', list()))
    # Refresh scheduler-side logs before ESS parsing so server-reported OOMs
    # can be detected even when the output file is absent or incomplete.
    self._get_additional_job_info()

_check_job_ess_status() now calls _get_additional_job_info() unconditionally, but determine_job_status() already calls _get_additional_job_info() immediately after _check_job_ess_status() (arc/job/adapter.py:708). This results in duplicated log downloads/reads for every completed job status check. Consider removing the later call, or adding a guard here (e.g., only refresh if additional_job_info is None/stale) to avoid the extra I/O.

    - self._get_additional_job_info()
    + # Avoid re-fetching logs if additional job information was already collected
    + # earlier in the status-check flow.
    + if not getattr(self, 'additional_job_info', None):
    +     self._get_additional_job_info()


    from arc.job.factory import job_factory
    from arc.job.pipe.pipe_run import PipeRun, ingest_completed_task
    from arc.job.pipe.pipe_state import (
        TASK_FAMILY_TO_JOB_TYPE, PipeRunState, TaskState, TaskSpec,

TASK_FAMILY_TO_JOB_TYPE is imported but does not appear to be used anywhere in this module anymore. Please remove the unused import to avoid lint/CI failures and keep the module tidy.

    - TASK_FAMILY_TO_JOB_TYPE, PipeRunState, TaskState, TaskSpec,
    + PipeRunState, TaskState, TaskSpec,

0985c01 to
3aea269
When an ORCA job fails due to insufficient memory and hits a total memory limit, ARC now attempts to resolve the issue by reducing the number of CPU cores to increase the memory available per core.

- Added logic to calculate the maximum feasible CPU cores that fit within the total memory cap while meeting ORCA's per-core requirements.
- Ensures at least one core is utilized if viable, rather than failing the troubleshooting step.
- Prevents recalculating/inflating total memory when it is already constrained by a cap.
- Adjusted the conservative memory buffer added during total memory estimation from 5 GB to 3 GB.
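The core-reduction idea in this commit can be sketched roughly as follows. This is an illustrative sketch only, not the code in `arc/job/trsh.py`: the function name `max_cores_under_cap`, its parameters, and the GB units are assumptions made for the example.

```python
def max_cores_under_cap(total_memory_gb, min_memory_per_core_gb, requested_cores):
    """Return the largest core count that fits under a fixed total-memory cap.

    Hypothetical helper (not ARC's API). Returns None when even a single
    core cannot be given the required per-core memory.
    """
    if total_memory_gb <= 0 or min_memory_per_core_gb <= 0 or requested_cores < 1:
        raise ValueError("memory values must be positive and requested_cores >= 1")
    # How many cores can each receive at least the required per-core memory?
    feasible = int(total_memory_gb // min_memory_per_core_gb)
    if feasible < 1:
        # Not viable even on one core; the caller must try something else.
        return None
    # Never ask for more cores than originally requested.
    return min(requested_cores, feasible)
```

With a 32 GB cap and an assumed need of 6 GB per core, a job that requested 8 cores would drop to 5 cores while the total memory stays at 32 GB.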
- Make the general memory error check in the scheduler case-insensitive and restricted to cases where 'memory' is the sole error keyword.
- Add unit tests for ORCA memory troubleshooting to verify that CPU cores are reduced when the total memory limit is hit, while total memory is increased when no cap is present.
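The tightened check described here might look something like the sketch below. The helper name `should_triple_memory` is hypothetical, and the real condition in `arc/scheduler.py` may be written differently.

```python
def should_triple_memory(error_keywords):
    """Return True only when 'memory' is the sole error keyword.

    Hypothetical helper illustrating the tightened check: the comparison
    is case-insensitive, and any additional keyword disables the generic
    'triple the memory' fallback.
    """
    normalized = [str(keyword).lower() for keyword in error_keywords]
    return normalized == ['memory']
```

Under this sketch, `['Memory']` triggers the generic fallback, whereas `['Memory', 'max_total_job_memory']` does not and is left to more specific troubleshooting.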
…ction

When a pipe task fails with an Electronic Structure Software (ESS) error, the coordinator now attempts to read the `parser_summary` from the worker's result file. If found, the task is ejected to the scheduler via `troubleshoot_ess` rather than a blind resubmission. This allows the scheduler to apply intelligent troubleshooting logic (such as adjusting memory or CPU cores) based on the specific failure mode encountered during the pipe run.

Additionally, resource requirements (CPU and memory) are now explicitly passed when ejecting tasks to ensure consistency between the pipe environment and the scheduler.
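The routing decision described in this commit can be sketched as below. This is a simplified illustration, not the code in `arc/job/pipe/pipe_coordinator.py`: the function `route_failed_task`, the shape of the returned dict, and the assumption that `parser_summary` is a top-level key of the result file are all hypothetical.

```python
import json


def route_failed_task(result_path, cpu_cores, memory_gb):
    """Decide how a FAILED_ESS pipe task should be handed to the scheduler.

    Hypothetical helper. Returns a dict naming the action
    ('troubleshoot_ess' or 'run_job') together with the original
    resources, so they are preserved on either route.
    """
    summary = None
    try:
        with open(result_path) as f:
            result = json.load(f)
        if isinstance(result, dict):
            summary = result.get('parser_summary')
    except (OSError, json.JSONDecodeError):
        # A missing or unreadable result file means no parser details.
        summary = None
    # Targeted troubleshooting only when parser details are available.
    action = 'troubleshoot_ess' if summary else 'run_job'
    return {
        'action': action,
        'parser_summary': summary,
        'cpu_cores': cpu_cores,
        'memory_gb': memory_gb,
    }
```

Carrying `cpu_cores` and `memory_gb` on both branches mirrors the point that the fallback rerun should not silently revert to default resources.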
- Refresh scheduler-side job logs before parsing ESS status to detect server-reported out-of-memory (OOM) errors even when output files are absent or incomplete.
- Tag jobs whose requested memory is clipped to a node's capacity with a `max_total_job_memory` keyword.
- Ensure the capped memory marker is preserved during status updates to allow the troubleshooter to distinguish between hitting a node limit versus a simple insufficient memory request.
- Enable status determination from additional job info (scheduler logs) when the primary output file cannot be found.
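One way to picture the tagging and preservation steps is the sketch below. The helper names `clip_memory` and `merge_status_keywords` are hypothetical, as is the choice to preserve the marker only when the freshly parsed keywords report a memory error; the actual logic lives in `arc/job/adapter.py` and may differ.

```python
CAP_KEYWORD = 'max_total_job_memory'


def clip_memory(requested_gb, node_capacity_gb):
    """Clip a memory request to node capacity and tag the job if clipped.

    Hypothetical helper returning (memory_gb, keywords).
    """
    if requested_gb > node_capacity_gb:
        return node_capacity_gb, [CAP_KEYWORD]
    return requested_gb, []


def merge_status_keywords(existing_keywords, parsed_keywords):
    """Carry the cap marker over when a new parse reports a memory error."""
    merged = list(parsed_keywords)
    has_memory_error = any(str(k).lower() == 'memory' for k in merged)
    if has_memory_error and CAP_KEYWORD in existing_keywords and CAP_KEYWORD not in merged:
        merged.append(CAP_KEYWORD)
    return merged
```

Keeping the marker next to the `Memory` keyword is what lets a troubleshooter tell "the node cannot give more" apart from "the request was simply too small".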
Improves memory error detection and refines troubleshooting for jobs, especially those constrained by node memory limits, and enhances how failed pipe tasks are handled.
Memory Error Detection and Troubleshooting:
- Scheduler-side job logs (`additional_job_info`) are refreshed before parsing Electronic Structure Software (ESS) output, allowing OOMs to be identified even if the ESS output file is incomplete or missing.
- The `max_total_job_memory` keyword, indicating a job's memory was capped by node limits, is preserved during ESS status parsing.
- For ORCA jobs carrying the `max_total_job_memory` keyword, troubleshooting prioritizes reducing CPU cores while keeping the total job memory at its capped value, preventing further unnecessary memory increases.
- Tightens the conditions under which the `Scheduler` automatically triples job memory for generic memory errors, applying this only when a singular 'memory' keyword is present.

Pipe Task Failure Handling:
- Improves handling of `FAILED_ESS` pipe tasks. If a failed task's result includes a `parser_summary`, ARC now directly uses this detailed information to troubleshoot the job via `Scheduler.troubleshoot_ess`.