
Enhances ORCA memory troubleshooting and pipe task handling#874

Open
calvinp0 wants to merge 4 commits into main from orca_mem_node

Conversation

@calvinp0
Member

Improves memory-error detection, refines troubleshooting for jobs constrained by node memory limits, and enhances how failed pipe tasks are handled.

  • Memory Error Detection and Troubleshooting:

    • Improves the detection of server-reported Out-Of-Memory (OOM) errors by ensuring cluster-side logs (additional_job_info) are refreshed before parsing Electronic Structure Software (ESS) output, allowing OOMs to be identified even if the ESS output file is incomplete or missing.
    • Ensures the max_total_job_memory keyword, indicating a job's memory was capped by node limits, is preserved during ESS status parsing.
    • Refines ORCA troubleshooting logic:
      • When a job has the max_total_job_memory keyword, troubleshooting prioritizes reducing CPU cores while keeping the total job memory at its capped value, preventing further unnecessary memory increases.
      • Includes more robust logic for reducing CPU cores to avoid invalid configurations (e.g., zero or negative cores).
    • Clarifies conditions for the Scheduler to automatically triple job memory for generic memory errors, applying this only when a singular 'memory' keyword is present.
  • Pipe Task Failure Handling:

    • Enables immediate, intelligent troubleshooting for FAILED_ESS pipe tasks. If a failed task's result includes a parser_summary, ARC now directly uses this detailed information to troubleshoot the job via Scheduler.troubleshoot_ess.
    • This intelligent routing bypasses a blind re-run, allowing for more targeted adjustments to job parameters (e.g., CPU, memory) based on the detailed parsing output.
    • Original job resources (CPU cores, memory) are correctly propagated when ejecting tasks to the scheduler for re-run or troubleshooting.
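The tightened "triple memory" condition above can be sketched as follows (the helper name is hypothetical; the actual check lives in arc/scheduler.py and may differ in detail):

```python
def should_triple_memory(keywords):
    """Return True only when 'memory' (case-insensitive) is the sole
    error keyword reported for the job.

    Hypothetical helper illustrating the condition described above;
    the real arc/scheduler.py implementation may differ.
    """
    return len(keywords) == 1 and keywords[0].lower() == 'memory'
```

With this guard, a job tagged with both `memory` and `max_total_job_memory` is routed to the capped-memory troubleshooting path instead of a blind memory increase.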


Copilot AI left a comment


Pull request overview

Improves ARC’s handling of memory-related ESS failures (especially ORCA and node memory caps) and enhances pipe-mode behavior for FAILED_ESS tasks by routing them into targeted ESS troubleshooting when parser details are available.

Changes:

  • Refreshes and leverages scheduler-side job logs for improved OOM detection and preserves max_total_job_memory across ESS status parsing.
  • Refines ORCA troubleshooting to reduce CPU cores when total memory is capped, and tightens generic “memory” error handling in the scheduler.
  • Updates pipe ingestion so FAILED_ESS tasks with a parser_summary are immediately troubleshot (with original resources preserved), and adds/updates related tests.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

  • arc/scheduler_test.py: Adds unit tests for ORCA troubleshooting behavior when memory is capped vs. uncapped.
  • arc/scheduler.py: Tightens auto "triple memory" behavior to apply only to the singular memory keyword (case-insensitive).
  • arc/job/trsh_test.py: Updates/extends troubleshooting tests for ORCA memory/CPU edge cases and expected memory values.
  • arc/job/trsh.py: Implements capped-memory-aware ORCA troubleshooting by reducing CPU cores while preserving total memory.
  • arc/job/pipe/pipe_coordinator_test.py: Adds tests ensuring FAILED_ESS tasks use parser_summary for immediate troubleshooting and preserve resources on fallback rerun.
  • arc/job/pipe/pipe_coordinator.py: Adds result.json parser_summary ingestion and routes failed tasks either to troubleshoot_ess or run_job with preserved resources.
  • arc/job/adapter_test.py: Adds tests for tagging capped-memory jobs and preserving max_total_job_memory through ESS parsing; tests job-log-only OOM detection.
  • arc/job/adapter.py: Refreshes additional_job_info before ESS parsing and preserves max_total_job_memory when parsing indicates a Memory error.


Comment thread arc/job/adapter.py
existing_keywords = list(self.job_status[1].get('keywords', list()))
# Refresh scheduler-side logs before ESS parsing so server-reported OOMs
# can be detected even when the output file is absent or incomplete.
self._get_additional_job_info()

Copilot AI Apr 19, 2026


_check_job_ess_status() now calls _get_additional_job_info() unconditionally, but determine_job_status() already calls _get_additional_job_info() immediately after _check_job_ess_status() (arc/job/adapter.py:708). This results in duplicated log downloads/reads for every completed job status check. Consider removing the later call, or adding a guard here (e.g., only refresh if additional_job_info is None/stale) to avoid the extra I/O.

Suggested change:

    - self._get_additional_job_info()
    + # Avoid re-fetching logs if additional job information was already collected
    + # earlier in the status-check flow.
    + if not getattr(self, 'additional_job_info', None):
    +     self._get_additional_job_info()

from arc.job.factory import job_factory
from arc.job.pipe.pipe_run import PipeRun, ingest_completed_task
from arc.job.pipe.pipe_state import (
TASK_FAMILY_TO_JOB_TYPE, PipeRunState, TaskState, TaskSpec,

Copilot AI Apr 19, 2026


TASK_FAMILY_TO_JOB_TYPE is imported but does not appear to be used anywhere in this module anymore. Please remove the unused import to avoid lint/CI failures and keep the module tidy.

Suggested change:

    - TASK_FAMILY_TO_JOB_TYPE, PipeRunState, TaskState, TaskSpec,
    + PipeRunState, TaskState, TaskSpec,

calvinp0 force-pushed the orca_mem_node branch 2 times, most recently from 0985c01 to 3aea269 (April 20, 2026 06:27)
When an ORCA job fails due to insufficient memory while hitting a total memory limit, ARC now attempts to resolve the issue by reducing the number of CPU cores, increasing the memory available per core.

- Added logic to calculate the maximum feasible number of CPU cores that fits within the total memory cap while meeting ORCA's per-core requirements.
- Ensures at least one core is utilized if viable, rather than failing the troubleshooting step.
- Prevents recalculating/inflating total memory when it is already constrained by a cap.
- Adjusted the conservative memory buffer added during total memory estimation from 5 GB to 3 GB.
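The core-reduction arithmetic described above might look roughly like this (the function name, argument names, and exact buffer handling are assumptions for illustration, not the real arc/job/trsh.py code):

```python
import math

def reduce_cores_for_memory_cap(total_mem_gb, per_core_mem_gb, current_cores):
    """Estimate the largest core count whose per-core memory requirement
    still fits under a capped total job memory.

    Hypothetical sketch of the strategy described above; the 3 GB
    conservative buffer mirrors the value mentioned in the commit message.
    Returns None when not even a single core fits the cap.
    """
    usable_gb = total_mem_gb - 3  # conservative buffer (reduced from 5 GB)
    feasible = math.floor(usable_gb / per_core_mem_gb)
    if feasible < 1:
        return None  # no viable configuration under this cap
    # Never exceed the current allocation; keep at least one core.
    return max(min(current_cores, feasible), 1)
```

For example, with a 32 GB node cap, a 4 GB per-core requirement, and 16 requested cores, the sketch yields 7 cores, keeping the total memory at its capped value instead of inflating it further.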

- Make the general memory error check in the scheduler case-insensitive and restricted to cases where 'memory' is the sole error keyword.
- Add unit tests for ORCA memory troubleshooting to verify that CPU cores are reduced when the total memory limit is hit, while total memory is increased when no cap is present.


When a pipe task fails with an electronic structure software (ESS) error, the coordinator now attempts to read the `parser_summary` from the worker's result file. If found, the task is ejected to the scheduler via `troubleshoot_ess` rather than blindly resubmitted. This allows the scheduler to apply intelligent troubleshooting logic, such as adjusting memory or CPU cores, based on the specific failure mode encountered during the pipe run.

Additionally, resource requirements (CPU and memory) are now explicitly passed when ejecting tasks to ensure consistency between the pipe environment and the scheduler.
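A rough sketch of this routing, assuming a hypothetical `eject_failed_task` helper and simplified task/scheduler interfaces (the real pipe_coordinator.py and Scheduler APIs differ):

```python
import json
from pathlib import Path

def eject_failed_task(task, scheduler):
    """Route a FAILED_ESS pipe task: targeted troubleshooting when a
    parser_summary is available, otherwise a plain re-run.

    Sketch only; the task attributes and scheduler keyword arguments
    here are illustrative assumptions, not ARC's exact interfaces.
    """
    result_file = Path(task.local_path) / 'result.json'
    parser_summary = None
    if result_file.is_file():
        with open(result_file) as f:
            parser_summary = json.load(f).get('parser_summary')
    if parser_summary is not None:
        # Detailed parse info available: apply intelligent troubleshooting.
        scheduler.troubleshoot_ess(label=task.label,
                                   parser_summary=parser_summary)
    else:
        # Fall back to a re-run, preserving the original resources so the
        # pipe environment and the scheduler stay consistent.
        scheduler.run_job(label=task.label,
                          job_type=task.job_type,
                          cpu_cores=task.cpu_cores,
                          job_memory_gb=task.memory)
```

Passing `cpu_cores` and `job_memory_gb` explicitly on the fallback path is what keeps the re-run from silently reverting to default resources.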
- Refresh scheduler-side job logs before parsing ESS status to detect server-reported out-of-memory (OOM) errors even when output files are absent or incomplete.
- Tag jobs whose requested memory is clipped to a node's capacity with a `max_total_job_memory` keyword.
- Ensure the capped memory marker is preserved during status updates to allow the troubleshooter to distinguish between hitting a node limit versus a simple insufficient memory request.
- Enable status determination from additional job info (scheduler logs) when the primary output file cannot be found.
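The keyword-preservation step in the list above could be sketched as follows (hypothetical helper; the actual logic sits in arc/job/adapter.py's ESS status parsing):

```python
def merge_status_keywords(existing_keywords, parsed_keywords):
    """Carry the capped-memory marker through ESS status parsing.

    Hypothetical sketch: if the job was tagged max_total_job_memory
    before parsing, keep that tag alongside whatever the parser reports,
    so the troubleshooter can distinguish a node memory cap from a
    plain insufficient-memory request.
    """
    merged = list(parsed_keywords)
    if 'max_total_job_memory' in existing_keywords \
            and 'max_total_job_memory' not in merged:
        merged.append('max_total_job_memory')
    return merged
```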