
Enhances ORCA memory troubleshooting and pipe task handling#874

Open
calvinp0 wants to merge 4 commits into main from orca_mem_node

Conversation

@calvinp0
Member

Improves memory-error detection, refines troubleshooting for jobs constrained by node memory limits, and enhances how failed pipe tasks are handled.

  • Memory Error Detection and Troubleshooting:

    • Improves the detection of server-reported Out-Of-Memory (OOM) errors by ensuring cluster-side logs (additional_job_info) are refreshed before parsing Electronic Structure Software (ESS) output, allowing OOMs to be identified even if the ESS output file is incomplete or missing.
    • Ensures the max_total_job_memory keyword, indicating a job's memory was capped by node limits, is preserved during ESS status parsing.
    • Refines ORCA troubleshooting logic:
      • When a job has the max_total_job_memory keyword, troubleshooting prioritizes reducing CPU cores while keeping the total job memory at its capped value, preventing further unnecessary memory increases.
      • Includes more robust logic for reducing CPU cores to avoid invalid configurations (e.g., zero or negative cores).
    • Clarifies conditions for the Scheduler to automatically triple job memory for generic memory errors, applying this only when a singular 'memory' keyword is present.
  • Pipe Task Failure Handling:

    • Enables immediate, intelligent troubleshooting for FAILED_ESS pipe tasks. If a failed task's result includes a parser_summary, ARC now directly uses this detailed information to troubleshoot the job via Scheduler.troubleshoot_ess.
    • This intelligent routing bypasses a blind re-run, allowing for more targeted adjustments to job parameters (e.g., CPU, memory) based on the detailed parsing output.
    • Original job resources (CPU cores, memory) are correctly propagated when ejecting tasks to the scheduler for re-run or troubleshooting.
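The tightened "triple memory" condition above can be sketched as follows (the helper name is hypothetical; the actual check lives in arc/scheduler.py and may differ in detail):

```python
def should_triple_memory(keywords):
    """Return True only when 'memory' (case-insensitive) is the sole
    error keyword reported for the job.

    Hypothetical helper illustrating the condition described above;
    the real arc/scheduler.py implementation may differ.
    """
    return len(keywords) == 1 and keywords[0].lower() == 'memory'
```

With this guard, a job tagged with both `memory` and `max_total_job_memory` is routed to the capped-memory troubleshooting path instead of a blind memory increase.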


Copilot AI left a comment


Pull request overview

Improves ARC’s handling of memory-related ESS failures (especially ORCA and node memory caps) and enhances pipe-mode behavior for FAILED_ESS tasks by routing them into targeted ESS troubleshooting when parser details are available.

Changes:

  • Refreshes and leverages scheduler-side job logs for improved OOM detection and preserves max_total_job_memory across ESS status parsing.
  • Refines ORCA troubleshooting to reduce CPU cores when total memory is capped, and tightens generic “memory” error handling in the scheduler.
  • Updates pipe ingestion so FAILED_ESS tasks with a parser_summary are immediately troubleshot (with original resources preserved), and adds/updates related tests.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

  • arc/scheduler_test.py: Adds unit tests for ORCA troubleshooting behavior when memory is capped vs. uncapped.
  • arc/scheduler.py: Tightens auto "triple memory" behavior to apply only to the singular memory keyword (case-insensitive).
  • arc/job/trsh_test.py: Updates/extends troubleshooting tests for ORCA memory/CPU edge cases and expected memory values.
  • arc/job/trsh.py: Implements capped-memory-aware ORCA troubleshooting by reducing CPU cores while preserving total memory.
  • arc/job/pipe/pipe_coordinator_test.py: Adds tests ensuring FAILED_ESS tasks use parser_summary for immediate troubleshooting and preserve resources on fallback rerun.
  • arc/job/pipe/pipe_coordinator.py: Adds result.json parser_summary ingestion and routes failed tasks either to troubleshoot_ess or run_job with preserved resources.
  • arc/job/adapter_test.py: Adds tests for tagging capped-memory jobs and preserving max_total_job_memory through ESS parsing; tests job-log-only OOM detection.
  • arc/job/adapter.py: Refreshes additional_job_info before ESS parsing and preserves max_total_job_memory when parsing indicates a Memory error.


Comment thread arc/job/adapter.py
existing_keywords = list(self.job_status[1].get('keywords', list()))
# Refresh scheduler-side logs before ESS parsing so server-reported OOMs
# can be detected even when the output file is absent or incomplete.
self._get_additional_job_info()

Copilot AI Apr 19, 2026


_check_job_ess_status() now calls _get_additional_job_info() unconditionally, but determine_job_status() already calls _get_additional_job_info() immediately after _check_job_ess_status() (arc/job/adapter.py:708). This results in duplicated log downloads/reads for every completed job status check. Consider removing the later call, or adding a guard here (e.g., only refresh if additional_job_info is None/stale) to avoid the extra I/O.

Suggested change:

    - self._get_additional_job_info()
    + # Avoid re-fetching logs if additional job information was already collected
    + # earlier in the status-check flow.
    + if not getattr(self, 'additional_job_info', None):
    +     self._get_additional_job_info()

from arc.job.factory import job_factory
from arc.job.pipe.pipe_run import PipeRun, ingest_completed_task
from arc.job.pipe.pipe_state import (
TASK_FAMILY_TO_JOB_TYPE, PipeRunState, TaskState, TaskSpec,

Copilot AI Apr 19, 2026


TASK_FAMILY_TO_JOB_TYPE is imported but does not appear to be used anywhere in this module anymore. Please remove the unused import to avoid lint/CI failures and keep the module tidy.

Suggested change:

    - TASK_FAMILY_TO_JOB_TYPE, PipeRunState, TaskState, TaskSpec,
    + PipeRunState, TaskState, TaskSpec,

calvinp0 force-pushed the orca_mem_node branch 2 times, most recently from 0985c01 to 3aea269 (April 20, 2026 06:27)
When an ORCA job fails due to insufficient memory while hitting a total memory limit, ARC now attempts to resolve the issue by reducing the number of CPU cores, increasing the memory available per core.

- Added logic to calculate the maximum feasible number of CPU cores that fits within the total memory cap while meeting ORCA's per-core requirements.
- Ensures at least one core is utilized if viable, rather than failing the troubleshooting step.
- Prevents recalculating/inflating total memory when it is already constrained by a cap.
- Adjusted the conservative memory buffer added during total memory estimation from 5 GB to 3 GB.
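The core-reduction arithmetic described above might look roughly like this (the function name, argument names, and exact buffer handling are assumptions for illustration, not the real arc/job/trsh.py code):

```python
import math

def reduce_cores_for_memory_cap(total_mem_gb, per_core_mem_gb, current_cores):
    """Estimate the largest core count whose per-core memory requirement
    still fits under a capped total job memory.

    Hypothetical sketch of the strategy described above; the 3 GB
    conservative buffer mirrors the value mentioned in the commit message.
    Returns None when not even a single core fits the cap.
    """
    usable_gb = total_mem_gb - 3  # conservative buffer (reduced from 5 GB)
    feasible = math.floor(usable_gb / per_core_mem_gb)
    if feasible < 1:
        return None  # no viable configuration under this cap
    # Never exceed the current allocation; keep at least one core.
    return max(min(current_cores, feasible), 1)
```

For example, with a 32 GB node cap, a 4 GB per-core requirement, and 16 requested cores, the sketch yields 7 cores, keeping the total memory at its capped value instead of inflating it further.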

- Make the general memory error check in the scheduler case-insensitive and restricted to cases where 'memory' is the sole error keyword.
- Add unit tests for ORCA memory troubleshooting to verify that CPU cores are reduced when the total memory limit is hit, while total memory is increased when no cap is present.


When a pipe task fails with an electronic structure software (ESS) error, the coordinator now attempts to read the `parser_summary` from the worker's result file. If found, the task is ejected to the scheduler via `troubleshoot_ess` rather than blindly resubmitted. This allows the scheduler to apply intelligent troubleshooting logic, such as adjusting memory or CPU cores, based on the specific failure mode encountered during the pipe run.

Additionally, resource requirements (CPU and memory) are now explicitly passed when ejecting tasks to ensure consistency between the pipe environment and the scheduler.
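A rough sketch of this routing, assuming a hypothetical `eject_failed_task` helper and simplified task/scheduler interfaces (the real pipe_coordinator.py and Scheduler APIs differ):

```python
import json
from pathlib import Path

def eject_failed_task(task, scheduler):
    """Route a FAILED_ESS pipe task: targeted troubleshooting when a
    parser_summary is available, otherwise a plain re-run.

    Sketch only; the task attributes and scheduler keyword arguments
    here are illustrative assumptions, not ARC's exact interfaces.
    """
    result_file = Path(task.local_path) / 'result.json'
    parser_summary = None
    if result_file.is_file():
        with open(result_file) as f:
            parser_summary = json.load(f).get('parser_summary')
    if parser_summary is not None:
        # Detailed parse info available: apply intelligent troubleshooting.
        scheduler.troubleshoot_ess(label=task.label,
                                   parser_summary=parser_summary)
    else:
        # Fall back to a re-run, preserving the original resources so the
        # pipe environment and the scheduler stay consistent.
        scheduler.run_job(label=task.label,
                          job_type=task.job_type,
                          cpu_cores=task.cpu_cores,
                          job_memory_gb=task.memory)
```

Passing `cpu_cores` and `job_memory_gb` explicitly on the fallback path is what keeps the re-run from silently reverting to default resources.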
- Refresh scheduler-side job logs before parsing ESS status to detect server-reported out-of-memory (OOM) errors even when output files are absent or incomplete.
- Tag jobs whose requested memory is clipped to a node's capacity with a `max_total_job_memory` keyword.
- Ensure the capped memory marker is preserved during status updates to allow the troubleshooter to distinguish between hitting a node limit versus a simple insufficient memory request.
- Enable status determination from additional job info (scheduler logs) when the primary output file cannot be found.
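The keyword-preservation step in the list above could be sketched as follows (hypothetical helper; the actual logic sits in arc/job/adapter.py's ESS status parsing):

```python
def merge_status_keywords(existing_keywords, parsed_keywords):
    """Carry the capped-memory marker through ESS status parsing.

    Hypothetical sketch: if the job was tagged max_total_job_memory
    before parsing, keep that tag alongside whatever the parser reports,
    so the troubleshooter can distinguish a node memory cap from a
    plain insufficient-memory request.
    """
    merged = list(parsed_keywords)
    if 'max_total_job_memory' in existing_keywords \
            and 'max_total_job_memory' not in merged:
        merged.append('max_total_job_memory')
    return merged
```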