Galloc issue #875

Open
calvinp0 wants to merge 4 commits into main from galloc_issue

Conversation

@calvinp0 (Member)

This pull request refactors how ARC handles memory allocation for jobs, standardizing all internal memory calculations to use base-2 (MiB/GiB) units and ensuring consistent, conservative memory requests across different cluster schedulers and Gaussian input files. It introduces explicit memory overhead factors and reserves headroom for non-Gaussian overhead, preventing over-allocation and improving stability, especially on capped nodes. The changes also update tests to reflect the new memory calculations.

Memory calculation and handling improvements:

  • Standardized all internal memory math to use base-2 units (GiB/MiB), introducing constants like MEMORY_GB_TO_MIB, DEFAULT_JOB_MEMORY_OVERHEAD, and CAPPED_JOB_MEMORY_OVERHEAD for clarity and consistency.
  • Refactored set_cpu_and_mem to always compute and store submit script memory in MiB, applying the appropriate overhead factor, and updated memory handling logic for different schedulers (SGE/OGE, PBS, Slurm) to use these MiB values.
  • Added submit_script_memory_mib attribute throughout the codebase for clarity and updated initialization accordingly.
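The unit standardization above can be sketched as follows. This is an illustrative sketch only: the constant names are taken from the PR description, but the helper function, its signature, and the specific overhead values (10% default, 5% capped, per the description below) are assumptions, not ARC's actual implementation.

```python
# Base-2 conversion: 1 GiB = 1024 MiB (constant names from the PR description).
MEMORY_GB_TO_MIB = 1024
DEFAULT_JOB_MEMORY_OVERHEAD = 1.10  # 10% overhead on ordinary nodes (assumed value)
CAPPED_JOB_MEMORY_OVERHEAD = 1.05   # 5% overhead on memory-capped nodes (assumed value)

def submit_script_memory_mib(job_memory_gb: float, capped: bool = False) -> int:
    """Hypothetical helper: convert a job's memory request (GiB) to the
    integer MiB value written into the submit script, including overhead."""
    overhead = CAPPED_JOB_MEMORY_OVERHEAD if capped else DEFAULT_JOB_MEMORY_OVERHEAD
    return int(job_memory_gb * MEMORY_GB_TO_MIB * overhead)
```

For example, a 14 GiB request becomes 15769 MiB on an ordinary node and 15052 MiB on a capped one, so the scheduler request always carries the overhead in base-2 units.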

Gaussian-specific memory allocation:

  • Added a fixed GAUSSIAN_MEMORY_HEADROOM_FRACTION (set to 0.90) so Gaussian jobs only consume a portion of the allocated memory, leaving room for scheduler and runtime overhead. Updated set_input_file_memory to use this fraction when setting %mem in Gaussian input files.
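The headroom rule can be sketched like so. The constant name and the 0.90 value come from the PR description; the helper function and the exact `%mem` formatting are hypothetical, shown only to make the arithmetic concrete.

```python
# Gaussian may use 90% of the submit-script allocation (value from the PR).
GAUSSIAN_MEMORY_HEADROOM_FRACTION = 0.90

def gaussian_mem_directive(submit_script_memory_mib: int) -> str:
    """Hypothetical helper: derive the Gaussian %mem line from the
    submit-script allocation, leaving 10% for non-Gaussian overhead."""
    mem_mib = int(submit_script_memory_mib * GAUSSIAN_MEMORY_HEADROOM_FRACTION)
    return f"%mem={mem_mib}mb"
```

So a 15769 MiB submit-script allocation yields `%mem=14192mb`, and the ~1.5 GiB difference absorbs scheduler and runtime allocations made outside Gaussian's explicit budget.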

Test updates:

  • Updated expected memory values and added coverage in adapter_test.py, gaussian_test.py, and trsh_test.py to reflect the new MiB-based calculations and headroom behavior.

Gaussian's `%mem` directive should not consume the entire scheduler allocation. ARC now sets `%mem` to a fixed fraction (90%) of the submit-script memory, reserving the remaining 10% for non-Gaussian overhead, such as the scheduler, scratch bookkeeping, and Gaussian allocations that occur outside the explicit memory budget.

This prevents job failures on capped nodes where the physical memory is slightly less than the requested amount and avoids triggering Gaussian allocation (galloc) errors or scheduler kills.
Replaces the hardcoded 95% memory allocation cap with a configurable value from default_job_settings.

Additionally, ensures troubleshooting terminates with a clear error message if a job requires more memory than the configured node-level cap allows. This prevents potential infinite resubmission loops when memory is the limiting factor for Gaussian or other ESS jobs.
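The loop-prevention logic can be illustrated as follows. The function name, the doubling growth policy, and the error wording are all assumptions for the sketch; the PR only specifies that troubleshooting must terminate with a clear error once the configured node-level cap is reached.

```python
def trsh_memory_mib(requested_mib: int, node_cap_mib: int) -> int:
    """Hypothetical troubleshooting step: grow the memory request, but fail
    loudly once the node-level cap is reached instead of resubmitting forever."""
    if requested_mib >= node_cap_mib:
        raise ValueError(
            f"Cannot troubleshoot further: requested memory ({requested_mib} MiB) "
            f"already meets the node cap ({node_cap_mib} MiB)."
        )
    # Grow the request (doubling is an assumed policy), but never exceed the cap.
    return min(requested_mib * 2, node_cap_mib)
```

The key property is the explicit raise: without it, a memory-limited Gaussian job would be resubmitted with the same (or a silently clamped) request indefinitely.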
Centralizes memory unit conversions using explicit MiB/GiB constants and clarifies ARC's use of base-2 units internally.

This update ensures consistent application of memory overhead (10% default, or 5% when capped) across different schedulers, and fixes an inconsistency in PBS memory requests: MiB values are now passed directly to the template rather than being converted to bytes.
Adds the `submit_script_memory_mib` attribute to the common job adapter and clarifies via comments in PBS submission templates that ARC passes memory values as base-2 integers (MiB). This ensures the internal standardization to MiB is correctly reflected in the adapter state and documented for scheduler directives.
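The per-scheduler handling can be sketched as below. The directive strings are typical examples of each scheduler's syntax, not ARC's actual submit templates, and the function is hypothetical; the point is that every branch now consumes the same base-2 MiB integer.

```python
def scheduler_memory_directive(scheduler: str, memory_mib: int, cpus: int = 1) -> str:
    """Hypothetical helper: emit a scheduler memory directive from a MiB value.
    PBS now receives MiB directly (no byte conversion)."""
    if scheduler == "pbs":
        # PBS: total job memory; the `mb` suffix is read as base-2 on these systems.
        return f"#PBS -l mem={memory_mib}mb"
    if scheduler == "slurm":
        # Slurm: a bare number for --mem-per-cpu is interpreted as MiB.
        return f"#SBATCH --mem-per-cpu={memory_mib // cpus}"
    if scheduler in ("sge", "oge"):
        # SGE/OGE: per-slot virtual memory limit.
        return f"#$ -l h_vmem={memory_mib // cpus}M"
    raise ValueError(f"Unknown scheduler: {scheduler}")
```

Keeping one MiB integer as the single source of truth is what removes the old PBS-specific byte conversion and makes the overhead factors comparable across schedulers.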

Copilot AI left a comment


Pull request overview

Refactors ARC’s job memory handling to use consistent base-2 units (MiB/GiB), apply explicit overhead factors, and reserve headroom for Gaussian %mem to avoid galloc/memory failures on capped nodes.

Changes:

  • Standardize internal submit-script memory calculations to MiB with explicit overhead factors and store submit_script_memory_mib.
  • Apply a fixed headroom fraction when setting Gaussian %mem, using submit-script memory as the budget source.
  • Update troubleshooting and unit tests to reflect the new memory cap/headroom behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| arc/settings/submit.py | Adds clarifying comments around scheduler memory units/templates. |
| arc/job/adapter.py | Introduces MiB conversion/overhead constants and refactors scheduler memory setting logic. |
| arc/job/adapters/common.py | Initializes the new submit_script_memory_mib attribute. |
| arc/job/adapters/gaussian.py | Sets Gaussian %mem from submit-script memory with reserved headroom. |
| arc/job/trsh.py | Uses the configured node-memory cap when attempting Gaussian memory troubleshooting; errors when capped. |
| arc/job/adapter_test.py | Updates expected scheduler memory values to match the new MiB logic. |
| arc/job/adapters/gaussian_test.py | Updates expected %mem and asserts headroom / new MiB attribute behavior. |
| arc/job/trsh_test.py | Adds coverage for the "already at node-memory cap" Gaussian troubleshooting path. |


Comment thread arc/settings/submit.py
Comment on lines +111 to +113
# ARC passes a base-2-derived MiB integer into this sample PBS
# template. On our PBS systems, the `mb` suffix is interpreted in
# base-2, so the directive stays `mem={memory}mb`.

Copilot AI Apr 19, 2026


The comment above this submit script says it's a "sample PBS template" and refers to a mem={memory}mb PBS directive, but the actual script below is a Slurm #SBATCH template using --mem-per-cpu={memory}. Please update/remove the comment so it matches the scheduler directives in this block (or move it to the actual PBS template section).

Suggested change
# ARC passes a base-2-derived MiB integer into this sample PBS
# template. On our PBS systems, the `mb` suffix is interpreted in
# base-2, so the directive stays `mem={memory}mb`.
# ARC passes a base-2-derived MiB integer into this sample Slurm
# template. The memory request is emitted via the Slurm directive
# `#SBATCH --mem-per-cpu={memory}`.

@calvinp0 calvinp0 removed the request for review from alongd April 19, 2026 14:35
@codecov

codecov bot commented Apr 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.39%. Comparing base (70dab25) to head (b0459fd).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #875      +/-   ##
==========================================
- Coverage   60.47%   60.39%   -0.08%     
==========================================
  Files         102      102              
  Lines       31102    31113      +11     
  Branches     8104     8104              
==========================================
- Hits        18808    18792      -16     
- Misses       9957     9975      +18     
- Partials     2337     2346       +9     
| Flag | Coverage | Δ |
| --- | --- | --- |
| functionaltests | 60.39% <ø> | -0.08% ⬇️ |
| unittests | 60.39% <ø> | -0.08% ⬇️ |

Flags with carried forward coverage won't be shown.


Labels

Module: trsh Troubleshooting

2 participants