Skip to content

Decouple TS Scheduler Check#864

Draft
calvinp0 wants to merge 1 commit intomainfrom
decouple_sched
Draft

Decouple TS Scheduler Check#864
calvinp0 wants to merge 1 commit intomainfrom
decouple_sched

Conversation

@calvinp0
Copy link
Copy Markdown
Member

@calvinp0 calvinp0 commented Apr 10, 2026

The Bug

When the last conformer job for a TS exhausted troubleshooting, determine_most_likely_ts_conformer was never called — 47 successful conformers were ignored and TS0 was marked "did not converge."

Refactoring

Structural fix — Two-Phase Loop:

  • Phase 1 (simplified): Process individual job completions — end_job, parse, troubleshoot, break. No aggregate pipeline decisions. The fragile inline for/else checks and the conditional troubleshooting break are gone.
  • Phase 2 (new): After all job events, unconditionally check stage transitions for every species. No break can skip it.

Three new methods:

  1. _check_conformer_stage_complete — detects when all conf_opt/conf_sp jobs finished, selects best conformer, spawns opt. Works for both pipe-ejected and regular scheduler jobs.
  2. _check_tsg_stage_complete — detects when all TSG jobs finished, spawns conformer jobs.
  3. _check_species_complete — detects when all jobs for a species finished, calls check_all_done.

Edge cases found:

  • All conformers fail (no energy): Now logs a clear error and sets ts_guesses_exhausted = True so the species properly reports failure instead of silently stalling.
  • Isomorphism troubleshooting exhausted: Now marks conf_opt = True to prevent infinite re-firing of the check every 30 seconds.

All guards are idempotent — Phase 2 runs every iteration but each check fires at most once per species per stage.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors Scheduler.schedule_jobs() to ensure TS conformer-stage completion is evaluated even when the per-job processing loop exits early (e.g., due to troubleshooting/break), preventing successful TS conformers from being ignored.

Changes:

  • Removes inline “all conformer jobs done” / “all TSG jobs done” checks from the per-job event loop.
  • Adds a second-phase per-iteration stage-transition pass via _check_conformer_stage_complete(), _check_tsg_stage_complete(), and _check_species_complete().
  • Adds explicit handling for “all conformers failed (no energy)” and an isomorphism-troubleshooting exhaustion guard.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread arc/scheduler.py
Comment thread arc/scheduler.py
Comment thread arc/scheduler.py
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.09%. Comparing base (c1b0985) to head (dd827e2).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #864   +/-   ##
=======================================
  Coverage   60.08%   60.09%           
=======================================
  Files         102      102           
  Lines       31043    31080   +37     
  Branches     8082     8095   +13     
=======================================
+ Hits        18652    18676   +24     
- Misses      10079    10082    +3     
- Partials     2312     2322   +10     
Flag Coverage Δ
functionaltests 60.09% <ø> (+<0.01%) ⬆️
unittests 60.09% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Refactor the scheduling logic to separate job termination processing from stage transition decisions. By moving conformer, TS guess, and species completion checks into dedicated methods called at the end of the polling loop, the state machine becomes more robust and ensures transitions are evaluated consistently across different execution modes.
Copy link
Copy Markdown
Member

@alongd alongd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @calvinp0! Please see some comments

Comment thread arc/scheduler.py
return
if any('tsg' in j for j in self.running_jobs.get(label, [])):
return
if not self.species_dict[label].is_ts:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intuitively, I'd start with this check (will return from this func more often than the other checks)

Comment thread arc/scheduler.py
del self.running_jobs[label]
for label in list(self.unique_species_labels):
if label in self.output and self.output[label]['convergence'] is False:
continue
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This skips failed species but still runs all three checks for already-converged species. Maybe instead, continue if convergence is not None?

Comment thread arc/scheduler.py
return
if any(label in {t.owner_key for t in p.tasks}
for p in self.active_pipes.values()
if any(t.task_family in ('conf_opt', 'conf_sp', 'ts_opt') for t in p.tasks)):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks ts_opt , but the function is about conformers. If a species has active ts_opt jobs but its conformers are all done, this guard would incorrectly block the conformer stage from completing. Restrict to ('conf_opt', 'conf_sp') only?

Comment thread arc/scheduler.py
"""
if 'tsg' not in self.job_dict.get(label, {}):
return
if any('tsg' in j for j in self.running_jobs.get(label, [])):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If TSG jobs are submitted through pipe mode, the method only checks self.running_jobs for 'tsg' entries but doesn't check self.active_pipes

Comment thread arc/scheduler.py
e is not None for e in self.species_dict[label].conformer_energies)

if not has_successful_conformer:
logger.error(f'All conformer jobs for {label} failed. '
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also mark a failed conformer state in the species or .output in scheduler? (I don't remember but we should have a f;lag for this that should be turned on here)

Comment thread arc/scheduler_test.py
)
return sched, spc.label

def test_check_conformer_stage_complete_spawns_opt_for_ts(self):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test covers the TS path. Add a test for non-TS species?

@calvinp0 calvinp0 marked this pull request as draft April 10, 2026 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants