Skip to content

scheds/experimental: add scx_flow v2.2.0 scheduler#3525

Merged
htejun merged 2 commits intosched-ext:mainfrom
galpt:scx_flow_v2_2_scalability
Apr 15, 2026
Merged

scheds/experimental: add scx_flow v2.2.0 scheduler#3525
htejun merged 2 commits intosched-ext:mainfrom
galpt:scx_flow_v2_2_scalability

Conversation

@galpt
Copy link
Copy Markdown
Contributor

@galpt galpt commented Apr 8, 2026

Summary

This is the scx_flow v2 line for upstream review. It supersedes the earlier v1 submission in #3493 rather than trying to extend that branch in place.

Compared to the v1 PR, this branch carries the later v2.0.x -> v2.2.0 scheduler work:

  • broader v2 wake classification and bounded service-lane model
  • hot-path cleanup and more cached task-local wake readiness
  • more local-first dispatch behavior, including per-CPU reserved handling and reduced shared hot-path accounting pressure
  • hidden --completions <SHELL> support so the scheduler already cooperates with the CLI-completions work in scheds: add --completions flag for shell completion generation #3495
  • updated scx_flow documentation and public validation links for the current v2.2.0 release snapshot

What Changed Relative to #3493

The earlier PR introduced the first scx_flow version and drew a few clear requests from review. This branch already folds those expectations in:

  • scx_flow lives under scheds/experimental/scx_flow
  • the workspace entry follows the current scheds/experimental layout and ordering
  • the current line is based on the later v2 scheduler work, not the older v1 implementation
  • the release snapshot and benchmark history are documented publicly in the testing repo instead of only living in local notes

The goal here is not to argue with the original review, but to present the newer line in the shape that review was asking for.

Validation

Local checks:

  • cargo fmt -p scx_flow
  • cargo check -p scx_flow
  • cargo clippy -p scx_flow --no-deps -- -D warnings
  • cargo run -q -p scx_flow -- --completions bash

Benchmarks and public artifacts:

The v2.2.0 line was also checked manually with Aquarium runs outside the noisy browser harness:

  • 20000 fish at roughly 100-120 FPS
  • 30000 fish at roughly 70-80 FPS

Notes

I kept this PR scoped to the scheduler addition plus the small shared-header change it needs. The completions support is intentionally hidden and is there so this branch does not trip over #3495 once that lands.

@galpt galpt marked this pull request as ready for review April 8, 2026 21:13
Copilot AI review requested due to automatic review settings April 8, 2026 21:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the experimental scx_flow scheduler to the v2.2.0 line, expanding it into a multi-lane policy (reserved/latency/urgent-latency/shared/contained) with additional per-CPU shaping state, richer metrics, and a hidden --completions flag for shell completion generation.

Changes:

  • Extend scx_flow’s BPF policy with bounded service lanes (urgent latency, latency, contained) and additional per-task/per-CPU signals and counters.
  • Add new runtime tunables and metrics plumbing (including per-CPU state aggregation in userspace and expanded stats formatting).
  • Integrate hidden --completions <SHELL> support and update docs + workspace/package metadata to v2.2.0.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
scheds/experimental/scx_flow/src/bpf/main.bpf.c Major policy update: new lanes/DSQs, per-CPU state, wake profiling, containment/locality/IPC signals, and many new stats counters.
scheds/experimental/scx_flow/src/bpf/intf.h Adds constants and flow_cpu_state struct used for per-CPU accounting and tuning.
scheds/experimental/scx_flow/src/main.rs Adds --completions, new tunables, and per-CPU map aggregation into exported metrics.
scheds/experimental/scx_flow/src/stats.rs Expands Metrics with many new counters/tunables and updates formatting/delta behavior.
scheds/experimental/scx_flow/README.md Updates scheduler description and adds a v2.2.0 benchmark snapshot link.
scheds/experimental/scx_flow/Cargo.toml Bumps version to 2.2.0 and adds clap_complete dependency.
Cargo.toml Reorders workspace membership to place scx_flow under scx_wd40.
Cargo.lock Updates scx_flow version and adds clap_complete dependency entry.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1330 to +1331
if (containment_active &&
has_wake_profile(tctx, WAKE_PROFILE_LATENCY_ALLOWANCE))
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latency_candidate_hog_blocks is effectively unreachable: containment_active is derived from WAKE_PROFILE_CONTAINMENT_ACTIVE, but WAKE_PROFILE_LATENCY_ALLOWANCE is only set when !containment_active in recompute_wake_profile(). As a result, this counter will never increment and the metric will be misleading. Consider basing this stat on the underlying signals (e.g., tctx->latency_allowance / tctx->latency_pressure) or recording the “would-have-been latency lane candidate” before applying the containment gate.

Suggested change
if (containment_active &&
has_wake_profile(tctx, WAKE_PROFILE_LATENCY_ALLOWANCE))
if (is_wakeup && containment_active && tctx->latency_allowance > 0)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

@galpt galpt Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 55feb31b.

latency_candidate_hog_blocks now uses the underlying signal (tctx->latency_allowance > 0) instead of the post-gate wake-profile bit, so the counter can actually move when latency-lane candidates are blocked by containment.

if (refill_ns >= (s64)FLOW_LATENCY_LANE_REFILL_MIN_NS &&
tctx->budget_ns >= (s64)FLOW_LATENCY_LANE_BUDGET_MIN_NS)
raise_latency_allowance(tctx, tuned_latency_credit_grant());
FLOW_CPUSTAT_INC(lookup_cpu_state(), budget_refill_events);
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation implies FLOW_CPUSTAT_INC(..., budget_refill_events) is part of the preceding if (refill_ns >= ... && ...), but without braces it actually executes for every refill_ns > 0. This is easy to misread and could lead to accidental behavior changes later. Please either add braces or adjust indentation/commenting to make the intended scope explicit.

Suggested change
FLOW_CPUSTAT_INC(lookup_cpu_state(), budget_refill_events);
/* Count all positive budget refill events. */
FLOW_CPUSTAT_INC(lookup_cpu_state(), budget_refill_events);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

@galpt galpt Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 55feb31b.

The intended scope is now explicit: budget_refill_events still counts all positive refill events, and the indentation/commenting no longer makes it look tied to the latency-lane grant condition.

@galpt galpt force-pushed the scx_flow_v2_2_scalability branch 2 times, most recently from ee8c0ef to 7c0311a Compare April 10, 2026 11:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

let exhaustion_ratio = exhaustions as f64 / positive.max(1) as f64;
let contained_ratio = contained_enqueues as f64 / positive.max(1) as f64;
let direct_reject_ratio = direct_rejections as f64 / direct_candidates.max(1) as f64;
let direct_mismatch_ratio = direct_mismatches as f64 / cpu_biases.max(1) as f64;
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

direct_mismatch_ratio divides direct_local_mismatches by cpu_stability_biases.max(1), but direct_local_mismatches is incremented for any wakeup where wake_cpu != last_cpu (even when no stability bias was applied). This can produce inflated ratios (e.g., cpu_stability_biases == 0 but mismatches > 0) and may incorrectly drive the auto-tuner into throughput mode. Consider using a denominator that represents “attempts” (e.g., last_cpu_matches + direct_local_mismatches) or only counting mismatches when a bias attempt was made.

Suggested change
let direct_mismatch_ratio = direct_mismatches as f64 / cpu_biases.max(1) as f64;
let direct_mismatch_attempts = cpu_biases.saturating_add(direct_mismatches);
let direct_mismatch_ratio =
direct_mismatches as f64 / direct_mismatch_attempts.max(1) as f64;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both issues are now fixed in commit 45d0278d:

  1. direct_mismatch_ratio: Uses as the denominator now, which properly represents all attempts (biases applied + mismatches that occurred even without bias available).

  2. target_cpu >= 0 guard: Added validation before cpumask operations to handle the case where scx_bpf_task_cpu() returns -1.

if (is_wakeup && !containment_active)
enq_flags |= SCX_ENQ_HEAD;

if (has_wake_target || bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr)) {
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

target_cpu is set from scx_bpf_task_cpu(p) when there is no valid wake target, and the scxtest override returns -1 for this helper. The condition bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr) (and subsequent uses like SCX_DSQ_LOCAL_ON | target_cpu / scx_bpf_kick_cpu(target_cpu, ...)) can therefore receive a negative CPU value, which is undefined for cpumask helpers and can break tests or verifier assumptions. Please guard with valid_sched_cpu(target_cpu) / target_cpu >= 0 before calling cpumask helpers or building DSQ IDs from it.

Suggested change
if (has_wake_target || bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr)) {
bool target_cpu_valid = valid_sched_cpu(target_cpu);
if (has_wake_target ||
(target_cpu_valid &&
bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr))) {

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 45d0278d. Added target_cpu >= 0 guard before cpumask operations to handle -1 return value from scx_bpf_task_cpu().

galpt added a commit to galpt/scx that referenced this pull request Apr 12, 2026
Address two remaining issues from Copilot review PR sched-ext#3525:

1. main.rs: Fix direct_mismatch_ratio denominator
   The original code used cpu_biases as the denominator, but
   direct_local_mismatches can increment even when no bias was
   applied (when no stability bias was available). This could
   cause inflated ratios and incorrect autotuner decisions.

   Now uses cpu_biases.saturating_add(direct_mismatches) as
   the denominator to properly represent all attempts.

2. main.bpf.c: Add target_cpu >= 0 guard
   scx_bpf_task_cpu(p) can return -1 in test scenarios.
   Passing a negative CPU to bpf_cpumask_test_cpu() or using
   SCX_DSQ_LOCAL_ON | target_cpu is undefined behavior.

   Now validates target_cpu before cpumask operations.

Fixes: sched-ext#3525 (comment)
Fixes: sched-ext#3525 (comment)
galpt added a commit to galpt/scx that referenced this pull request Apr 12, 2026
Address BPF verifier failures in veristat-dev (scx__bpf__bpf-next).

The original fix for Copilot's review added target_cpu >= 0 guard at
line 1365, but target_cpu was still used unconditionally in several
places after that guard. This could cause verifier issues on
development kernels where the verifier is more strict.

Changes:
1. Guard scx_bpf_kick_cpu(target_cpu, SCX_KICK_PREEMPT) with
   valid_sched_cpu(target_cpu)
2. Guard scx_bpf_kick_cpu(target_cpu, SCX_KICK_IDLE) with
   valid_sched_cpu(target_cpu)
3. Guard the use_local_reserved block with
   valid_sched_cpu(target_cpu)

These changes ensure target_cpu is validated before any DSQ insert
or kick operations, not just before the cpumask test.

Fixes: sched-ext#3525
@galpt galpt force-pushed the scx_flow_v2_2_scalability branch from 745e5e2 to 121a75f Compare April 12, 2026 14:36
galpt added a commit to galpt/scx that referenced this pull request Apr 12, 2026
Address two remaining issues from Copilot review PR sched-ext#3525:

1. main.rs: Fix direct_mismatch_ratio denominator
   The original code used cpu_biases as the denominator, but
   direct_local_mismatches can increment even when no bias was
   applied (when no stability bias was available). This could
   cause inflated ratios and incorrect autotuner decisions.

   Now uses cpu_biases.saturating_add(direct_mismatches) as
   the denominator to properly represent all attempts.

2. main.bpf.c: Add target_cpu >= 0 guard
   scx_bpf_task_cpu(p) can return -1 in test scenarios.
   Passing a negative CPU to bpf_cpumask_test_cpu() or using
   SCX_DSQ_LOCAL_ON | target_cpu is undefined behavior.

   Now validates target_cpu before cpumask operations.

Fixes: sched-ext#3525 (comment)
Fixes: sched-ext#3525 (comment)
@galpt galpt force-pushed the scx_flow_v2_2_scalability branch 3 times, most recently from 7e012cd to 282dfdd Compare April 12, 2026 15:33
This adds the scx_flow v2.2.0 scheduler to sched-ext/scx. scx_flow is a
multi-lane scheduling policy designed for real-time and latency-sensitive
workloads with support for reserved, latency, urgent-latency, shared,
and contained execution lanes.

Key features:
- Bounded service lanes with priority-based scheduling
- Per-task and per-CPU signals and counters
- Rich metrics and statistics for observability
- Automatic tuning based on workload characteristics

Co-authored-by: Galih Tama <galpt@v.recipes>
Signed-off-by: Galih Tama <galpt@v.recipes>
@galpt galpt force-pushed the scx_flow_v2_2_scalability branch from 282dfdd to 0000386 Compare April 12, 2026 15:35
@galpt galpt marked this pull request as draft April 12, 2026 15:43
The Cargo.lock was not updated when scx_flow was upgraded to v2.2.0
and clap_complete was added as a dependency. This caused CI build
failures with 'cannot update the lock file because --locked was passed'.

Fixes CI build failure by synchronizing Cargo.lock with the current
scx_flow v2.2.0 package metadata.
@galpt galpt marked this pull request as ready for review April 12, 2026 16:12
@htejun htejun enabled auto-merge April 15, 2026 17:54
@htejun htejun added this pull request to the merge queue Apr 15, 2026
Merged via the queue into sched-ext:main with commit 8e1cc4b Apr 15, 2026
35 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants