scheds/experimental: add scx_flow v2.2.0 scheduler #3525
htejun merged 2 commits into sched-ext:main from
Conversation
Pull request overview
This PR updates the experimental scx_flow scheduler to the v2.2.0 line, expanding it into a multi-lane policy (reserved/latency/urgent-latency/shared/contained) with additional per-CPU shaping state, richer metrics, and a hidden --completions flag for shell completion generation.
Changes:
- Extend scx_flow's BPF policy with bounded service lanes (urgent latency, latency, contained) and additional per-task/per-CPU signals and counters.
- Add new runtime tunables and metrics plumbing (including per-CPU state aggregation in userspace and expanded stats formatting).
- Integrate hidden --completions <SHELL> support and update docs + workspace/package metadata to v2.2.0.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scheds/experimental/scx_flow/src/bpf/main.bpf.c | Major policy update: new lanes/DSQs, per-CPU state, wake profiling, containment/locality/IPC signals, and many new stats counters. |
| scheds/experimental/scx_flow/src/bpf/intf.h | Adds constants and flow_cpu_state struct used for per-CPU accounting and tuning. |
| scheds/experimental/scx_flow/src/main.rs | Adds --completions, new tunables, and per-CPU map aggregation into exported metrics. |
| scheds/experimental/scx_flow/src/stats.rs | Expands Metrics with many new counters/tunables and updates formatting/delta behavior. |
| scheds/experimental/scx_flow/README.md | Updates scheduler description and adds a v2.2.0 benchmark snapshot link. |
| scheds/experimental/scx_flow/Cargo.toml | Bumps version to 2.2.0 and adds clap_complete dependency. |
| Cargo.toml | Reorders workspace membership to place scx_flow under scx_wd40. |
| Cargo.lock | Updates scx_flow version and adds clap_complete dependency entry. |
```c
if (containment_active &&
    has_wake_profile(tctx, WAKE_PROFILE_LATENCY_ALLOWANCE))
```
latency_candidate_hog_blocks is effectively unreachable: containment_active is derived from WAKE_PROFILE_CONTAINMENT_ACTIVE, but WAKE_PROFILE_LATENCY_ALLOWANCE is only set when !containment_active in recompute_wake_profile(). As a result, this counter will never increment and the metric will be misleading. Consider basing this stat on the underlying signals (e.g., tctx->latency_allowance / tctx->latency_pressure) or recording the “would-have-been latency lane candidate” before applying the containment gate.
Suggested change:

```diff
-if (containment_active &&
-    has_wake_profile(tctx, WAKE_PROFILE_LATENCY_ALLOWANCE))
+if (is_wakeup && containment_active && tctx->latency_allowance > 0)
```
Addressed in 55feb31b.
latency_candidate_hog_blocks now uses the underlying signal (tctx->latency_allowance > 0) instead of the post-gate wake-profile bit, so the counter can actually move when latency-lane candidates are blocked by containment.
```c
if (refill_ns >= (s64)FLOW_LATENCY_LANE_REFILL_MIN_NS &&
    tctx->budget_ns >= (s64)FLOW_LATENCY_LANE_BUDGET_MIN_NS)
	raise_latency_allowance(tctx, tuned_latency_credit_grant());
	FLOW_CPUSTAT_INC(lookup_cpu_state(), budget_refill_events);
```
The indentation implies FLOW_CPUSTAT_INC(..., budget_refill_events) is part of the preceding if (refill_ns >= ... && ...), but without braces it actually executes for every refill_ns > 0. This is easy to misread and could lead to accidental behavior changes later. Please either add braces or adjust indentation/commenting to make the intended scope explicit.
Suggested change:

```diff
-	FLOW_CPUSTAT_INC(lookup_cpu_state(), budget_refill_events);
+/* Count all positive budget refill events. */
+FLOW_CPUSTAT_INC(lookup_cpu_state(), budget_refill_events);
```
Addressed in 55feb31b.
The intended scope is now explicit: budget_refill_events still counts all positive refill events, and the indentation/commenting no longer makes it look tied to the latency-lane grant condition.
Force-pushed from ee8c0ef to 7c0311a
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
```rust
let exhaustion_ratio = exhaustions as f64 / positive.max(1) as f64;
let contained_ratio = contained_enqueues as f64 / positive.max(1) as f64;
let direct_reject_ratio = direct_rejections as f64 / direct_candidates.max(1) as f64;
let direct_mismatch_ratio = direct_mismatches as f64 / cpu_biases.max(1) as f64;
```
direct_mismatch_ratio divides direct_local_mismatches by cpu_stability_biases.max(1), but direct_local_mismatches is incremented for any wakeup where wake_cpu != last_cpu (even when no stability bias was applied). This can produce inflated ratios (e.g., cpu_stability_biases == 0 but mismatches > 0) and may incorrectly drive the auto-tuner into throughput mode. Consider using a denominator that represents “attempts” (e.g., last_cpu_matches + direct_local_mismatches) or only counting mismatches when a bias attempt was made.
Suggested change:

```diff
-let direct_mismatch_ratio = direct_mismatches as f64 / cpu_biases.max(1) as f64;
+let direct_mismatch_attempts = cpu_biases.saturating_add(direct_mismatches);
+let direct_mismatch_ratio =
+    direct_mismatches as f64 / direct_mismatch_attempts.max(1) as f64;
```
Both issues are now fixed in commit 45d0278d:
- direct_mismatch_ratio: Uses cpu_biases.saturating_add(direct_mismatches) as the denominator now, which properly represents all attempts (biases applied + mismatches that occurred even without a bias available).
- target_cpu >= 0 guard: Added validation before cpumask operations to handle the case where scx_bpf_task_cpu() returns -1.
```c
if (is_wakeup && !containment_active)
	enq_flags |= SCX_ENQ_HEAD;

/* ... */

if (has_wake_target || bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr)) {
```
target_cpu is set from scx_bpf_task_cpu(p) when there is no valid wake target, and the scxtest override returns -1 for this helper. The condition bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr) (and subsequent uses like SCX_DSQ_LOCAL_ON | target_cpu / scx_bpf_kick_cpu(target_cpu, ...)) can therefore receive a negative CPU value, which is undefined for cpumask helpers and can break tests or verifier assumptions. Please guard with valid_sched_cpu(target_cpu) / target_cpu >= 0 before calling cpumask helpers or building DSQ IDs from it.
Suggested change:

```diff
-	if (has_wake_target || bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr)) {
+	bool target_cpu_valid = valid_sched_cpu(target_cpu);
+	if (has_wake_target ||
+	    (target_cpu_valid &&
+	     bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr))) {
```
Fixed in 45d0278d. Added target_cpu >= 0 guard before cpumask operations to handle -1 return value from scx_bpf_task_cpu().
Address two remaining issues from Copilot review of PR sched-ext#3525:

1. main.rs: Fix direct_mismatch_ratio denominator

   The original code used cpu_biases as the denominator, but direct_local_mismatches can increment even when no bias was applied (when no stability bias was available). This could cause inflated ratios and incorrect autotuner decisions. Now uses cpu_biases.saturating_add(direct_mismatches) as the denominator to properly represent all attempts.

2. main.bpf.c: Add target_cpu >= 0 guard

   scx_bpf_task_cpu(p) can return -1 in test scenarios. Passing a negative CPU to bpf_cpumask_test_cpu() or using SCX_DSQ_LOCAL_ON | target_cpu is undefined behavior. Now validates target_cpu before cpumask operations.

Fixes: sched-ext#3525 (comment)
Fixes: sched-ext#3525 (comment)
Address BPF verifier failures in veristat-dev (scx__bpf__bpf-next).

The original fix for Copilot's review added a target_cpu >= 0 guard at line 1365, but target_cpu was still used unconditionally in several places after that guard. This could cause verifier issues on development kernels where the verifier is more strict.

Changes:
1. Guard scx_bpf_kick_cpu(target_cpu, SCX_KICK_PREEMPT) with valid_sched_cpu(target_cpu)
2. Guard scx_bpf_kick_cpu(target_cpu, SCX_KICK_IDLE) with valid_sched_cpu(target_cpu)
3. Guard the use_local_reserved block with valid_sched_cpu(target_cpu)

These changes ensure target_cpu is validated before any DSQ insert or kick operations, not just before the cpumask test.

Fixes: sched-ext#3525
Force-pushed from 745e5e2 to 121a75f
Force-pushed from 7e012cd to 282dfdd
This adds the scx_flow v2.2.0 scheduler to sched-ext/scx. scx_flow is a multi-lane scheduling policy designed for real-time and latency-sensitive workloads, with support for reserved, latency, urgent-latency, shared, and contained execution lanes.

Key features:
- Bounded service lanes with priority-based scheduling
- Per-task and per-CPU signals and counters
- Rich metrics and statistics for observability
- Automatic tuning based on workload characteristics

Co-authored-by: Galih Tama <galpt@v.recipes>
Signed-off-by: Galih Tama <galpt@v.recipes>
Force-pushed from 282dfdd to 0000386
The Cargo.lock was not updated when scx_flow was upgraded to v2.2.0 and clap_complete was added as a dependency. This caused CI build failures with 'cannot update the lock file because --locked was passed'. Fixes CI build failure by synchronizing Cargo.lock with the current scx_flow v2.2.0 package metadata.
Summary
This is the scx_flow v2 line for upstream review. It supersedes the earlier v1 submission in #3493 rather than trying to extend that branch in place.

Compared to the v1 PR, this branch carries the later v2.0.x -> v2.2.0 scheduler work:
- the v2 wake classification and bounded service-lane model
- --completions <SHELL> support, so the scheduler already cooperates with the CLI-completions work in "scheds: add --completions flag for shell completion generation" (#3495)
- scx_flow documentation and public validation links for the current v2.2.0 release snapshot

What Changed Relative to #3493
The earlier PR introduced the first scx_flow version and drew a few clear requests from review. This branch already folds those expectations in:
- scx_flow lives under scheds/experimental/scx_flow
- the workspace follows the scheds/experimental layout and ordering
- it carries the v2 scheduler work, not the older v1 implementation

The goal here is not to argue with the original review, but to present the newer line in the shape that review was asking for.
Validation
Local checks:
- cargo fmt -p scx_flow
- cargo check -p scx_flow
- cargo clippy -p scx_flow --no-deps -- -D warnings
- cargo run -q -p scx_flow -- --completions bash

Benchmarks and public artifacts:
- v2.2.0 snapshot: https://github.com/galpt/testing-scx_flow/tree/benchmark-archives/20260409_scx_flow_v2.2.0_release

The v2.2.0 line was also checked manually with Aquarium runs outside the noisy browser harness:
- 20000 fish at roughly 100-120 FPS
- 30000 fish at roughly 70-80 FPS

Notes
I kept this PR scoped to the scheduler addition plus the small shared-header change it needs. The completions support is intentionally hidden and is there so this branch does not trip over #3495 once that lands.