scheds/experimental: add scx_flow v2.2.0 scheduler by galpt · Pull Request #3525 · sched-ext/scx

galpt · 2026-04-08T21:06:41Z

Summary

This is the scx_flow v2 line for upstream review. It supersedes the earlier v1 submission in #3493 rather than trying to extend that branch in place.

Compared to the v1 PR, this branch carries the later v2.0.x -> v2.2.0 scheduler work:

broader v2 wake classification and bounded service-lane model
hot-path cleanup and more cached task-local wake readiness
more local-first dispatch behavior, including per-CPU reserved handling and reduced shared hot-path accounting pressure
hidden --completions <SHELL> support, so the scheduler already cooperates with the CLI-completions work in #3495 (scheds: add --completions flag for shell completion generation)
updated scx_flow documentation and public validation links for the current v2.2.0 release snapshot

What Changed Relative to #3493

The earlier PR introduced the first scx_flow version and drew a few clear requests from review. This branch already folds those expectations in:

scx_flow lives under scheds/experimental/scx_flow
the workspace entry follows the current scheds/experimental layout and ordering
the current line is based on the later v2 scheduler work, not the older v1 implementation
the release snapshot and benchmark history are documented publicly in the testing repo instead of only living in local notes

The goal here is not to argue with the original review, but to present the newer line in the shape that review was asking for.

Validation

Local checks:

cargo fmt -p scx_flow
cargo check -p scx_flow
cargo clippy -p scx_flow --no-deps -- -D warnings
cargo run -q -p scx_flow -- --completions bash

Benchmarks and public artifacts:

testing repo: https://github.com/galpt/testing-scx_flow
archived v2.2.0 snapshot: https://github.com/galpt/testing-scx_flow/tree/benchmark-archives/20260409_scx_flow_v2.2.0_release

The v2.2.0 line was also checked manually with Aquarium runs outside the noisy browser harness:

20000 fish at roughly 100-120 FPS
30000 fish at roughly 70-80 FPS

Notes

I kept this PR scoped to the scheduler addition plus the small shared-header change it needs. The completions support is intentionally hidden and is there so this branch does not trip over #3495 once that lands.

Copilot

Pull request overview

This PR updates the experimental scx_flow scheduler to the v2.2.0 line, expanding it into a multi-lane policy (reserved/latency/urgent-latency/shared/contained) with additional per-CPU shaping state, richer metrics, and a hidden --completions flag for shell completion generation.

Changes:

Extend scx_flow’s BPF policy with bounded service lanes (urgent latency, latency, contained) and additional per-task/per-CPU signals and counters.
Add new runtime tunables and metrics plumbing (including per-CPU state aggregation in userspace and expanded stats formatting).
Integrate hidden --completions <SHELL> support and update docs + workspace/package metadata to v2.2.0.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

File	Description
scheds/experimental/scx_flow/src/bpf/main.bpf.c	Major policy update: new lanes/DSQs, per-CPU state, wake profiling, containment/locality/IPC signals, and many new stats counters.
scheds/experimental/scx_flow/src/bpf/intf.h	Adds constants and `flow_cpu_state` struct used for per-CPU accounting and tuning.
scheds/experimental/scx_flow/src/main.rs	Adds `--completions`, new tunables, and per-CPU map aggregation into exported metrics.
scheds/experimental/scx_flow/src/stats.rs	Expands `Metrics` with many new counters/tunables and updates formatting/delta behavior.
scheds/experimental/scx_flow/README.md	Updates scheduler description and adds a v2.2.0 benchmark snapshot link.
scheds/experimental/scx_flow/Cargo.toml	Bumps version to 2.2.0 and adds `clap_complete` dependency.
Cargo.toml	Reorders workspace membership to place `scx_flow` under `scx_wd40`.
Cargo.lock	Updates `scx_flow` version and adds `clap_complete` dependency entry.

Copilot · 2026-04-08T21:18:29Z

+		if (containment_active &&
+		    has_wake_profile(tctx, WAKE_PROFILE_LATENCY_ALLOWANCE))


latency_candidate_hog_blocks is effectively unreachable: containment_active is derived from WAKE_PROFILE_CONTAINMENT_ACTIVE, but WAKE_PROFILE_LATENCY_ALLOWANCE is only set when !containment_active in recompute_wake_profile(). As a result, this counter will never increment and the metric will be misleading. Consider basing this stat on the underlying signals (e.g., tctx->latency_allowance / tctx->latency_pressure) or recording the “would-have-been latency lane candidate” before applying the containment gate.

Suggested change

-		if (containment_active &&
-		    has_wake_profile(tctx, WAKE_PROFILE_LATENCY_ALLOWANCE))
+		if (is_wakeup && containment_active && tctx->latency_allowance > 0)

Addressed in 55feb31b.

latency_candidate_hog_blocks now uses the underlying signal (tctx->latency_allowance > 0) instead of the post-gate wake-profile bit, so the counter can actually move when latency-lane candidates are blocked by containment.
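
The reachability problem can be reproduced in a small model. The sketch below is written in Rust for illustration only; the real code is BPF C, and the bit values, the `TaskCtx` layout, and the body of `recompute_wake_profile` are assumptions reconstructed from the review comment rather than copied from the scheduler.

```rust
// Hypothetical bit values; the real constants live in the scheduler's BPF source.
const WAKE_PROFILE_CONTAINMENT_ACTIVE: u32 = 1 << 0;
const WAKE_PROFILE_LATENCY_ALLOWANCE: u32 = 1 << 1;

/// Simplified stand-in for the per-task context.
struct TaskCtx {
    latency_allowance: u32,
    wake_profile: u32,
}

/// Models the gating described in the review: the allowance bit is only
/// published when containment is not active.
fn recompute_wake_profile(tctx: &mut TaskCtx, containment_active: bool) {
    tctx.wake_profile = 0;
    if containment_active {
        tctx.wake_profile |= WAKE_PROFILE_CONTAINMENT_ACTIVE;
    } else if tctx.latency_allowance > 0 {
        tctx.wake_profile |= WAKE_PROFILE_LATENCY_ALLOWANCE;
    }
}

fn has_wake_profile(tctx: &TaskCtx, bit: u32) -> bool {
    (tctx.wake_profile & bit) != 0
}

/// Original predicate: needs both bits, which the gate above never sets together.
fn blocked_old(tctx: &TaskCtx) -> bool {
    has_wake_profile(tctx, WAKE_PROFILE_CONTAINMENT_ACTIVE)
        && has_wake_profile(tctx, WAKE_PROFILE_LATENCY_ALLOWANCE)
}

/// Revised predicate: reads the underlying signal instead of the post-gate bit.
fn blocked_new(tctx: &TaskCtx, is_wakeup: bool) -> bool {
    is_wakeup
        && has_wake_profile(tctx, WAKE_PROFILE_CONTAINMENT_ACTIVE)
        && tctx.latency_allowance > 0
}
```

Under these assumptions `blocked_old` is false for every input, while `blocked_new` is true for a contained task that wakes with a positive allowance, which is the case the counter is meant to record.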

Copilot · 2026-04-08T21:18:30Z

+		if (refill_ns >= (s64)FLOW_LATENCY_LANE_REFILL_MIN_NS &&
+		    tctx->budget_ns >= (s64)FLOW_LATENCY_LANE_BUDGET_MIN_NS)
+			raise_latency_allowance(tctx, tuned_latency_credit_grant());
+			FLOW_CPUSTAT_INC(lookup_cpu_state(), budget_refill_events);


The indentation implies FLOW_CPUSTAT_INC(..., budget_refill_events) is part of the preceding if (refill_ns >= ... && ...), but without braces it actually executes for every refill_ns > 0. This is easy to misread and could lead to accidental behavior changes later. Please either add braces or adjust indentation/commenting to make the intended scope explicit.

Suggested change

-			FLOW_CPUSTAT_INC(lookup_cpu_state(), budget_refill_events);
+		/* Count all positive budget refill events. */
+		FLOW_CPUSTAT_INC(lookup_cpu_state(), budget_refill_events);

Addressed in 55feb31b.

The intended scope is now explicit: budget_refill_events still counts all positive refill events, and the indentation/commenting no longer makes it look tied to the latency-lane grant condition.
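
The intended control flow can be modeled as follows. This is a Rust sketch, not the BPF C code: the threshold values and the `Counters` struct are hypothetical, and `latency_grants` stands in for the call to `raise_latency_allowance()`. Rust requires braces on every `if`, so the scope that was ambiguous in the C hunk is spelled out here.

```rust
// Hypothetical thresholds; the real FLOW_LATENCY_LANE_* values are not shown in this thread.
const REFILL_MIN_NS: i64 = 100_000;
const BUDGET_MIN_NS: i64 = 500_000;

/// Hypothetical counters standing in for the per-CPU stats.
#[derive(Default)]
struct Counters {
    budget_refill_events: u64,
    latency_grants: u64,
}

fn account_refill(refill_ns: i64, budget_ns: i64, c: &mut Counters) {
    if refill_ns > 0 {
        // Only refills that clear both thresholds earn a latency-lane grant.
        if refill_ns >= REFILL_MIN_NS && budget_ns >= BUDGET_MIN_NS {
            c.latency_grants += 1;
        }
        // Count all positive budget refill events, granted or not.
        c.budget_refill_events += 1;
    }
}
```

A small positive refill therefore bumps `budget_refill_events` without a grant, which matches the behavior the reply says was intended all along.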

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Copilot · 2026-04-12T08:22:32Z

        let exhaustion_ratio = exhaustions as f64 / positive.max(1) as f64;
+        let contained_ratio = contained_enqueues as f64 / positive.max(1) as f64;
+        let direct_reject_ratio = direct_rejections as f64 / direct_candidates.max(1) as f64;
+        let direct_mismatch_ratio = direct_mismatches as f64 / cpu_biases.max(1) as f64;


direct_mismatch_ratio divides direct_local_mismatches by cpu_stability_biases.max(1), but direct_local_mismatches is incremented for any wakeup where wake_cpu != last_cpu (even when no stability bias was applied). This can produce inflated ratios (e.g., cpu_stability_biases == 0 but mismatches > 0) and may incorrectly drive the auto-tuner into throughput mode. Consider using a denominator that represents “attempts” (e.g., last_cpu_matches + direct_local_mismatches) or only counting mismatches when a bias attempt was made.

Suggested change

-        let direct_mismatch_ratio = direct_mismatches as f64 / cpu_biases.max(1) as f64;
+        let direct_mismatch_attempts = cpu_biases.saturating_add(direct_mismatches);
+        let direct_mismatch_ratio =
+            direct_mismatches as f64 / direct_mismatch_attempts.max(1) as f64;

Both issues are now fixed in commit 45d0278d:

direct_mismatch_ratio: Now uses cpu_biases.saturating_add(direct_mismatches) as the denominator, which represents all attempts (biases applied plus mismatches that occurred when no bias was available).

target_cpu >= 0 guard: Added validation before cpumask operations to handle the case where scx_bpf_task_cpu() returns -1.
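
To make the effect of the denominator change concrete, here is a minimal standalone sketch of the two formulas. The function names are hypothetical; the expressions follow the original line and the suggested change quoted above.

```rust
/// Original formula: mismatches divided by applied biases only.
fn mismatch_ratio_old(direct_mismatches: u64, cpu_biases: u64) -> f64 {
    direct_mismatches as f64 / cpu_biases.max(1) as f64
}

/// Revised formula: mismatches divided by all attempts (biases plus mismatches).
fn mismatch_ratio_new(direct_mismatches: u64, cpu_biases: u64) -> f64 {
    let attempts = cpu_biases.saturating_add(direct_mismatches);
    direct_mismatches as f64 / attempts.max(1) as f64
}
```

With 40 mismatches and no biases applied, the old formula yields 40.0 while the revised one yields 1.0. Because the numerator is part of the denominator, the revised ratio stays within 0.0 to 1.0, so a threshold in the auto-tuner compares against a bounded value.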

Copilot · 2026-04-12T08:22:32Z

+		if (is_wakeup && !containment_active)
+			enq_flags |= SCX_ENQ_HEAD;
+
+		if (has_wake_target || bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr)) {


target_cpu is set from scx_bpf_task_cpu(p) when there is no valid wake target, and the scxtest override returns -1 for this helper. The condition bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr) (and subsequent uses like SCX_DSQ_LOCAL_ON | target_cpu / scx_bpf_kick_cpu(target_cpu, ...)) can therefore receive a negative CPU value, which is undefined for cpumask helpers and can break tests or verifier assumptions. Please guard with valid_sched_cpu(target_cpu) / target_cpu >= 0 before calling cpumask helpers or building DSQ IDs from it.

Suggested change

-		if (has_wake_target || bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr)) {
+		bool target_cpu_valid = valid_sched_cpu(target_cpu);
+		if (has_wake_target ||
+		    (target_cpu_valid &&
+		     bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr))) {

Fixed in 45d0278d. Added target_cpu >= 0 guard before cpumask operations to handle -1 return value from scx_bpf_task_cpu().
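
The guard pattern can be sketched outside BPF as follows. This is a Rust model under stated assumptions: `checked_cpu` is a hypothetical stand-in for `valid_sched_cpu()` (its real bounds check is not shown in this thread), and the affinity mask is simplified to a `u64` bitmask covering at most 64 CPUs instead of `p->cpus_ptr`.

```rust
/// Hypothetical stand-in for valid_sched_cpu(): rejects negative and
/// out-of-range CPU ids and returns the id as an unsigned value.
fn checked_cpu(cpu: i32, nr_cpus: u32) -> Option<u32> {
    if cpu >= 0 && (cpu as u32) < nr_cpus {
        Some(cpu as u32)
    } else {
        None
    }
}

/// Simplified affinity test on a 64-bit mask (assumption: at most 64 CPUs).
fn affinity_allows(allowed: u64, cpu: u32) -> bool {
    cpu < 64 && ((allowed >> cpu) & 1) == 1
}

/// Models the guarded condition: the mask is only consulted for a validated CPU.
fn can_target_cpu(has_wake_target: bool, target_cpu: i32, allowed: u64, nr_cpus: u32) -> bool {
    has_wake_target
        || checked_cpu(target_cpu, nr_cpus).map_or(false, |cpu| affinity_allows(allowed, cpu))
}
```

Returning the validated id as an `Option` means any later use, such as building a DSQ id or kicking the CPU, has to go through the same check, which is the property the follow-up verifier fix enforces by hand in C.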

Address two remaining issues from Copilot review PR sched-ext#3525:

1. main.rs: Fix direct_mismatch_ratio denominator

The original code used cpu_biases as the denominator, but direct_local_mismatches can increment even when no bias was applied (when no stability bias was available). This could cause inflated ratios and incorrect autotuner decisions. Now uses cpu_biases.saturating_add(direct_mismatches) as the denominator to properly represent all attempts.

2. main.bpf.c: Add target_cpu >= 0 guard

scx_bpf_task_cpu(p) can return -1 in test scenarios. Passing a negative CPU to bpf_cpumask_test_cpu() or using SCX_DSQ_LOCAL_ON | target_cpu is undefined behavior. Now validates target_cpu before cpumask operations.

Fixes: sched-ext#3525 (comment)
Fixes: sched-ext#3525 (comment)

Address BPF verifier failures in veristat-dev (scx__bpf__bpf-next).

The original fix for Copilot's review added target_cpu >= 0 guard at line 1365, but target_cpu was still used unconditionally in several places after that guard. This could cause verifier issues on development kernels where the verifier is more strict.

Changes:

1. Guard scx_bpf_kick_cpu(target_cpu, SCX_KICK_PREEMPT) with valid_sched_cpu(target_cpu)
2. Guard scx_bpf_kick_cpu(target_cpu, SCX_KICK_IDLE) with valid_sched_cpu(target_cpu)
3. Guard the use_local_reserved block with valid_sched_cpu(target_cpu)

These changes ensure target_cpu is validated before any DSQ insert or kick operations, not just before the cpumask test.

Fixes: sched-ext#3525

This adds the scx_flow v2.2.0 scheduler to sched-ext/scx.

scx_flow is a multi-lane scheduling policy designed for real-time and latency-sensitive workloads with support for reserved, latency, urgent-latency, shared, and contained execution lanes.

Key features:

- Bounded service lanes with priority-based scheduling
- Per-task and per-CPU signals and counters
- Rich metrics and statistics for observability
- Automatic tuning based on workload characteristics

Co-authored-by: Galih Tama <galpt@v.recipes>
Signed-off-by: Galih Tama <galpt@v.recipes>

The Cargo.lock was not updated when scx_flow was upgraded to v2.2.0 and clap_complete was added as a dependency. This caused CI build failures with 'cannot update the lock file because --locked was passed'. Fixes CI build failure by synchronizing Cargo.lock with the current scx_flow v2.2.0 package metadata.

galpt marked this pull request as ready for review April 8, 2026 21:13

Copilot AI review requested due to automatic review settings April 8, 2026 21:13

Copilot started reviewing on behalf of galpt April 8, 2026 21:14

Copilot AI reviewed Apr 8, 2026

galpt force-pushed the scx_flow_v2_2_scalability branch 2 times, most recently from ee8c0ef to 7c0311a April 10, 2026 11:43

sirlucjan approved these changes Apr 10, 2026

galpt requested a review from Copilot April 12, 2026 08:17

Copilot started reviewing on behalf of galpt April 12, 2026 08:18

Copilot AI reviewed Apr 12, 2026

galpt force-pushed the scx_flow_v2_2_scalability branch from 745e5e2 to 121a75f April 12, 2026 14:36

galpt force-pushed the scx_flow_v2_2_scalability branch 3 times, most recently from 7e012cd to 282dfdd April 12, 2026 15:33

galpt force-pushed the scx_flow_v2_2_scalability branch from 282dfdd to 0000386 April 12, 2026 15:35

galpt marked this pull request as draft April 12, 2026 15:43

galpt marked this pull request as ready for review April 12, 2026 16:12

htejun enabled auto-merge April 15, 2026 17:54

htejun added this pull request to the merge queue Apr 15, 2026

Merged via the queue into sched-ext:main with commit 8e1cc4b Apr 15, 2026
35 of 42 checks passed

htejun merged 2 commits into sched-ext:main from galpt:scx_flow_v2_2_scalability


4 participants
