The sched_ext numa selftest is flaky, failing intermittently in BPF CI
due to an inherently racy idle-state assertion in the numa_select_cpu
BPF program. The test picks an idle CPU (which atomically clears its idle
bit), then immediately re-checks the idle cpumask — but the CPU can
legitimately transition back to idle in the intervening window.
The numa test (tools/testing/selftests/sched_ext/numa.bpf.c) loads a
BPF sched_ext scheduler with per-NUMA-node DSQs and NUMA-aware idle CPU
selection. The numa_select_cpu callback:
1. Calls scx_bpf_pick_idle_cpu_node() to find an idle CPU in the task's
   NUMA node. This atomically clears the CPU's idle bit via
   cpumask_test_and_clear_cpu() in pick_idle_cpu_in_node()
   (kernel/sched/ext_idle.c:137).
2. If no idle CPU is found, falls back to scx_bpf_pick_any_cpu_node().
3. Asserts that the picked CPU is no longer in the idle cpumask
   by calling is_cpu_idle() → scx_bpf_get_idle_cpumask_node() →
   bpf_cpumask_test_cpu().
4. Asserts that the picked CPU belongs to the correct NUMA node.
The race
Between step 1 (clearing the idle bit) and step 3 (checking the idle
cpumask), the selected CPU can legitimately transition back to idle:
CPU 0 (running numa_select_cpu): CPU 2 (running its last task):
───────────────────────────────── ─────────────────────────────────
pick_idle_cpu_node() → CPU 2
cpumask_test_and_clear_cpu(2) task finishes, enters idle
(CPU 2 idle bit: 0) __scx_update_idle(rq, true, true)
update_builtin_idle(2, true)
assign_cpu(2, idle_cpus, true)
(CPU 2 idle bit: 1)
is_cpu_idle(2) → true!
scx_bpf_error("CPU 2 should
be marked as busy")
The update_builtin_idle() call at kernel/sched/ext_idle.c:691
uses assign_cpu() to set the idle bit. This is a normal idle
transition — the CPU genuinely has no more work — and is not a bug
in the idle tracking infrastructure.
Why it manifests in CI
The CI VMs run QEMU with 4 vCPUs on a single NUMA node ("No NUMA
configuration found" in the boot log). With few CPUs and a light
workload (the userspace test just calls sleep(1)), CPUs oscillate
frequently between idle and busy. The window between pick_idle_cpu
and is_cpu_idle is small but non-zero, and with thousands of
numa_select_cpu invocations during the 1-second test window,
the race triggers occasionally.
The node check is not racy
The second assertion — scx_bpf_cpu_node(cpu) != node — is safe
because a CPU's NUMA node membership is a static hardware property
that does not change at runtime.
Proposed Fix
Remove the racy is_cpu_idle() check and its helper function from numa_select_cpu. The NUMA node membership check is preserved as
the core validation of NUMA-aware functionality.
The atomicity of scx_bpf_pick_idle_cpu_node()'s idle-bit clearing
is guaranteed by cpumask_test_and_clear_cpu() — a well-tested
atomic operation in the kernel's cpumask infrastructure. A post-hoc
check from a different CPU cannot reliably validate this atomicity
due to the inherent TOCTOU race.
See patch: 0001-selftests-sched_ext-Fix-flaky-numa-test-by-removing-racy-idle-check.patch
Impact
Before fix: The sched_ext numa test fails intermittently,
causing unrelated patch series to appear as failing. Since sched_ext
runs with continue_on_error: false, this blocks CI and
requires developers to re-run or manually inspect failures.
After fix: The racy assertion is removed. The test still
validates the core NUMA functionality: per-node DSQ creation,
NUMA-aware CPU selection, and correct node membership. The only
removed check is the idle-state verification, which cannot be
reliably performed from a different CPU.
Failure Details
The failing test is numa (test #13 in the sched_ext suite), which
aborts with scx_bpf_error("CPU 2 should be marked as busy") from
numa.bpf.c:52. The previously flaky rt_stall test hit the same class
of race.
References
numa failure: https://github.com/kernel-patches/bpf/actions/runs/22370810659
scx_idle_test_and_clear_cpu(): kernel/sched/ext_idle.c:77
update_builtin_idle(): kernel/sched/ext_idle.c:691
__scx_update_idle(): kernel/sched/ext_idle.c:734
rt_stall flakiness: commit 0b82cc331d2e ("selftests/sched_ext: Fix rt_stall flaky failure")