[bpf-ci-bot] The sched_ext numa selftest is flaky #453

@kernel-patches-review-bot

Description

The sched_ext numa selftest is flaky, failing intermittently in BPF CI
due to an inherently racy idle-state assertion in the numa_select_cpu
BPF program. The test picks an idle CPU (which atomically clears its idle
bit), then immediately re-checks the idle cpumask — but the CPU can
legitimately transition back to idle in the intervening window.

Failure Details

Root Cause Analysis

How the test works

The numa test (tools/testing/selftests/sched_ext/numa.bpf.c) loads a
BPF sched_ext scheduler with per-NUMA-node DSQs and NUMA-aware idle CPU
selection. The numa_select_cpu callback:

  1. Calls scx_bpf_pick_idle_cpu_node() to find an idle CPU in the task's
    NUMA node. This atomically clears the CPU's idle bit via
    cpumask_test_and_clear_cpu() in pick_idle_cpu_in_node()
    (kernel/sched/ext_idle.c:137).

  2. If no idle CPU is found, falls back to scx_bpf_pick_any_cpu_node().

  3. Asserts that the picked CPU is no longer in the idle cpumask
    by calling is_cpu_idle(), which uses
    scx_bpf_get_idle_cpumask_node() and bpf_cpumask_test_cpu().

  4. Asserts that the picked CPU belongs to the correct NUMA node.
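
The four steps above can be sketched as a single-threaded user-space model. This is not the selftest's BPF code: every model_* name, the numa_model struct, and the 4-CPU size are hypothetical, the fallback in step 2 is simplified to "first CPU on the node", and the bit clear is a plain store rather than the kernel's atomic operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4 /* hypothetical; mirrors the 4-vCPU CI VM */

struct numa_model {
    uint32_t idle_mask;          /* bit n set => CPU n is idle */
    int cpu_node[MODEL_NR_CPUS]; /* static CPU -> NUMA node map */
};

/* Step 1: pick an idle CPU on `node` and clear its idle bit; -1 if none. */
static int model_pick_idle_cpu_node(struct numa_model *m, int node)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        uint32_t bit = UINT32_C(1) << cpu;

        if (m->cpu_node[cpu] == node && (m->idle_mask & bit)) {
            m->idle_mask &= ~bit; /* the kernel does this atomically */
            return cpu;
        }
    }
    return -1;
}

/* Step 2 (simplified): any CPU on `node`, idle or not; -1 if none. */
static int model_pick_any_cpu_node(const struct numa_model *m, int node)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        if (m->cpu_node[cpu] == node)
            return cpu;
    return -1;
}

/* Step 3 helper: is the CPU's bit set in the idle mask? */
static bool model_is_cpu_idle(const struct numa_model *m, int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return (m->idle_mask >> cpu) & 1u;
}

/*
 * Steps 1-4 in order. Returns the chosen CPU, or -1 at the points
 * where the real test would report an error.
 */
static int model_numa_select_cpu(struct numa_model *m, int node)
{
    int cpu = model_pick_idle_cpu_node(m, node);

    if (cpu < 0)
        cpu = model_pick_any_cpu_node(m, node);
    if (cpu < 0)
        return -1;
    if (model_is_cpu_idle(m, cpu)) /* step 3: idle-state assertion */
        return -1;
    if (m->cpu_node[cpu] != node)  /* step 4: node assertion */
        return -1;
    return cpu;
}
```

In this single-threaded model step 3 can never fail, because nothing else touches idle_mask between the pick and the re-check. The real test has no such guarantee.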

The race

Between step 1 (clearing the idle bit) and step 3 (checking the idle
cpumask), the selected CPU can legitimately transition back to idle:

CPU 0 (running numa_select_cpu):     CPU 2 (running its last task):
─────────────────────────────────    ─────────────────────────────────
pick_idle_cpu_node() → CPU 2
  cpumask_test_and_clear_cpu(2)      task finishes, enters idle
  (CPU 2 idle bit: 0)                __scx_update_idle(rq, true, true)
                                       update_builtin_idle(2, true)
                                       assign_cpu(2, idle_cpus, true)
                                       (CPU 2 idle bit: 1)
is_cpu_idle(2) → true!
  scx_bpf_error("CPU 2 should
    be marked as busy")

The update_builtin_idle() call at kernel/sched/ext_idle.c:691
uses assign_cpu() to set the idle bit. This is a normal idle
transition — the CPU genuinely has no more work — and is not a bug
in the idle tracking infrastructure.
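
The interleaving in the timeline can be replayed deterministically in user space. The sketch below is an assumption-laden model, not kernel code: the model_* helpers and replay_timeline() are hypothetical names, the idle cpumask is reduced to one atomic word, and the two CPUs' actions are executed in program order on one thread to force the ordering shown above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the idle cpumask: bit n set => CPU n idle. */
static atomic_uint model_idle_mask;

/* Models the atomic test-and-clear done when an idle CPU is picked. */
static bool model_test_and_clear_idle(unsigned int cpu)
{
    assert(cpu < 32);
    unsigned int bit = 1u << cpu;

    return (atomic_fetch_and(&model_idle_mask, ~bit) & bit) != 0;
}

/* Models the idle-entry path setting the bit again. */
static void model_enter_idle(unsigned int cpu)
{
    assert(cpu < 32);
    atomic_fetch_or(&model_idle_mask, 1u << cpu);
}

/* Models the re-check performed by is_cpu_idle(). */
static bool model_is_cpu_idle(unsigned int cpu)
{
    assert(cpu < 32);
    return (atomic_load(&model_idle_mask) & (1u << cpu)) != 0;
}

/*
 * Replays the timeline for CPU 2. Returns true when the re-check sees
 * the CPU as idle, i.e. when the test's assertion would fire.
 */
static bool replay_timeline(bool cpu2_reenters_idle)
{
    atomic_store(&model_idle_mask, 1u << 2); /* CPU 2 starts idle */

    if (!model_test_and_clear_idle(2))       /* CPU 0: pick CPU 2 */
        return false;
    if (cpu2_reenters_idle)
        model_enter_idle(2);                 /* CPU 2: goes idle again */

    return model_is_cpu_idle(2);             /* CPU 0: re-check */
}
```

Here replay_timeline(false) returns false and replay_timeline(true) returns true, even though every individual operation on the mask is atomic. The failure comes from the ordering, not from a lost update.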

Why it manifests in CI

The CI VMs run QEMU with 4 vCPUs on a single NUMA node (the log shows
"No NUMA configuration found"). With few CPUs and a light workload
(the userspace test just calls sleep(1)), CPUs frequently oscillate
between idle and busy states. The window between
scx_bpf_pick_idle_cpu_node() and is_cpu_idle() is small but non-zero,
and with thousands of numa_select_cpu invocations during the 1-second
test window, the race triggers occasionally.
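
A rough back-of-envelope model shows why a tiny window still produces visible flakiness. It assumes, purely for illustration, that each invocation hits the window independently with the same probability p, and that there are n invocations per run. Neither value was measured for this issue.

```latex
P(\text{at least one failure per run}) = 1 - (1 - p)^{n} \approx 1 - e^{-np}
\qquad \text{for small } p
```

With hypothetical values p = 10^{-5} and n = 5000, this gives 1 - e^{-0.05}, or roughly a 4.9% failure rate per run: rare for any single invocation, but frequent enough to show up across many CI runs.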

The node check is not racy

The second assertion — scx_bpf_cpu_node(cpu) != node — is safe
because a CPU's NUMA node membership is a static hardware property
that does not change at runtime.

Proposed Fix

Remove the racy is_cpu_idle() check and its helper function from
numa_select_cpu. The NUMA node membership check is preserved as
the core validation of NUMA-aware functionality.

The atomicity of scx_bpf_pick_idle_cpu_node()'s idle-bit clearing
is guaranteed by cpumask_test_and_clear_cpu() — a well-tested
atomic operation in the kernel's cpumask infrastructure. A post-hoc
check from a different CPU cannot reliably validate this atomicity
due to the inherent TOCTOU race.
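
The validation that remains after the fix can be modelled as a pure lookup. This is a sketch under assumptions, not the patch itself: the model_* names and the two-node topology are hypothetical (the CI VM has a single node), chosen only so that the check has something to reject.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count */

/* Hypothetical static topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1. */
static const int model_cpu_node[MODEL_NR_CPUS] = { 0, 0, 1, 1 };

/* Models a CPU -> node lookup; the mapping never changes at runtime. */
static int model_cpu_to_node(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return model_cpu_node[cpu];
}

/*
 * Post-fix validation: only node membership is checked. The idle mask
 * is not consulted, so concurrent idle transitions cannot affect the
 * result.
 */
static bool model_validate_pick(int cpu, int node)
{
    if (cpu < 0 || cpu >= MODEL_NR_CPUS)
        return false;
    return model_cpu_to_node(cpu) == node;
}
```

Because the function reads only immutable data, calling it at any point after the pick gives the same answer, which is the property the removed idle-state check lacked.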

See patch: 0001-selftests-sched_ext-Fix-flaky-numa-test-by-removing-racy-idle-check.patch

Impact

  • Before fix: The sched_ext numa test fails intermittently,
    causing unrelated patch series to appear as failing. Since sched_ext
    runs with continue_on_error: false, this blocks CI and
    requires developers to re-run or manually inspect failures.

  • After fix: The racy assertion is removed. The test still
    validates the core NUMA functionality: per-node DSQ creation,
    NUMA-aware CPU selection, and correct node membership. The only
    removed check is the idle-state verification, which cannot be
    reliably performed from a different CPU.

References

  • CI run with numa failure: https://github.com/kernel-patches/bpf/actions/runs/22370810659
  • scx_idle_test_and_clear_cpu(): kernel/sched/ext_idle.c:77
  • update_builtin_idle(): kernel/sched/ext_idle.c:691
  • __scx_update_idle(): kernel/sched/ext_idle.c:734
  • Related fix for rt_stall flakiness: commit 0b82cc331d2e ("selftests/sched_ext: Fix rt_stall flaky failure")
