The sched_ext numa selftest is flaky, failing intermittently in BPF CI
due to an inherently racy idle-state assertion in the numa_select_cpu
BPF program. The test picks an idle CPU (which atomically clears its idle
bit), then immediately re-checks the idle cpumask — but the CPU can
legitimately transition back to idle in the intervening window.
The numa test (tools/testing/selftests/sched_ext/numa.bpf.c) loads a
BPF sched_ext scheduler with per-NUMA-node DSQs and NUMA-aware idle CPU
selection. The numa_select_cpu callback:
1. Calls scx_bpf_pick_idle_cpu_node() to find an idle CPU in the task's
   NUMA node. This atomically clears the CPU's idle bit via
   cpumask_test_and_clear_cpu() in pick_idle_cpu_in_node()
   (kernel/sched/ext_idle.c:137).
2. If no idle CPU is found, falls back to scx_bpf_pick_any_cpu_node().
3. Asserts that the picked CPU is no longer in the idle cpumask
   by calling is_cpu_idle() → scx_bpf_get_idle_cpumask_node() →
   bpf_cpumask_test_cpu().
4. Asserts that the picked CPU belongs to the correct NUMA node.
The race
Between step 1 (clearing the idle bit) and step 3 (checking the idle
cpumask), the selected CPU can legitimately transition back to idle:
CPU 0 (running numa_select_cpu): CPU 2 (running its last task):
───────────────────────────────── ─────────────────────────────────
pick_idle_cpu_node() → CPU 2
cpumask_test_and_clear_cpu(2) task finishes, enters idle
(CPU 2 idle bit: 0) __scx_update_idle(rq, true, true)
update_builtin_idle(2, true)
assign_cpu(2, idle_cpus, true)
(CPU 2 idle bit: 1)
is_cpu_idle(2) → true!
scx_bpf_error("CPU 2 should
be marked as busy")
The update_builtin_idle() call at kernel/sched/ext_idle.c:691
uses assign_cpu() to set the idle bit. This is a normal idle
transition — the CPU genuinely has no more work — and is not a bug
in the idle tracking infrastructure.
Why it manifests in CI
The CI VMs run QEMU with 4 vCPUs on a single NUMA node ("No NUMA
configuration found" in the boot log). With few CPUs and a light
workload (the userspace test just calls sleep(1)), CPUs oscillate
frequently between idle and busy. The window between pick_idle_cpu
and is_cpu_idle is small but non-zero, and with thousands of
numa_select_cpu invocations during the 1-second test window,
the race triggers occasionally.
The node check is not racy
The second assertion — scx_bpf_cpu_node(cpu) != node — is safe
because a CPU's NUMA node membership is a static hardware property
that does not change at runtime.
Proposed Fix
Remove the racy is_cpu_idle() check and its helper function from numa_select_cpu. The NUMA node membership check is preserved as
the core validation of NUMA-aware functionality.
The atomicity of scx_bpf_pick_idle_cpu_node()'s idle-bit clearing
is guaranteed by cpumask_test_and_clear_cpu() — a well-tested
atomic operation in the kernel's cpumask infrastructure. A post-hoc
check from a different CPU cannot reliably validate this atomicity
due to the inherent TOCTOU race.
See patch: 0001-selftests-sched_ext-Fix-flaky-numa-test-by-removing-racy-idle-check.patch
Impact
Before fix: The sched_ext numa test fails intermittently,
causing unrelated patch series to appear as failing. Since sched_ext
runs with continue_on_error: false, this blocks CI and
requires developers to re-run or manually inspect failures.
After fix: The racy assertion is removed. The test still
validates the core NUMA functionality: per-node DSQ creation,
NUMA-aware CPU selection, and correct node membership. The only
removed check is the idle-state verification, which cannot be
reliably performed from a different CPU.
Failure Details
The failing test is numa (test #13 in the sched_ext suite), which
aborts with scx_bpf_error("CPU 2 should be marked as busy") from
numa.bpf.c:52. The previously flaky rt_stall test hit the same class
of race.
References
numa failure: https://github.com/kernel-patches/bpf/actions/runs/22370810659
scx_idle_test_and_clear_cpu(): kernel/sched/ext_idle.c:77
update_builtin_idle(): kernel/sched/ext_idle.c:691
__scx_update_idle(): kernel/sched/ext_idle.c:734
rt_stall flakiness: commit 0b82cc331d2e ("selftests/sched_ext: Fix rt_stall flaky failure")