Summary
The file_reader/on_open_expect_fault BPF selftest fails 100% of the time in parallel
mode (test_progs -j) on x86_64 with gcc-15. This has been observed across 15+ independent
PRs in the kernel-patches/bpf CI from Feb 19-24, 2026. The failure is not caused by any
specific patch — it is a systemic CI issue that affects every parallel test run.
While the parallel jobs currently use continue_on_error: true (so the failures don't block
CI), they still degrade the signal-to-noise ratio: every CI run reports at least 2 FAILed
tests (one in test_progs_parallel and one in test_progs_no_alu32_parallel), making it
harder to notice genuine regressions.
Failure Details
Test: file_reader/on_open_expect_fault (test #122, subtest 1)
Affected jobs: test_progs_parallel and test_progs_no_alu32_parallel on x86_64 gcc-15
Error message:
run_test:PASS:initialize file contents 0 nsec
run_test:PASS:file_reader__open 0 nsec
run_test:PASS:file_reader__load 0 nsec
run_test:PASS:file_reader__attach 0 nsec
run_test:PASS:err 0 nsec
run_test:FAIL:run_success unexpected run_success: actual 0 != expected 1
#122/1 file_reader/on_open_expect_fault:FAIL
Passes in serial mode: The test passes 100% of the time in serial test_progs across all
architectures (x86_64, aarch64, s390x) and toolchains (gcc-15, llvm-21).
Root Cause Analysis
How the test works
- The userspace test (prog_tests/file_reader.c) reads 256KB from /proc/self/exe into a
  buffer, then calls madvise(MADV_PAGEOUT) on a 512KB region of the executable's mapped
  memory to evict pages from the page cache.
- The BPF program (progs/file_reader.c, on_open_expect_fault) is attached to the
  lsm/file_open hook. When triggered, it creates a file dynptr from the task's executable
  file and attempts to read at offset 256KB (in the paged-out region).
- Since the BPF program runs in a non-sleepable context, the underlying freader uses
  filemap_get_folio(), a pure page cache lookup that cannot initiate I/O. If the pages
  are not in cache, the read returns -EFAULT, which the test expects (success case).
Why it fails in parallel mode
In parallel mode, test_progs -j forks 4 worker processes (matching the VM's CPU count).
All workers execute code from the same test_progs binary. As workers run different
tests concurrently, they continuously fault pages of the binary into the page cache through
normal code execution (demand paging of .text section pages).
The pages at file offset 256K–512K of the test_progs binary contain executable code that
other workers actively access. Between the madvise(MADV_PAGEOUT) call and the BPF program's
bpf_dynptr_read(), other workers fault those pages back into cache. When the BPF program
then reads at offset 256K, the read succeeds (returns 0) instead of returning -EFAULT.
When bpf_dynptr_read() returns 0 (unexpected success):
- local_err = 0 (return value of the read)
- The if (local_err == -EFAULT) check fails, so run_success is never set to 1
- At the out label: if (local_err) is false, so err is never set
- Result: err = 0, run_success = 0, exactly matching the observed failure
Previous fix was insufficient
Commit 5913e936f6d5 ("selftests/bpf: Fix intermittent failures in file_reader test", Oct
2025) addressed a narrower version of this bug: two concurrent instances of the file_reader
test could interfere with each other's page cache state. The fix separated the read regions
for the on_open_expect_fault (256K–512K) and on_open_validate_file_read (0–256K) subtests.
However, the fix only prevents interference between file_reader subtests running concurrently.
In practice, any concurrent process executing code from the same binary brings pages back
into cache. With 4 workers continuously executing test code, pages at offset 256K–512K are
kept hot in cache, making the test fail deterministically rather than intermittently.
Proposed Fix
Rename test_file_reader to serial_test_file_reader. In the test_progs framework, the
serial_test_ prefix ensures the test runs only in the main process after all parallel
workers have exited, guaranteeing exclusive control over page cache state.
This follows the established precedent: serial_test_build_id() in prog_tests/build_id.c
uses the serial prefix for the same reason — it relies on MADV_PAGEOUT to evict pages and
cannot tolerate concurrent processes faulting them back in.
The patch is a one-line change:
-void test_file_reader(void)
+void serial_test_file_reader(void)
Why not mincore() + retry?
The uprobe_multi test uses a mincore()-based retry loop to verify pages are evicted. This
approach works there because uprobe_multi is a separate binary that other workers don't
execute code from. For file_reader, the BPF program reads from the test_progs binary
itself (via bpf_get_task_exe_file()), and other workers continuously fault its pages back
in. A retry loop would spin indefinitely or waste CI time without guaranteeing success.
Impact
- Before fix: Every parallel CI run reports 2 FAILed tests (this test in both
test_progs_parallel and test_progs_no_alu32_parallel). Developers reviewing CI results
must mentally filter these false failures.
- After fix: The test moves to serial execution where it passes reliably. No change to
test coverage — the same assertions are checked; only the execution mode changes.
Example Failing CI Runs
All runs on the kernel-patches/bpf repository from Feb 19–24, 2026:
| Run ID | PR Title | Job |
| --- | --- | --- |
| 22326859477 | bpf: Allow void return type for global subprogs | test_progs_parallel |
| 22323542825 | bpf/verifier: pruning branches | test_progs_parallel |
| 22314940225 | powerpc64/bpf: various fixes | test_progs_parallel |
| 22314958565 | Introduce KF_FORBID_SLEEP | test_progs_parallel |
| 22314858282 | for-next_test | test_progs_parallel |
| 22313783502 | devmap: fix stack-out-of-bounds | test_progs_parallel |
| 22299280479 | netdev CI testing | test_progs_parallel |
| 22196940809 | bpf: Expand usage scenarios of bpf_kptr_xchg | test_progs_parallel |
| 22186759560 | bpf: Introduce 64-bit bitops kfuncs | test_progs_parallel |
| 22233210147 | libbpf/bpftool: support merging split BTFs | test_progs_parallel |
Files
- tools/testing/selftests/bpf/prog_tests/file_reader.c: one-line fix
- Patch: 0001-selftests-bpf-Make-file_reader-test-serial-to-fix-pa.patch