[bpf-ci-bot] file_reader/on_open_expect_fault fails deterministically in parallel mode #451

@kernel-patches-review-bot

Summary

The file_reader/on_open_expect_fault BPF selftest fails 100% of the time in parallel
mode (test_progs -j) on x86_64 with gcc-15. This has been observed across 15+ independent
PRs in the kernel-patches/bpf CI from Feb 19-24, 2026. The failure is not caused by any
specific patch — it is a systemic CI issue that affects every parallel test run.

While the parallel jobs currently use continue_on_error: true (so the failures don't block
CI), they still degrade the signal-to-noise ratio: every parallel run reports at least 2
FAILed tests (this test fails in both test_progs_parallel and test_progs_no_alu32_parallel),
making it harder to notice genuine regressions.

Failure Details

Test: file_reader/on_open_expect_fault (test #122, subtest 1)

Affected jobs: test_progs_parallel and test_progs_no_alu32_parallel on x86_64 gcc-15

Error message:

run_test:PASS:initialize file contents 0 nsec
run_test:PASS:file_reader__open 0 nsec
run_test:PASS:file_reader__load 0 nsec
run_test:PASS:file_reader__attach 0 nsec
run_test:PASS:err 0 nsec
run_test:FAIL:run_success unexpected run_success: actual 0 != expected 1
#122/1   file_reader/on_open_expect_fault:FAIL

Passes in serial mode: The test passes 100% of the time in serial test_progs across all
architectures (x86_64, aarch64, s390x) and toolchains (gcc-15, llvm-21).

Root Cause Analysis

How the test works

  1. The userspace test (prog_tests/file_reader.c) reads 256KB from /proc/self/exe into a
    buffer, then calls madvise(MADV_PAGEOUT) on a 512KB region of the executable's mapped
    memory to evict pages from the page cache.

  2. The BPF program (progs/file_reader.c, on_open_expect_fault) is attached to the
    lsm/file_open hook. When triggered, it creates a file dynptr from the task's executable
    file and attempts to read at offset 256KB (in the paged-out region).

  3. Since the BPF program runs in a non-sleepable context, the underlying freader uses
    filemap_get_folio() — a pure page cache lookup that cannot initiate I/O. If the pages
    are not in cache, the read returns -EFAULT, which the test expects (success case).

Why it fails in parallel mode

In parallel mode, test_progs -j forks 4 worker processes (matching the VM's CPU count).
All workers execute code from the same test_progs binary. As workers run different
tests concurrently, they continuously fault pages of the binary into the page cache through
normal code execution (demand paging of .text section pages).

The pages at file offset 256K–512K of the test_progs binary contain executable code that
other workers actively access. Between the madvise(MADV_PAGEOUT) call and the BPF program's
bpf_dynptr_read(), other workers fault those pages back into cache. When the BPF program
then reads at offset 256K, the read succeeds (returns 0) instead of returning -EFAULT.
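The interleaving can be illustrated with a toy model in plain C. This is only a sketch: `model_cache`, `model_evict`, `model_fault_in` and `model_read_nosleep` are hypothetical names invented for illustration, not kernel or selftest code. It assumes a cache-only lookup that returns -EFAULT for a non-resident page and 0 otherwise.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_NR_PAGES  128u	/* 512KB of file, one flag per page */

/* Toy stand-in for the page cache of one file; not a kernel structure. */
struct model_cache {
	bool resident[MODEL_NR_PAGES];
};

/* Models MADV_PAGEOUT taking effect on [off, off + len). */
static void model_evict(struct model_cache *c, size_t off, size_t len)
{
	size_t first = off / MODEL_PAGE_SIZE;
	size_t end = (off + len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;

	for (size_t p = first; p < end && p < MODEL_NR_PAGES; p++)
		c->resident[p] = false;
}

/* Models a sleepable access, e.g. another worker executing code on that page. */
static void model_fault_in(struct model_cache *c, size_t off)
{
	size_t p = off / MODEL_PAGE_SIZE;

	if (p < MODEL_NR_PAGES)
		c->resident[p] = true;
}

/* Models a cache-only read: no I/O is possible, so a missing page is -EFAULT. */
static int model_read_nosleep(const struct model_cache *c, size_t off)
{
	size_t p = off / MODEL_PAGE_SIZE;

	if (p >= MODEL_NR_PAGES || !c->resident[p])
		return -EFAULT;
	return 0;
}
```

In this model, calling model_evict() on the 256K–512K range makes model_read_nosleep() at 256K return -EFAULT. If model_fault_in() for the same page runs in between, as another worker's instruction fetch would, the read returns 0.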

When bpf_dynptr_read() returns 0 (unexpected success):

  • local_err = 0 (return value of the read)
  • The if (local_err == -EFAULT) check fails — run_success is never set to 1
  • At the out label: if (local_err) is false — err is never set
  • Result: err = 0, run_success = 0 — exactly matching the observed failure
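The branch logic above can be reproduced in a small userspace model. This is a sketch of the control flow as described in this issue, not the BPF program itself. `fault_result` and `model_expect_fault` are hypothetical names, and the model assumes the -EFAULT branch also clears local_err so the expected fault is not reported through err.

```c
#include <errno.h>

/* Result flags the userspace test inspects; names follow the issue text. */
struct fault_result {
	int err;
	int run_success;
};

/*
 * Userspace model of the tail of on_open_expect_fault.
 * read_ret stands in for the return value of bpf_dynptr_read().
 * Assumption: the fault branch resets local_err to 0.
 */
static struct fault_result model_expect_fault(int read_ret)
{
	struct fault_result res = { .err = 0, .run_success = 0 };
	int local_err = read_ret;

	if (local_err == -EFAULT) {
		/* Expected outcome: page was not in cache. */
		local_err = 0;
		res.run_success = 1;
	}

	/* Corresponds to the "out" label: only real errors reach err. */
	if (local_err)
		res.err = local_err;

	return res;
}
```

With read_ret == 0 the model yields err = 0 and run_success = 0, the same pair of values as in the failing log. With read_ret == -EFAULT it yields err = 0 and run_success = 1.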

Previous fix was insufficient

Commit 5913e936f6d5 ("selftests/bpf: Fix intermittent failures in file_reader test", Oct
2025) addressed a narrower version of this bug: two concurrent instances of the file_reader
test could interfere with each other's page cache state. The fix separated the read regions
for the on_open_expect_fault (256K–512K) and on_open_validate_file_read (0–256K) subtests.

However, the fix only prevents interference between file_reader subtests running concurrently.
In practice, any concurrent process executing code from the same binary brings pages back
into cache. With 4 workers continuously executing test code, pages at offset 256K–512K are
kept hot in cache, making the test fail deterministically rather than intermittently.

Proposed Fix

Rename test_file_reader to serial_test_file_reader. In the test_progs framework, the
serial_test_ prefix ensures the test runs only in the main process after all parallel
workers have exited, so no other test_progs worker can fault the binary's pages back
into cache while the test runs.

This follows the established precedent: serial_test_build_id() in prog_tests/build_id.c
uses the serial prefix for the same reason — it relies on MADV_PAGEOUT to evict pages and
cannot tolerate concurrent processes faulting them back in.

The patch is a one-line change:

-void test_file_reader(void)
+void serial_test_file_reader(void)

Why not mincore() + retry?

The uprobe_multi test uses a mincore()-based retry loop to verify pages are evicted. This
approach works there because uprobe_multi is a separate binary that other workers don't
execute code from. For file_reader, the BPF program reads from the test_progs binary
itself (via bpf_get_task_exe_file()), and other workers continuously fault its pages back
in. A retry loop would spin indefinitely or waste CI time without guaranteeing success.
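For reference, a bounded mincore()-based retry could look roughly like the sketch below. It is not the uprobe_multi code. `resident_pages` and `evict_with_retry` are hypothetical helpers, and the fallback value for MADV_PAGEOUT is an assumption for older libc headers.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* Linux >= 5.4; older libc headers may lack it */
#endif

/*
 * Count the pages in [addr, addr + len) that mincore() reports as resident.
 * addr must be page-aligned. Returns a negative errno on failure.
 */
static long resident_pages(void *addr, size_t len)
{
	size_t page_sz = (size_t)sysconf(_SC_PAGESIZE);
	size_t nr = (len + page_sz - 1) / page_sz;
	unsigned char *vec = malloc(nr);
	long count = 0;

	if (!vec)
		return -ENOMEM;
	if (mincore(addr, len, vec)) {
		int saved = errno;

		free(vec);
		return -saved;
	}
	for (size_t i = 0; i < nr; i++)
		count += vec[i] & 1;	/* bit 0 set means resident */
	free(vec);
	return count;
}

/*
 * Ask for eviction up to max_tries times. Returns 0 once mincore() sees no
 * resident page, -EAGAIN if pages are still resident after the last try,
 * or another negative errno if madvise()/mincore() fail.
 */
static int evict_with_retry(void *addr, size_t len, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		long n;

		if (madvise(addr, len, MADV_PAGEOUT))
			return -errno;
		n = resident_pages(addr, len);
		if (n < 0)
			return (int)n;
		if (n == 0)
			return 0;
	}
	return -EAGAIN;
}
```

Even when such a loop returns 0, the result only describes the moment mincore() ran. Another worker can fault the pages back in before the lsm/file_open hook fires, so the check narrows the window without closing it.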

Impact

  • Before fix: Every parallel CI run reports 2 FAILed tests (this test in both
    test_progs_parallel and test_progs_no_alu32_parallel). Developers reviewing CI results
    must mentally filter these false failures.
  • After fix: The test moves to serial execution where it passes reliably. No change to
    test coverage — the same assertions are checked; only the execution mode changes.

Example Failing CI Runs

All runs on the kernel-patches/bpf repository from Feb 19–24, 2026:

| Run ID      | PR Title                                       | Job                 |
|-------------|------------------------------------------------|---------------------|
| 22326859477 | bpf: Allow void return type for global subprogs | test_progs_parallel |
| 22323542825 | bpf/verifier: pruning branches                 | test_progs_parallel |
| 22314940225 | powerpc64/bpf: various fixes                   | test_progs_parallel |
| 22314958565 | Introduce KF_FORBID_SLEEP                      | test_progs_parallel |
| 22314858282 | for-next_test                                  | test_progs_parallel |
| 22313783502 | devmap: fix stack-out-of-bounds                | test_progs_parallel |
| 22299280479 | netdev CI testing                              | test_progs_parallel |
| 22196940809 | bpf: Expand usage scenarios of bpf_kptr_xchg   | test_progs_parallel |
| 22186759560 | bpf: Introduce 64-bit bitops kfuncs            | test_progs_parallel |
| 22233210147 | libbpf/bpftool: support merging split BTFs     | test_progs_parallel |

Files

  • tools/testing/selftests/bpf/prog_tests/file_reader.c — one-line fix
  • Patch: 0001-selftests-bpf-Make-file_reader-test-serial-to-fix-pa.patch
