[bpf-ci-bot] file_reader/on_open_expect_fault fails deterministically in parallel mode #451


## Summary

The `file_reader/on_open_expect_fault` BPF selftest fails **100% of the time** in parallel
mode (`test_progs -j`) on x86_64 with gcc-15. The failure appeared in 15+ independent PRs
in the `kernel-patches/bpf` CI between Feb 19 and Feb 24, 2026. No specific patch causes it:
this is a systemic CI issue that affects every parallel test run.

The parallel jobs currently use `continue_on_error: true`, so the failures do not block CI.
The failures still degrade the signal-to-noise ratio: every parallel run reports at least 2
FAILed tests (this test fails in both `test_progs_parallel` and `test_progs_no_alu32_parallel`),
which makes genuine regressions harder to notice.

## Failure Details

**Test:** `file_reader/on_open_expect_fault` (test #122, subtest 1)

**Affected jobs:** `test_progs_parallel` and `test_progs_no_alu32_parallel` on x86_64 gcc-15

**Error message:**
```
run_test:PASS:initialize file contents 0 nsec
run_test:PASS:file_reader__open 0 nsec
run_test:PASS:file_reader__load 0 nsec
run_test:PASS:file_reader__attach 0 nsec
run_test:PASS:err 0 nsec
run_test:FAIL:run_success unexpected run_success: actual 0 != expected 1
#122/1   file_reader/on_open_expect_fault:FAIL
```

**Passes in serial mode:** The test passes 100% of the time in serial `test_progs` across all
architectures (x86_64, aarch64, s390x) and toolchains (gcc-15, llvm-21).

## Root Cause Analysis

### How the test works

1. The userspace test (`prog_tests/file_reader.c`) reads 256KB from `/proc/self/exe` into a
   buffer, then calls `madvise(MADV_PAGEOUT)` on a 512KB region of the executable's mapped
   memory to evict pages from the page cache.

2. The BPF program (`progs/file_reader.c`, `on_open_expect_fault`) is attached to the
   `lsm/file_open` hook. When triggered, it creates a file dynptr from the task's executable
   file and attempts to read at offset 256KB (in the paged-out region).

3. Since the BPF program runs in a **non-sleepable** context, the underlying `freader` uses
   `filemap_get_folio()`, a pure page cache lookup that cannot initiate I/O. If the pages
   are not in cache, the read returns `-EFAULT`, which is the outcome the test expects and
   counts as success.
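
The eviction in step 1 can be sketched in userspace C. This is a minimal illustration, not the selftest's code: the helper names `request_pageout` and `count_resident_pages` are hypothetical, and the fallback value for `MADV_PAGEOUT` assumes the Linux UAPI definition. `MADV_PAGEOUT` is only a request to reclaim pages, so another process can fault them back in at any time afterwards.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21 /* assumed from the Linux UAPI headers (Linux 5.4+) */
#endif

/*
 * Ask the kernel to reclaim the pages in [addr, addr + len).
 * addr must be page-aligned. Returns 0 on success, -1 with errno set.
 */
static int request_pageout(void *addr, size_t len)
{
	return madvise(addr, len, MADV_PAGEOUT);
}

/*
 * Count the pages in [addr, addr + len) that mincore() reports as resident.
 * addr must be page-aligned. Returns -1 on error.
 */
static long count_resident_pages(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t npages = (len + (size_t)page - 1) / (size_t)page;
	unsigned char *vec = malloc(npages);
	long resident = 0;

	if (!vec)
		return -1;
	if (mincore(addr, len, vec)) {
		free(vec);
		return -1;
	}
	for (size_t i = 0; i < npages; i++)
		resident += vec[i] & 1; /* bit 0 set means the page is resident */
	free(vec);
	return resident;
}
```

Calling `request_pageout()` on a mapped region and then `count_resident_pages()` on the same region shows whether the eviction held at that instant. It says nothing about the state a moment later.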

### Why it fails in parallel mode

In parallel mode, `test_progs -j` forks 4 worker processes (matching the VM's CPU count).
All workers execute code from the **same `test_progs` binary**. As workers run different
tests concurrently, they continuously fault pages of the binary into the page cache through
normal code execution (demand paging of `.text` section pages).

The pages at file offset 256K–512K of the `test_progs` binary contain executable code that
other workers actively access. Between the `madvise(MADV_PAGEOUT)` call and the BPF program's
`bpf_dynptr_read()`, other workers fault those pages back into cache. When the BPF program
then reads at offset 256K, the read **succeeds** (returns 0) instead of returning `-EFAULT`.

When `bpf_dynptr_read()` returns 0 (unexpected success):
- `local_err = 0` (return value of the read)
- The `if (local_err == -EFAULT)` check fails — `run_success` is never set to 1
- At the `out` label: `if (local_err)` is false — `err` is never set
- Result: `err = 0, run_success = 0` — exactly matching the observed failure
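
The outcome logic above can be modeled as a small pure function. This is a simplified sketch, not the BPF program itself: `struct read_outcome` and `classify_read` are hypothetical names. The sketch also assumes that the program clears `local_err` in the `-EFAULT` branch (inferred from the expected case leaving `err` at 0) and that `err` is set to 1 on any other error.

```c
#include <errno.h>

/* Hypothetical container for the two values the userspace test checks. */
struct read_outcome {
	int err;
	int run_success;
};

/*
 * Simplified model of how the result of the dynptr read is classified.
 * Assumption: on -EFAULT the program resets local_err before the final
 * check, so the expected case ends with err == 0 and run_success == 1.
 */
static struct read_outcome classify_read(int local_err)
{
	struct read_outcome out = { .err = 0, .run_success = 0 };

	if (local_err == -EFAULT) {
		/* Expected case: the pages were not in the page cache. */
		out.run_success = 1;
		local_err = 0;
	}
	if (local_err)
		out.err = 1; /* assumed value for any other error */

	return out;
}
```

With this model, `classify_read(-EFAULT)` yields `err = 0, run_success = 1`, while `classify_read(0)` yields `err = 0, run_success = 0`, the combination seen in the failing log.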

### Previous fix was insufficient

Commit `5913e936f6d5` ("selftests/bpf: Fix intermittent failures in file_reader test", Oct
2025) addressed a narrower version of this bug: two concurrent instances of the file_reader
test could interfere with each other's page cache state. The fix separated the read regions
for the `on_open_expect_fault` (256K–512K) and `on_open_validate_file_read` (0–256K) subtests.

However, the fix only prevents interference between file_reader subtests running concurrently.
In practice, **any** concurrent process executing code from the same binary brings pages back
into cache. With 4 workers continuously executing test code, pages at offset 256K–512K are
kept hot in cache, making the test fail deterministically rather than intermittently.

## Proposed Fix

Rename `test_file_reader` to `serial_test_file_reader`. In the `test_progs` framework, the
`serial_test_` prefix ensures the test runs only in the main process after all parallel
workers have exited, guaranteeing exclusive control over page cache state.

This follows the established precedent: `serial_test_build_id()` in `prog_tests/build_id.c`
uses the serial prefix for the same reason — it relies on `MADV_PAGEOUT` to evict pages and
cannot tolerate concurrent processes faulting them back in.

The patch is a one-line change:
```diff
-void test_file_reader(void)
+void serial_test_file_reader(void)
```
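
To illustrate the naming convention only, the sketch below classifies an entry point by its prefix. It does not reproduce how `test_progs` actually selects serial tests; `is_serial_entry` and `SERIAL_PREFIX` are hypothetical names introduced for this example.

```c
#include <stdbool.h>
#include <string.h>

#define SERIAL_PREFIX "serial_test_"

/*
 * Return true if the entry point name carries the serial prefix.
 * Illustrative only: the real framework may pick the mode differently.
 */
static bool is_serial_entry(const char *fn_name)
{
	return strncmp(fn_name, SERIAL_PREFIX, strlen(SERIAL_PREFIX)) == 0;
}
```

Under this convention, `is_serial_entry("test_file_reader")` is false before the rename and `is_serial_entry("serial_test_file_reader")` is true after it.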

### Why not mincore() + retry?

The `uprobe_multi` test uses a `mincore()`-based retry loop to verify pages are evicted. This
approach works there because `uprobe_multi` is a separate binary that other workers don't
execute code from. For `file_reader`, the BPF program reads from the `test_progs` binary
itself (via `bpf_get_task_exe_file()`), and other workers continuously fault its pages back
in. A retry loop would spin indefinitely or waste CI time without guaranteeing success.

## Impact

- **Before fix:** Every parallel CI run reports 2 FAILed tests (this test in both
  `test_progs_parallel` and `test_progs_no_alu32_parallel`). Developers reviewing CI results
  must mentally filter these false failures.
- **After fix:** The test moves to serial execution where it passes reliably. No change to
  test coverage — the same assertions are checked; only the execution mode changes.

## Example Failing CI Runs

All runs on the `kernel-patches/bpf` repository from Feb 19–24, 2026:

| Run ID | PR Title | Job |
|--------|----------|-----|
| 22326859477 | bpf: Allow void return type for global subprogs | test_progs_parallel |
| 22323542825 | bpf/verifier: pruning branches | test_progs_parallel |
| 22314940225 | powerpc64/bpf: various fixes | test_progs_parallel |
| 22314958565 | Introduce KF_FORBID_SLEEP | test_progs_parallel |
| 22314858282 | for-next_test | test_progs_parallel |
| 22313783502 | devmap: fix stack-out-of-bounds | test_progs_parallel |
| 22299280479 | netdev CI testing | test_progs_parallel |
| 22196940809 | bpf: Expand usage scenarios of bpf_kptr_xchg | test_progs_parallel |
| 22186759560 | bpf: Introduce 64-bit bitops kfuncs | test_progs_parallel |
| 22233210147 | libbpf/bpftool: support merging split BTFs | test_progs_parallel |

## Files

- `tools/testing/selftests/bpf/prog_tests/file_reader.c` — one-line fix
- Patch: `0001-selftests-bpf-Make-file_reader-test-serial-to-fix-pa.patch`

