[bpf-ci-bot] Flaky test: wq/ok_sleepable on s390x due to insufficient sleep


## Summary

The `wq` selftest (`serial_test_wq`) flakes on s390x because it waits only 50 microseconds (`usleep(50)`) for a workqueue callback to complete. The callback is scheduled via `schedule_work()` on `system_wq` (per-CPU bound), and on s390x the kworker thread is not always scheduled soon enough to run it within that window.

## Failure Details
- **Test / Component:** `wq` (`serial_test_wq`) in `test_progs_no_alu32`
- **Frequency:** Rare (observed in 1 of 8 examined runs, and only on s390x `test_progs_no_alu32`). The same test passed on s390x `test_progs` in the same CI run.
- **Failure mode:** Flaky — `ok_sleepable` reads 0 instead of expected 2, meaning the workqueue callback never executed before the check.
- **Affected architectures:** s390x (observed), potentially any architecture under load
- **CI runs observed:**
  - https://github.com/kernel-patches/bpf/actions/runs/22374083398 (FAIL — s390x test_progs_no_alu32, wq:FAIL)
  - https://github.com/kernel-patches/bpf/actions/runs/22365755214 (PASS — s390x test_progs_cpuv4, wq:OK)
  - https://github.com/kernel-patches/bpf/actions/runs/22346468790 (PASS — s390x test_progs_no_alu32, wq:OK)

## Root Cause Analysis

The test `serial_test_wq` (`tools/testing/selftests/bpf/prog_tests/wq.c:7`) opens a BPF skeleton, runs `test_syscall_array_sleepable` (which calls `bpf_wq_start` to schedule a workqueue callback), then sleeps 50 microseconds and checks `ok_sleepable`.

The call chain is:
1. `test_syscall_array_sleepable` → `test_elem_callback(&array, &key, wq_cb_sleepable)` — initializes and starts the workqueue
2. `bpf_wq_start` (`kernel/bpf/helpers.c:3177`) → `schedule_work(&w->work)` — queues `bpf_wq_work` on `system_wq`
3. `bpf_wq_work` (`kernel/bpf/helpers.c:1200`) → runs `wq_cb_sleepable` → sets `ok_sleepable |= (1 << 1)`

The BPF program runs under `migrate_disable()` (from `bpf_prog_run_pin_on_cpu`), pinning execution to one CPU. The work is queued on that same CPU's `system_wq` worker pool. After the syscall returns, the kworker thread must be scheduled to process the work item.

On s390x, workqueue scheduling latency can exceed 50 microseconds, so the test can read `ok_sleepable` before the callback has fired. The comment in the test says "10 usecs should be enough, but give it extra", but the resulting 50-microsecond sleep still does not leave enough margin.

The issue is likely exacerbated by the refactoring in `1bfbc267ec91` ("bpf: Enable bpf_timer and bpf_wq in any context"), which added atomic refcount operations (`refcount_inc_not_zero`, `bpf_async_refcount_put`) to the `bpf_wq_start` path. The added overhead is marginal, but it further narrows an already tight 50-microsecond window.

## Proposed Fix

Replace `usleep(50)` with a polling loop that checks `ok_sleepable` every 1ms, up to 100ms total. This gives the workqueue callback ample time to complete while still exiting quickly in the common case (typically 1-2 iterations). See attached patch.
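
The polling loop described above could be sketched roughly as follows. This is a minimal illustration, not the attached patch: `wait_for_value` and its parameters are hypothetical names, and the real test would presumably read `ok_sleepable` through its BPF skeleton rather than through a plain pointer.

```c
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the selftests): poll *val until it
 * equals 'expected', sleeping 1 ms between checks, for at most
 * 'timeout_ms' sleeps. Returns true as soon as the value is seen and
 * false if it never appears within the timeout.
 */
static bool wait_for_value(const volatile int *val, int expected, int timeout_ms)
{
	for (int i = 0; i < timeout_ms; i++) {
		if (*val == expected)
			return true;	/* common case: exit on an early check */
		usleep(1000);		/* 1 ms between checks */
	}
	/* One last check after the final sleep. */
	return *val == expected;
}
```

In the test, the fixed sleep would then be replaced by something like `wait_for_value(&flag, 1 << 1, 100)` followed by the existing check on `ok_sleepable`, so a slow kworker costs a few extra milliseconds instead of a spurious failure.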

## Impact

Without the fix, the `wq` test will continue to flake intermittently on s390x, causing false CI failures that waste developer time investigating unrelated test breakage.

## References
- `tools/testing/selftests/bpf/prog_tests/wq.c:31` — the `usleep(50)` that is too short
- `kernel/bpf/helpers.c:3177` — `bpf_wq_start` function
- `kernel/bpf/helpers.c:1200` — `bpf_wq_work` workqueue callback
- `82e38a505c98` ("selftests/bpf: Fix wq test.") — original fix that acknowledged delayed callbacks
- `1bfbc267ec91` ("bpf: Enable bpf_timer and bpf_wq in any context") — refactoring that added overhead

