[bpf-ci-bot] Flaky test: wq/ok_sleepable on s390x due to insufficient sleep #455

@kernel-patches-review-bot

Summary

The wq selftest (serial_test_wq) flakes on s390x because it waits only 50 microseconds (usleep(50)) for a workqueue callback to complete. The workqueue callback is scheduled via schedule_work() on system_wq (per-CPU bound), and the kworker thread may not be scheduled quickly enough on s390x to complete within 50 microseconds.

Failure Details

Root Cause Analysis

The test serial_test_wq (tools/testing/selftests/bpf/prog_tests/wq.c:7) opens a BPF skeleton, runs test_syscall_array_sleepable (which calls bpf_wq_start to schedule a workqueue callback), then sleeps 50 microseconds and checks ok_sleepable.

The call chain is:

  1. test_syscall_array_sleepable → test_elem_callback(&array, &key, wq_cb_sleepable): initializes and starts the workqueue
  2. bpf_wq_start (kernel/bpf/helpers.c:3177) → schedule_work(&w->work): queues bpf_wq_work on system_wq
  3. bpf_wq_work (kernel/bpf/helpers.c:1200) → runs wq_cb_sleepable → sets ok_sleepable |= (1 << 1)

The BPF program runs under migrate_disable() (from bpf_prog_run_pin_on_cpu), pinning execution to one CPU. The work is queued on that same CPU's system_wq worker pool. After the syscall returns, the kworker thread must be scheduled to process the work item.

On s390x, workqueue scheduling latency can exceed 50 microseconds, so the test can read ok_sleepable before the callback has fired. The comment in the test says "10 usecs should be enough, but give it extra", but 50 usecs is evidently not enough margin on this architecture.
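
The race described above can be illustrated with a small toy model. This is only a sketch: the function names and the latency values used with it are hypothetical, and it models time as plain numbers rather than touching the kernel or the real selftest. It shows why a single check after a fixed sleep passes only when the callback latency is no larger than the sleep.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit that wq_cb_sleepable sets in ok_sleepable, per the call chain above. */
#define OK_SLEEPABLE_BIT (1u << 1)

/*
 * Toy model (hypothetical): value of the flag `elapsed_us` microseconds
 * after the work was queued, if the callback runs after `latency_us`.
 */
static uint32_t flag_after(uint32_t latency_us, uint32_t elapsed_us)
{
	return elapsed_us >= latency_us ? OK_SLEEPABLE_BIT : 0;
}

/* What one check after a fixed sleep of `sleep_us` would observe. */
static bool fixed_sleep_check_passes(uint32_t latency_us, uint32_t sleep_us)
{
	return flag_after(latency_us, sleep_us) == OK_SLEEPABLE_BIT;
}
```

With an assumed latency of 10 microseconds the check after a 50 microsecond sleep passes; with an assumed latency of 80 microseconds the same check fails even though the callback would have run shortly afterwards.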

The issue is likely exacerbated by the refactoring in 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context"), which added atomic refcount operations (refcount_inc_not_zero, bpf_async_refcount_put) to the bpf_wq_start path, adding marginal overhead.

Proposed Fix

Replace usleep(50) with a polling loop that checks ok_sleepable every 1ms, up to 100ms total. This gives the workqueue callback ample time to complete while still exiting quickly in the common case (typically 1-2 iterations). See attached patch.
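
A minimal sketch of such a polling loop follows. It is not the attached patch: the helper name wait_for_bits, its signature, and the unsigned int flag type are assumptions made for illustration. It checks the flag once per millisecond and gives up after the timeout.

```c
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the attached patch: poll *flag roughly every
 * 1 ms until all bits in `mask` are set or `timeout_ms` iterations have
 * passed. Returns true if the bits were observed set.
 */
static bool wait_for_bits(const volatile unsigned int *flag,
			  unsigned int mask, int timeout_ms)
{
	for (int waited_ms = 0; waited_ms < timeout_ms; waited_ms++) {
		if ((*flag & mask) == mask)
			return true;	/* common case: exits on an early iteration */
		usleep(1000);		/* sleep at least 1 ms between checks */
	}
	/* One last check after the final sleep. */
	return (*flag & mask) == mask;
}
```

In the selftest, a call such as wait_for_bits(&flag, 1 << 1, 100), with flag standing in for the skeleton's ok_sleepable variable, would take the place of the fixed usleep(50) before the existing check. Because usleep only guarantees a minimum delay, the total wait can run somewhat longer than 100ms on a loaded machine, which is acceptable for a timeout path.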

Impact

Without the fix, the wq test will continue to flake intermittently on s390x, causing false CI failures that waste developer time investigating unrelated test breakage.

References

  • tools/testing/selftests/bpf/prog_tests/wq.c:31: the usleep(50) that is too short
  • kernel/bpf/helpers.c:3177: bpf_wq_start function
  • kernel/bpf/helpers.c:1200: bpf_wq_work workqueue callback
  • 82e38a505c98 ("selftests/bpf: Fix wq test."): original fix that acknowledged delayed callbacks
  • 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context"): refactoring that added overhead
