The sock_iter_batch/udp test flakes on s390x cpuv4 (and potentially other architectures) because the kernel's UDP auto-port assignment can assign the same port to both SO_REUSEPORT socket groups. When ports[0] == ports[1], the BPF program assigns idx=0 to all sockets, causing the second_idx assertion to fail in the second read loop.
Failure Details
- Test / Component: sock_iter_batch/udp (prog_tests/sock_iter_batch.c, do_test())
- Frequency: Observed in baseline CI (not PR-specific); likely rare, but reproducible whenever port auto-assignment collides
- Failure mode: Wrong result (flaky): "second_idx: actual 0 != expected 1"
- Affected architectures: s390x cpuv4 observed; any architecture is possible, depending on port randomization
- CI runs observed:
Root Cause Analysis
The test creates two groups of 4 SO_REUSEPORT UDP sockets (do_test at prog_tests/sock_iter_batch.c:866), each binding to auto-assigned port 0 via start_reuseport_server(). The auto-assigned ports are stored in skel->rodata->ports[0] and skel->rodata->ports[1] (line 890).
The kernel's udp_lib_lport_inuse() (net/ipv4/udp.c:141) builds a bitmap of "in use" ports for auto-assignment. However, when both the existing and new sockets have SO_REUSEPORT set and share the same UID, the existing port's bit is not set in the bitmap (lines 158-163):
```c
if (sk2->sk_reuseport && sk->sk_reuseport &&
    !rcu_access_pointer(sk->sk_reuseport_cb) &&
    uid_eq(uid, sk_uid(sk2))) {
	if (!bitmap)
		return 0;
	/* bit NOT set: port appears available */
}
```
This allows the second group's bind(port=0) to receive the same port as the first group.
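The deliberately simplified model below shows the effect of that skip. It is not kernel code. It replaces the real allocator's random start and per-hash-slot scan with a small linear port range and an explicit starting port. It assumes all sockets bind the same address and device, with SO_REUSEADDR unset. It also omits the sk_reuseport_cb check, which holds for a fresh socket that has not yet joined a group. PORT_LOW, PORT_RANGE, struct model_sock, and both functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PORT_LOW 40000 /* hypothetical auto-assignment range */
#define PORT_RANGE 16

/* Hypothetical model of a bound socket (not struct sock). */
struct model_sock {
	unsigned short port; /* 0 = not yet bound */
	bool reuseport;      /* SO_REUSEPORT set */
	unsigned int uid;    /* owning UID */
};

/* Mark the ports that look "in use" to the new socket sk. As in the
 * condition quoted above, an existing SO_REUSEPORT socket owned by the
 * same UID does not mark its port. */
static void build_inuse_bitmap(const struct model_sock *existing, size_t n,
			       const struct model_sock *sk,
			       bool inuse[PORT_RANGE])
{
	for (size_t i = 0; i < PORT_RANGE; i++)
		inuse[i] = false;
	for (size_t i = 0; i < n; i++) {
		const struct model_sock *sk2 = &existing[i];

		if (sk2->port < PORT_LOW || sk2->port >= PORT_LOW + PORT_RANGE)
			continue;
		if (sk2->reuseport && sk->reuseport && sk2->uid == sk->uid)
			continue; /* bit NOT set: port appears available */
		inuse[sk2->port - PORT_LOW] = true;
	}
}

/* Return the first unmarked port, scanning circularly from start (which
 * must lie in the range), or 0 if every port is marked. */
static unsigned short pick_port(const bool inuse[PORT_RANGE],
				unsigned short start)
{
	for (size_t k = 0; k < PORT_RANGE; k++) {
		unsigned short p = (unsigned short)(PORT_LOW +
			(start - PORT_LOW + k) % PORT_RANGE);

		if (!inuse[p - PORT_LOW])
			return p;
	}
	return 0;
}
```

Suppose the scan starts at the first group's port. A new SO_REUSEPORT socket from the same UID is then handed that same port. A socket without SO_REUSEPORT, or one owned by a different UID, moves on to the next free port.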
When ports[0] == ports[1], the BPF program iter_udp_soreuse (progs/sock_iter_batch.c:115-118) always takes the idx=0 branch (since sk->sk_num == ports[0] is checked first and is true for ALL sockets). The test then:
- Reads 3 sockets with idx=0 in the first loop → first_idx=0
- Closes fds[first_idx] (the first group)
- Sets second_idx = !first_idx = 1
- Reads the remaining sockets, which still report idx=0 (since ports[0] == ports[1])
- Fails the assertion ASSERT_EQ(outputs[i].idx, second_idx): 0 != 1
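The sequence above can be replayed in plain C. sock_idx() is a hypothetical stand-in for the branch in iter_udp_soreuse as the analysis describes it, with ports[0] compared before ports[1]. It is not copied from progs/sock_iter_batch.c. count_second_loop_mismatches() is likewise illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the BPF program's idx choice: ports[0] is checked
 * first, so a socket whose port matches both entries always gets idx 0. */
static int sock_idx(unsigned short sk_num, const unsigned short ports[2])
{
	if (sk_num == ports[0])
		return 0;
	if (sk_num == ports[1])
		return 1;
	return -1; /* not one of the test's sockets */
}

/* Replay the second read loop: every remaining socket is expected to
 * report second_idx = !first_idx. Returns how many would fail that check. */
static size_t count_second_loop_mismatches(const unsigned short *remaining,
					   size_t n,
					   const unsigned short ports[2],
					   int first_idx)
{
	int second_idx = !first_idx;
	size_t bad = 0;

	for (size_t i = 0; i < n; i++)
		if (sock_idx(remaining[i], ports) != second_idx)
			bad++;
	return bad;
}
```

With distinct ports, the sockets left after closing the idx=0 group all map to idx 1. With ports[0] == ports[1], every one of them maps to idx 0, so each would trip ASSERT_EQ(outputs[i].idx, second_idx).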
The existing bucket collision guard at line 957 only protects the total_read assertion, not the second_idx assertion at line 947.
Proposed Fix
Add an assertion that the two auto-assigned ports are distinct, immediately after socket creation (prog_tests/sock_iter_batch.c:891). See attached patch: 0001-selftests-bpf-fix-sock_iter_batch-flake-due-to-SO_RE.patch.
This converts the cryptic second_idx failure into a clear distinct_ports assertion, making the root cause immediately obvious. The collision is inherent to SO_REUSEPORT combined with automatic port assignment, and it cannot be prevented at the test level without using explicit ports.
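The patch itself is attached rather than reproduced here, so the following is only a stand-alone sketch of the guard's logic. check_distinct_ports() is a hypothetical helper. Its "distinct_ports" message simply echoes the assertion name used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical guard: report a collision between the two auto-assigned
 * ports under a descriptive name, instead of letting the later second_idx
 * check fail. Returns true if the ports are distinct. */
static bool check_distinct_ports(const unsigned short ports[2])
{
	if (ports[0] == ports[1]) {
		fprintf(stderr,
			"distinct_ports: both SO_REUSEPORT groups got port %u\n",
			(unsigned int)ports[0]);
		return false;
	}
	return true;
}
```

In do_test(), the equivalent check would presumably use the test_progs ASSERT_NEQ() macro on skel->rodata->ports[0] and skel->rodata->ports[1], and skip the rest of the subtest when it fails. The attached patch is the authoritative version.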
Impact
Without the fix, the sock_iter_batch/udp subtest intermittently fails in CI with a misleading second_idx error that looks like a BPF iterator bug. This creates noise in CI results and wastes developer time investigating iterator logic when the actual issue is port auto-assignment behavior.
References
- net/ipv4/udp.c:141 - udp_lib_lport_inuse() SO_REUSEPORT bitmap skip
- tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c:866 - do_test()
- tools/testing/selftests/bpf/progs/sock_iter_batch.c:115 - BPF program port comparison
- Commit dbd7db7787ba ("selftests/bpf: Test udp and tcp iter batching") - original test