[6.17] RDMA/core: Fix stale RoCE GIDs during netdev events at registration by nirmoy · Pull Request #358 · NVIDIA/NV-Kernels

nirmoy · 2026-04-14T13:18:05Z

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2148311

Summary

Cherry-pick of upstream commit 9af0fea.

RoCE GID entries become stale when netdev properties (e.g., MAC address)
change during the IB device registration window. This causes VFs to show
GIDs derived from an old/random MAC rather than the configured one,
leading to wrong GID index assignment.
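
To make "GIDs derived from an old/random MAC" concrete, here is a minimal userspace sketch of how a MAC address maps to a link-local GID. It assumes the default RoCE GID is the IPv6 link-local address built from the MAC using modified EUI-64, which is not stated in this PR. The helper names `mac_to_default_gid` and `gid_to_str` are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (assumption, not kernel code): build a link-local
 * GID as the fe80::/64 prefix followed by the modified EUI-64 of the MAC.
 */
void mac_to_default_gid(const uint8_t mac[6], uint8_t gid[16])
{
    assert(mac != NULL && gid != NULL);

    memset(gid, 0, 16);
    gid[0] = 0xfe;            /* fe80::/64 link-local prefix */
    gid[1] = 0x80;

    gid[8]  = mac[0] ^ 0x02;  /* flip the universal/local bit */
    gid[9]  = mac[1];
    gid[10] = mac[2];
    gid[11] = 0xff;           /* ff:fe is inserted in the middle */
    gid[12] = 0xfe;
    gid[13] = mac[3];
    gid[14] = mac[4];
    gid[15] = mac[5];
}

/* Hypothetical helper: format a GID as eight 16-bit hex groups (39 chars + NUL). */
void gid_to_str(const uint8_t gid[16], char out[40])
{
    for (int i = 0; i < 8; i++)
        snprintf(out + i * 5, (size_t)(40 - i * 5), "%02x%02x%s",
                 gid[2 * i], gid[2 * i + 1], i < 7 ? ":" : "");
}
```

Under that assumption, the MAC 88:22:33:44:55:66 from the reproducer maps to fe80:0000:0000:0000:8a22:33ff:fe44:5566. A GID still derived from the VF's initial random MAC would differ in its last eight bytes, which is the symptom described above.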

The root cause is a race: ib_enum_all_roce_netdevs() only iterates
devices marked DEVICE_REGISTERED, but that mark is set late — after
the GID cache is already populated. NETDEV_CHANGEADDR events arriving
in this window are silently dropped.

The fix introduces a new xarray mark, DEVICE_GID_UPDATES, which is set
immediately after GID table allocation. ib_enum_all_roce_netdevs() now
iterates on this mark instead of DEVICE_REGISTERED, so netdev events that
arrive during registration are no longer lost. The same change fixes the
equivalent race for IP address events.
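
The interleaving can be replayed deterministically in a small userspace model. This is only a sketch of the ordering described in this PR, not the kernel implementation. The struct, the `MARK_*` constants and both function names are hypothetical stand-ins, and MAC addresses are reduced to small integers for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two xarray marks named in this PR. */
enum {
    MARK_REGISTERED  = 1u << 0,  /* models DEVICE_REGISTERED  */
    MARK_GID_UPDATES = 1u << 1,  /* models DEVICE_GID_UPDATES */
};

enum { OLD_MAC = 1, NEW_MAC = 2 };

struct model_dev {
    unsigned marks;        /* marks currently set on the device      */
    bool gid_table_ready;  /* GID table has been allocated           */
    int netdev_mac;        /* MAC currently set on the netdev        */
    int gid_mac;           /* MAC the cached GID was derived from    */
};

/*
 * Models the event path: only devices carrying `filter` are visited.
 * Returns false when the device is skipped and the event is lost.
 */
bool handle_changeaddr(struct model_dev *dev, unsigned filter)
{
    if (!(dev->marks & filter))
        return false;
    assert(dev->gid_table_ready);    /* safe: table exists before the mark */
    dev->gid_mac = dev->netdev_mac;  /* refresh the stale GID */
    return true;
}

/* Replays the registration sequence; returns true if the GID ends up stale. */
bool registration_leaves_stale_gid(bool with_fix)
{
    struct model_dev dev = { .netdev_mac = OLD_MAC };
    unsigned filter = with_fix ? MARK_GID_UPDATES : MARK_REGISTERED;

    dev.gid_table_ready = true;          /* GID table allocated */
    if (with_fix)
        dev.marks |= MARK_GID_UPDATES;   /* new mark, set right after allocation */
    dev.gid_mac = dev.netdev_mac;        /* initial rescan uses the old MAC */

    dev.netdev_mac = NEW_MAC;            /* udev changes the MAC ... */
    handle_changeaddr(&dev, filter);     /* ... and the event is processed */

    dev.marks |= MARK_REGISTERED;        /* registration completes late */
    return dev.gid_mac != dev.netdev_mac;
}
```

In this model, `registration_leaves_stale_gid(false)` returns true because the event handler skips the not-yet-registered device, while `registration_leaves_stale_gid(true)` returns false because the early mark makes the device visible to the handler. The model is single-threaded and does not represent the GID table mutex that serializes the real rescan and event handlers.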

Upstream commit: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=9af0feae8016ba

RoCE GID entries become stale when netdev properties change during the
IB device registration window. This is reproducible with a udev rule
that sets a MAC address when a VF netdev appears:

    ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth4", \
      RUN+="/sbin/ip link set eth4 address 88:22:33:44:55:66"

After VF creation, show_gids displays GIDs derived from the original
random MAC rather than the configured one.

The root cause is a race between netdev event processing and device
registration:

    CPU 0 (driver)                      CPU 1 (udev/workqueue)
    ──────────────                      ──────────────────────
    ib_register_device()
      ib_cache_setup_one()
        gid_table_setup_one()
          _gid_table_setup_one()
            ← GID table allocated
          rdma_roce_rescan_device()
            ← GIDs populated with
              OLD MAC
                                        ip link set eth4 addr NEW_MAC
                                        NETDEV_CHANGEADDR queued
                                        netdevice_event_work_handler()
                                          ib_enum_all_roce_netdevs()
                                            ← Iterates DEVICE_REGISTERED
                                            ← Device NOT marked yet, SKIP!
      enable_device_and_get()
        xa_set_mark(DEVICE_REGISTERED)
          ← Too late, event was lost

The netdev event handler uses ib_enum_all_roce_netdevs() which only
iterates devices marked DEVICE_REGISTERED. However, this mark is set
late in the registration process, after the GID cache is already
populated. Events arriving in this window are silently dropped.

Fix this by introducing a new xarray mark DEVICE_GID_UPDATES that is
set immediately after the GID table is allocated and initialized. Use
the new mark in ib_enum_all_roce_netdevs() function to iterate devices
instead of DEVICE_REGISTERED.

This is safe because:
- After _gid_table_setup_one(), all required structures exist
  (port_data, immutable, cache.gid)
- The GID table mutex serializes concurrent access between the initial
  rescan and event handlers
- Event handlers correctly update stale GIDs even when racing with rescan
- The mark is cleared in ib_cache_cleanup_one() before teardown

This also fixes similar races for IP address events (inetaddr_event,
inet6addr_event) which use the same enumeration path.

Fixes: 0df91bb ("RDMA/devices: Use xarray to store the client_data")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260127093839.126291-1-jiri@resnulli.us
Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84
Signed-off-by: Leon Romanovsky <leon@kernel.org>
(cherry picked from commit 9af0fea)
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>

nvmochs · 2026-04-14T14:47:15Z

This patch went into v7.0, so no need to duplicate this backport for our 7.0 branches. I verified that there are currently no upstream fixes for this patch and that the patch picked clean.

LGTM!

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

clsotog

Acked-by: Carol L Soto <csoto@nvidia.com>

nvmochs · 2026-04-15T20:11:47Z

Merged and present in https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-nvidia/+git/noble/log/?h=nvidia-6.17-next

Closing PR.

nirmoy mentioned this pull request Apr 14, 2026

[6.14] RDMA/core: Fix stale RoCE GIDs during netdev events at registration #359

Closed

clsotog self-requested a review April 14, 2026 15:28

clsotog approved these changes Apr 14, 2026

nvmochs closed this Apr 15, 2026

nirmoy wants to merge 1 commit into NVIDIA:24.04_linux-nvidia-6.17-next from nirmoy:mlx_fix
