Fix empty-Wlid scan dispatch in operator SBOM watcher by yugal07 · Pull Request #379 · kubescape/operator

yugal07 · 2026-05-27T12:57:10Z

Summary

When an SBOMSyft event was processed before the owning Pod had populated ImageToContainerData — either during the startup race (initial SBOM list paged before the pod informer warmed up) or unconditionally when ServiceDiscovery.Enabled == false — the operator dispatched a TypeScanImages command with Wlid="". kubevuln then ran the scan but its if workload.Wlid != "" guard silently dropped the result from the platform CVE submission path. No log above info was emitted, so the lost attribution was invisible to operators.

This PR closes the race at the source: empty Wlid is now a validation failure, and SBOMs that hit it are parked and retried with bounded exponential backoff rather than dispatched as malformed scans.

Changes

operator/watcher/watchhandler.go

Added ErrMissingWlid (alias of existing ErrMissingWLID) for the new validation branch.
Added sbomRetryAttempts maps.SafeMap[string, int] to track per-SBOM retry counts.

operator/watcher/sbomwatcher.go

validateContainerData now returns ErrMissingWlid when Wlid == "".
HandleSBOMEvents treats ErrMissingWlid as transient: re-enqueues the SBOM after a delay that doubles each attempt (2s → 4s → 8s → … capped at 60s) for up to 5 attempts, then drops with a warning log + ErrMissingWlid on errorCh. Retry bookkeeping is cleared on success and on exhaustion to avoid leaks.
Backoff parameters live in package-level vars (sbomRetryInitialDelay, sbomRetryMaxDelay, sbomRetryMaxAttempts) so tests can shorten them.

operator/watcher/sbomwatcher_test.go

Updated TestHandleSBOMEvents to seed ImageToContainerData so the existing cases now assert the correct Wlid is carried on the command (the prior expectedCommands literally encoded the bug — empty Wlid on dispatched commands).
New TestValidateContainerData — table-driven coverage of all four validation branches, including the #378 regression case.
New TestSBOMRetryBackoff — verifies the doubling schedule, the maxDelay cap, and that absurd attempt counts (e.g. 64, which would overflow the shift) stay capped instead of returning a non-positive duration.
New TestHandleSBOMEvents_WlidArrivesLate — reproduces the race: SBOM arrives first with empty map; pod data lands mid-retry; asserts the eventual scan command carries the correct Wlid and that retry bookkeeping is cleared.
New TestHandleSBOMEvents_WlidNeverArrives_ExhaustsRetries — pod data never arrives; asserts retries are bounded, ErrMissingWlid is surfaced exactly once, no scan command is ever dispatched, and retry bookkeeping is cleared.

Fixes

#378

Reproduction

Verified end-to-end with a temporary repro test that uses only the pre-fix API:

Code state	Repro outcome	Meaning
Unfixed (fix stashed)	PASS — asserts dispatched command has `Wlid=""`	Bug reproduced: malformed scan dispatched
Fixed (this PR)	FAIL — "no scan command and no error within 2s — handler is parking the event"	SBOM is parked for retry instead of dispatched as malformed

The repro file was deleted after verification; the two new permanent regression tests above cover the same paths.

Test plan

go build ./... clean
go vet ./watcher/ clean
go test ./watcher/ -count=1 — full watcher suite green (~91s)
Manually reproduced the bug against unfixed code (repro PASS) and confirmed the fix closes it (repro FAIL, with diagnostic "handler is parking the event")

Notes for reviewers

The deeper "list pods first, then SBOMs" startup-ordering fix suggested in the issue is not included here, it would require restructuring SBOMWatch and waiting on a pod-informer sync signal that doesn't exist today. The current fix is sufficient to eliminate the empty-Wlid dispatch and is independently valuable even if a later PR also fixes the ordering. I'd dedicate another few days on that long term solution after my exams 😄
ErrMissingWlid is intentionally aliased to the existing ErrMissingWLID rather than introduced as a separate error, so callers checking either spelling will work.
The FIXME at sbomwatcher.go:82 (which the issue called out specifically) is now correct in spirit: the race still exists at the select, but it no longer produces malformed scans, the validator + retry loop absorb it.

Summary by CodeRabbit

Release Notes

New Features
- SBOM scanning now handles missing workload identification data gracefully, retrying with exponential backoff instead of immediate failure
- Maximum retry limits prevent infinite retry loops
Tests
- Substantially improved test coverage for SBOM event processing, validation, and retry behavior

…nqueue with bounded exponential backoff Signed-off-by: yugal07 <yashsadhwani544@gmail.com>

coderabbitai · 2026-05-27T12:58:20Z

📝 Walkthrough

Walkthrough

This PR implements bounded exponential backoff retry for SBOM events when Workload ID (WLID) data is not yet available. When validation fails due to missing WLID, events are re-enqueued after a configurable delay that doubles per attempt, up to a maximum retry limit; on exhaustion, the error is emitted and retry state is cleaned up to prevent leaks.

Changes

SBOM WLID Retry and Backoff Support

Layer / File(s)	Summary
Error contract and retry state tracking `watcher/watchhandler.go`, `watcher/sbomwatcher.go`	`ErrMissingWlid` alias and `sbomRetryAttempts` map added to `WatchHandler` for tracking per-SBOM retry counts; `errors` package imported to support error comparison.
Retry configuration and validation contract `watcher/sbomwatcher.go`	Package-level retry bounds (`sbomRetryInitialDelay`, `sbomRetryMaxDelay`, `sbomRetryMaxAttempts`) define exponential backoff behavior; `validateContainerData` now requires non-empty `Wlid` field; `sbomRetryBackoff(attempt)` computes delay via left-shift doubling capped at maximum.
SBOM event retry and backoff processing `watcher/sbomwatcher.go`	When validation fails with `ErrMissingWlid`, event is re-enqueued after backoff-computed delay with incremented attempt counter; once retry limit exhausted, error is emitted and retry state cleared; on successful validation, retry bookkeeping for that SBOM is cleared.
Test infrastructure and retry scenario coverage `watcher/sbomwatcher_test.go`	Test constants and refactored `TestHandleSBOMEvents` with `seedContainerData` support; new helper `newTestHandler`; assertions on storage objects and command dispatch; new tests validate container data gate, exponential backoff schedule, WLID-arrives-late scenario (eventual dispatch with cleanup), and WLID-never-arrives scenario (bounded exhaustion with no malformed commands and retry cleanup).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Operator SBOM watcher dispatches TypeScanImages commands with empty Wlid when SBOM event beats Pod event, vuln results lose workload attribution #378: This PR directly implements the retrieved issue's suggested fixes—adding ErrMissingWlid, requiring non-empty Wlid in validateContainerData, and implementing per-SBOM retry/backoff with bookkeeping—to handle delayed WLID availability in SBOM events.

Poem

🐰 A tale of retries and backoff so fair,
When WLID arrives just a bit slow through the air,
We wait and we double, with patience and grace,
Till data arrives and completes the embrace! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Fix empty-Wlid scan dispatch in operator SBOM watcher' directly and clearly describes the main change: fixing incorrect dispatch behavior when WLID is empty, which is the core issue addressed by the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

watcher/watchhandler.go (1)
41-49: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Initialize sbomRetryAttempts in NewWatchHandler (avoid maps.SafeMap zero-value).

goradd/maps SafeMap is not safe to use from its zero value (internal map is nil; it’s intended to be constructed via new/constructor). Since NewWatchHandler doesn’t initialize sbomRetryAttempts, it can panic when accessed later in watcher/watchhandler.go.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@watcher/watchhandler.go` around lines 41 - 49, NewWatchHandler currently
leaves the WatchHandler.sbomRetryAttempts as the SafeMap zero-value which can
panic; in NewWatchHandler initialize sbomRetryAttempts using the SafeMap
constructor (e.g. maps.NewSafeMap[KeyType,ValueType]() or
new(maps.SafeMap[KeyType,ValueType]) depending on project conventions) so the
internal map is allocated before use; update the constructor in NewWatchHandler
to set sbomRetryAttempts to the newly created SafeMap alongside WlidAndImageID
and other fields and reference the WatchHandler struct and sbomRetryAttempts
field when making the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@watcher/watchhandler.go`:
- Around line 41-49: NewWatchHandler currently leaves the
WatchHandler.sbomRetryAttempts as the SafeMap zero-value which can panic; in
NewWatchHandler initialize sbomRetryAttempts using the SafeMap constructor (e.g.
maps.NewSafeMap[KeyType,ValueType]() or new(maps.SafeMap[KeyType,ValueType])
depending on project conventions) so the internal map is allocated before use;
update the constructor in NewWatchHandler to set sbomRetryAttempts to the newly
created SafeMap alongside WlidAndImageID and other fields and reference the
WatchHandler struct and sbomRetryAttempts field when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a655ae84-11fb-4a76-b4a1-ec10663500a6

📥 Commits

Reviewing files that changed from the base of the PR and between 8ca96a7 and 813f946.

📒 Files selected for processing (3)

watcher/sbomwatcher.go
watcher/sbomwatcher_test.go
watcher/watchhandler.go

yugal07 · 2026-05-27T13:03:17Z

@coderabbitai review

coderabbitai · 2026-05-27T13:03:27Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

matthyx

perfect, thanks @yugal07

fix(sbomwatcher): reject empty Wlid in validateContainerData and re-e…

813f946

…nqueue with bounded exponential backoff Signed-off-by: yugal07 <yashsadhwani544@gmail.com>

matthyx added this to KS PRs tracking May 27, 2026

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

matthyx moved this to Needs Reviewer in KS PRs tracking May 27, 2026

matthyx approved these changes May 27, 2026

View reviewed changes

matthyx added the release Create release label May 27, 2026

matthyx merged commit a5313f3 into kubescape:main May 27, 2026
11 checks passed

matthyx moved this from Needs Reviewer to To Archive in KS PRs tracking May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix empty-Wlid scan dispatch in operator SBOM watcher#379

Fix empty-Wlid scan dispatch in operator SBOM watcher#379
matthyx merged 1 commit into
kubescape:mainfrom
yugal07:378

yugal07 commented May 27, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 27, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

yugal07 commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Uh oh!

matthyx left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yugal07 commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Fixes

Reproduction

Test plan

Notes for reviewers

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

yugal07 commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Uh oh!

matthyx left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yugal07 commented May 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading