Skip to content

Fix empty-Wlid scan dispatch in operator SBOM watcher#379

Merged
matthyx merged 1 commit into
kubescape:mainfrom
yugal07:378
May 27, 2026
Merged

Fix empty-Wlid scan dispatch in operator SBOM watcher#379
matthyx merged 1 commit into
kubescape:mainfrom
yugal07:378

Conversation

@yugal07
Copy link
Copy Markdown
Contributor

@yugal07 yugal07 commented May 27, 2026

Summary

When an SBOMSyft event was processed before the owning Pod had populated ImageToContainerData — either during the startup race (initial SBOM list paged before the pod informer warmed up) or unconditionally when ServiceDiscovery.Enabled == false — the operator dispatched a TypeScanImages command with Wlid="". kubevuln then ran the scan but its if workload.Wlid != "" guard silently dropped the result from the platform CVE submission path. No log above info was emitted, so the lost attribution was invisible to operators.

This PR closes the race at the source: empty Wlid is now a validation failure, and SBOMs that hit it are parked and retried with bounded exponential backoff rather than dispatched as malformed scans.

Changes

operator/watcher/watchhandler.go

  • Added ErrMissingWlid (alias of existing ErrMissingWLID) for the new validation branch.
  • Added sbomRetryAttempts maps.SafeMap[string, int] to track per-SBOM retry counts.

operator/watcher/sbomwatcher.go

  • validateContainerData now returns ErrMissingWlid when Wlid == "".
  • HandleSBOMEvents treats ErrMissingWlid as transient: re-enqueues the SBOM after a delay that doubles each attempt (2s → 4s → 8s → … capped at 60s) for up to 5 attempts, then drops with a warning log + ErrMissingWlid on errorCh. Retry bookkeeping is cleared on success and on exhaustion to avoid leaks.
  • Backoff parameters live in package-level vars (sbomRetryInitialDelay, sbomRetryMaxDelay, sbomRetryMaxAttempts) so tests can shorten them.

operator/watcher/sbomwatcher_test.go

  • Updated TestHandleSBOMEvents to seed ImageToContainerData so the existing cases now assert the correct Wlid is carried on the command (the prior expectedCommands literally encoded the bug — empty Wlid on dispatched commands).
  • New TestValidateContainerData — table-driven coverage of all four validation branches, including the #378 regression case.
  • New TestSBOMRetryBackoff — verifies the doubling schedule, the maxDelay cap, and that absurd attempt counts (e.g. 64, which would overflow the shift) stay capped instead of returning a non-positive duration.
  • New TestHandleSBOMEvents_WlidArrivesLate — reproduces the race: SBOM arrives first with empty map; pod data lands mid-retry; asserts the eventual scan command carries the correct Wlid and that retry bookkeeping is cleared.
  • New TestHandleSBOMEvents_WlidNeverArrives_ExhaustsRetries — pod data never arrives; asserts retries are bounded, ErrMissingWlid is surfaced exactly once, no scan command is ever dispatched, and retry bookkeeping is cleared.

Fixes

#378

Reproduction

Verified end-to-end with a temporary repro test that uses only the pre-fix API:

Code state Repro outcome Meaning
Unfixed (fix stashed) PASS — asserts dispatched command has Wlid="" Bug reproduced: malformed scan dispatched
Fixed (this PR) FAIL — "no scan command and no error within 2s — handler is parking the event" SBOM is parked for retry instead of dispatched as malformed

The repro file was deleted after verification; the two new permanent regression tests above cover the same paths.

Test plan

  • go build ./... clean
  • go vet ./watcher/ clean
  • go test ./watcher/ -count=1 — full watcher suite green (~91s)
  • Manually reproduced the bug against unfixed code (repro PASS) and confirmed the fix closes it (repro FAIL, with diagnostic "handler is parking the event")

Notes for reviewers

  • The deeper "list pods first, then SBOMs" startup-ordering fix suggested in the issue is not included here, it would require restructuring SBOMWatch and waiting on a pod-informer sync signal that doesn't exist today. The current fix is sufficient to eliminate the empty-Wlid dispatch and is independently valuable even if a later PR also fixes the ordering. I'd dedicate another few days on that long term solution after my exams 😄
  • ErrMissingWlid is intentionally aliased to the existing ErrMissingWLID rather than introduced as a separate error, so callers checking either spelling will work.
  • The FIXME at sbomwatcher.go:82 (which the issue called out specifically) is now correct in spirit: the race still exists at the select, but it no longer produces malformed scans, the validator + retry loop absorb it.

Summary by CodeRabbit

Release Notes

  • New Features

    • SBOM scanning now handles missing workload identification data gracefully, retrying with exponential backoff instead of immediate failure
    • Maximum retry limits prevent infinite retry loops
  • Tests

    • Substantially improved test coverage for SBOM event processing, validation, and retry behavior

Review Change Stack

…nqueue with bounded exponential backoff

Signed-off-by: yugal07 <yashsadhwani544@gmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

📝 Walkthrough

Walkthrough

This PR implements bounded exponential backoff retry for SBOM events when Workload ID (WLID) data is not yet available. When validation fails due to missing WLID, events are re-enqueued after a configurable delay that doubles per attempt, up to a maximum retry limit; on exhaustion, the error is emitted and retry state is cleaned up to prevent leaks.

Changes

SBOM WLID Retry and Backoff Support

Layer / File(s) Summary
Error contract and retry state tracking
watcher/watchhandler.go, watcher/sbomwatcher.go
ErrMissingWlid alias and sbomRetryAttempts map added to WatchHandler for tracking per-SBOM retry counts; errors package imported to support error comparison.
Retry configuration and validation contract
watcher/sbomwatcher.go
Package-level retry bounds (sbomRetryInitialDelay, sbomRetryMaxDelay, sbomRetryMaxAttempts) define exponential backoff behavior; validateContainerData now requires non-empty Wlid field; sbomRetryBackoff(attempt) computes delay via left-shift doubling capped at maximum.
SBOM event retry and backoff processing
watcher/sbomwatcher.go
When validation fails with ErrMissingWlid, event is re-enqueued after backoff-computed delay with incremented attempt counter; once retry limit exhausted, error is emitted and retry state cleared; on successful validation, retry bookkeeping for that SBOM is cleared.
Test infrastructure and retry scenario coverage
watcher/sbomwatcher_test.go
Test constants and refactored TestHandleSBOMEvents with seedContainerData support; new helper newTestHandler; assertions on storage objects and command dispatch; new tests validate container data gate, exponential backoff schedule, WLID-arrives-late scenario (eventual dispatch with cleanup), and WLID-never-arrives scenario (bounded exhaustion with no malformed commands and retry cleanup).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Poem

🐰 A tale of retries and backoff so fair,
When WLID arrives just a bit slow through the air,
We wait and we double, with patience and grace,
Till data arrives and completes the embrace! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix empty-Wlid scan dispatch in operator SBOM watcher' directly and clearly describes the main change: fixing incorrect dispatch behavior when WLID is empty, which is the core issue addressed by the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
watcher/watchhandler.go (1)

41-49: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Initialize sbomRetryAttempts in NewWatchHandler (avoid maps.SafeMap zero-value).

goradd/maps SafeMap is not safe to use from its zero value (internal map is nil; it’s intended to be constructed via new/constructor). Since NewWatchHandler doesn’t initialize sbomRetryAttempts, it can panic when accessed later in watcher/watchhandler.go.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@watcher/watchhandler.go` around lines 41 - 49, NewWatchHandler currently
leaves the WatchHandler.sbomRetryAttempts as the SafeMap zero-value which can
panic; in NewWatchHandler initialize sbomRetryAttempts using the SafeMap
constructor (e.g. maps.NewSafeMap[KeyType,ValueType]() or
new(maps.SafeMap[KeyType,ValueType]) depending on project conventions) so the
internal map is allocated before use; update the constructor in NewWatchHandler
to set sbomRetryAttempts to the newly created SafeMap alongside WlidAndImageID
and other fields and reference the WatchHandler struct and sbomRetryAttempts
field when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@watcher/watchhandler.go`:
- Around line 41-49: NewWatchHandler currently leaves the
WatchHandler.sbomRetryAttempts as the SafeMap zero-value which can panic; in
NewWatchHandler initialize sbomRetryAttempts using the SafeMap constructor (e.g.
maps.NewSafeMap[KeyType,ValueType]() or new(maps.SafeMap[KeyType,ValueType])
depending on project conventions) so the internal map is allocated before use;
update the constructor in NewWatchHandler to set sbomRetryAttempts to the newly
created SafeMap alongside WlidAndImageID and other fields and reference the
WatchHandler struct and sbomRetryAttempts field when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a655ae84-11fb-4a76-b4a1-ec10663500a6

📥 Commits

Reviewing files that changed from the base of the PR and between 8ca96a7 and 813f946.

📒 Files selected for processing (3)
  • watcher/sbomwatcher.go
  • watcher/sbomwatcher_test.go
  • watcher/watchhandler.go

@yugal07
Copy link
Copy Markdown
Contributor Author

yugal07 commented May 27, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@matthyx matthyx moved this to Needs Reviewer in KS PRs tracking May 27, 2026
Copy link
Copy Markdown
Contributor

@matthyx matthyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect, thanks @yugal07

@matthyx matthyx added the release Create release label May 27, 2026
@matthyx matthyx merged commit a5313f3 into kubescape:main May 27, 2026
11 checks passed
@matthyx matthyx moved this from Needs Reviewer to To Archive in KS PRs tracking May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release Create release

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants