Fix empty-Wlid scan dispatch in operator SBOM watcher#379
Conversation
…nqueue with bounded exponential backoff Signed-off-by: yugal07 <yashsadhwani544@gmail.com>
📝 WalkthroughWalkthroughThis PR implements bounded exponential backoff retry for SBOM events when Workload ID (WLID) data is not yet available. When validation fails due to missing WLID, events are re-enqueued after a configurable delay that doubles per attempt, up to a maximum retry limit; on exhaustion, the error is emitted and retry state is cleaned up to prevent leaks. ChangesSBOM WLID Retry and Backoff Support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Tools execution failed with the following error: Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error) Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
watcher/watchhandler.go (1)
41-49:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winInitialize
sbomRetryAttemptsinNewWatchHandler(avoidmaps.SafeMapzero-value).
goradd/mapsSafeMapis not safe to use from its zero value (internal map is nil; it’s intended to be constructed vianew/constructor). SinceNewWatchHandlerdoesn’t initializesbomRetryAttempts, it can panic when accessed later inwatcher/watchhandler.go.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@watcher/watchhandler.go` around lines 41 - 49, NewWatchHandler currently leaves the WatchHandler.sbomRetryAttempts as the SafeMap zero-value which can panic; in NewWatchHandler initialize sbomRetryAttempts using the SafeMap constructor (e.g. maps.NewSafeMap[KeyType,ValueType]() or new(maps.SafeMap[KeyType,ValueType]) depending on project conventions) so the internal map is allocated before use; update the constructor in NewWatchHandler to set sbomRetryAttempts to the newly created SafeMap alongside WlidAndImageID and other fields and reference the WatchHandler struct and sbomRetryAttempts field when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@watcher/watchhandler.go`:
- Around line 41-49: NewWatchHandler currently leaves the
WatchHandler.sbomRetryAttempts as the SafeMap zero-value which can panic; in
NewWatchHandler initialize sbomRetryAttempts using the SafeMap constructor (e.g.
maps.NewSafeMap[KeyType,ValueType]() or new(maps.SafeMap[KeyType,ValueType])
depending on project conventions) so the internal map is allocated before use;
update the constructor in NewWatchHandler to set sbomRetryAttempts to the newly
created SafeMap alongside WlidAndImageID and other fields and reference the
WatchHandler struct and sbomRetryAttempts field when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a655ae84-11fb-4a76-b4a1-ec10663500a6
📒 Files selected for processing (3)
watcher/sbomwatcher.gowatcher/sbomwatcher_test.gowatcher/watchhandler.go
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Summary
When an
SBOMSyftevent was processed before the owning Pod had populatedImageToContainerData— either during the startup race (initial SBOM list paged before the pod informer warmed up) or unconditionally whenServiceDiscovery.Enabled == false— the operator dispatched aTypeScanImagescommand withWlid="".kubevulnthen ran the scan but itsif workload.Wlid != ""guard silently dropped the result from the platform CVE submission path. No log aboveinfowas emitted, so the lost attribution was invisible to operators.This PR closes the race at the source: empty
Wlidis now a validation failure, and SBOMs that hit it are parked and retried with bounded exponential backoff rather than dispatched as malformed scans.Changes
operator/watcher/watchhandler.goErrMissingWlid(alias of existingErrMissingWLID) for the new validation branch.sbomRetryAttempts maps.SafeMap[string, int]to track per-SBOM retry counts.operator/watcher/sbomwatcher.govalidateContainerDatanow returnsErrMissingWlidwhenWlid == "".HandleSBOMEventstreatsErrMissingWlidas transient: re-enqueues the SBOM after a delay that doubles each attempt (2s → 4s → 8s → … capped at 60s) for up to 5 attempts, then drops with awarninglog +ErrMissingWlidonerrorCh. Retry bookkeeping is cleared on success and on exhaustion to avoid leaks.vars (sbomRetryInitialDelay,sbomRetryMaxDelay,sbomRetryMaxAttempts) so tests can shorten them.operator/watcher/sbomwatcher_test.goTestHandleSBOMEventsto seedImageToContainerDataso the existing cases now assert the correctWlidis carried on the command (the priorexpectedCommandsliterally encoded the bug — emptyWlidon dispatched commands).TestValidateContainerData— table-driven coverage of all four validation branches, including the#378regression case.TestSBOMRetryBackoff— verifies the doubling schedule, themaxDelaycap, and that absurd attempt counts (e.g. 64, which would overflow the shift) stay capped instead of returning a non-positive duration.TestHandleSBOMEvents_WlidArrivesLate— reproduces the race: SBOM arrives first with empty map; pod data lands mid-retry; asserts the eventual scan command carries the correctWlidand that retry bookkeeping is cleared.TestHandleSBOMEvents_WlidNeverArrives_ExhaustsRetries— pod data never arrives; asserts retries are bounded,ErrMissingWlidis surfaced exactly once, no scan command is ever dispatched, and retry bookkeeping is cleared.Fixes
#378
Reproduction
Verified end-to-end with a temporary repro test that uses only the pre-fix API:
Wlid=""The repro file was deleted after verification; the two new permanent regression tests above cover the same paths.
Test plan
go build ./...cleango vet ./watcher/cleango test ./watcher/ -count=1— full watcher suite green (~91s)Notes for reviewers
SBOMWatchand waiting on a pod-informer sync signal that doesn't exist today. The current fix is sufficient to eliminate the empty-Wliddispatch and is independently valuable even if a later PR also fixes the ordering. I'd dedicate another few days on that long term solution after my exams 😄ErrMissingWlidis intentionally aliased to the existingErrMissingWLIDrather than introduced as a separate error, so callers checking either spelling will work.sbomwatcher.go:82(which the issue called out specifically) is now correct in spirit: the race still exists at theselect, but it no longer produces malformed scans, the validator + retry loop absorb it.Summary by CodeRabbit
Release Notes
New Features
Tests