fix: bypass informer cache for ClowdApp read in CJI controller#1760
Open
rodrigonull wants to merge 1 commit intoRedHatInsights:masterfrom
Open
fix: bypass informer cache for ClowdApp read in CJI controller#1760rodrigonull wants to merge 1 commit intoRedHatInsights:masterfrom
rodrigonull wants to merge 1 commit intoRedHatInsights:masterfrom
Conversation
The CJI controller was reading the ClowdApp from the informer cache, which could serve a stale but internally consistent object (where both metadata.generation and status.generation matched at old values). This caused runOnNotReady CJI jobs to be created with an old image when a ClowdApp update and CJI creation were applied simultaneously. Switch to using an APIReader (direct etcd read) for the ClowdApp fetch so the controller always sees the latest spec. Combined with the existing generation check, this ensures the CJI waits for the ClowdApp controller to reconcile the new spec before creating jobs. Also fixes namespace and path bugs in the KUTTL test script 02-json-asserts.sh introduced in the previous PR.
65f71b7 to
72842c0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
PR #1742 introduced a generation check to prevent
runOnNotReadyCJI jobs from using stale ClowdApp specs. However, the fix was insufficient because the CJI controller reads the ClowdApp from the informer cache, which can serve an entirely stale but internally consistent object — where bothmetadata.generationandstatus.generationmatch at old values. The generation check passes, and the job is created with the old image.We confirmed this in stage: CJI
run-db-migrations-90cb9b7was created for image90cb9b7, but the resulting Job used old image1f8810c. Both the CJI and Job share the same creation timestamp (22:01:45Z) — the controller never requeued, proving the cache served stale data that passed the generation check.Solution
Replace
r.Get()(informer cache) withr.APIReader.Get()(direct API server / etcd read) when fetching the ClowdApp in the CJI controller. This guarantees the controller sees the latest spec.Combined with the existing generation check, the flow is now:
generation=N,status.generation=N-1)status.generation=NChanges
clowdjobinvocation_controller.go: AddAPIReader client.Readerfield; user.APIReader.Get()for the ClowdApp fetchrun.go: Wiremgr.GetAPIReader()into the CJI reconcilerclowdjobinvocation_controller_test.go: New unit tests covering generation check (match, mismatch, backward compat), APIReader vs cache behavior, missing app, and early-exit paths02-json-asserts.sh: Fix namespace and path bugs introduced in fix: prevent CJI with runOnNotReady from using stale job image #1742Test plan
go vet ./controllers/cloud.redhat.com/...passesgo test -run TestCJI ./controllers/cloud.redhat.com/— all 5 tests passmake test— full suite passes (pre-existingcovdatatool error only)