fix: bypass informer cache for ClowdApp read in CJI controller #1760

Open
rodrigonull wants to merge 1 commit into RedHatInsights:master from rodrigonull:fix/cji-bypass-informer-cache
@rodrigonull
Member

Problem

PR #1742 introduced a generation check to prevent runOnNotReady CJI jobs from using stale ClowdApp specs. That fix was insufficient because the CJI controller reads the ClowdApp from the informer cache, which can serve a stale but internally consistent object: one where metadata.generation and status.generation agree with each other but both still hold old values. The generation check then passes, and the job is created with the old image.

We confirmed this in stage: CJI run-db-migrations-90cb9b7 was created for image 90cb9b7, but the resulting Job used old image 1f8810c. The CJI and the Job share the same creation timestamp (22:01:45Z), so the controller never requeued. This shows that the cache served stale data that passed the generation check.

Solution

Replace r.Get() (informer cache) with r.APIReader.Get() (direct API server / etcd read) when fetching the ClowdApp in the CJI controller. This guarantees the controller sees the latest spec.
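
The difference between the two read paths can be illustrated with a small standard-library model. This is only a sketch: App, Reader, mapReader, Reconciler, generationsMatch and newExample are hypothetical stand-ins for illustration, not the operator's types or the controller-runtime API, and the generation numbers 1 and 2 are made up. Only the two image tags come from the stage incident described above.

```go
package main

import "fmt"

// App is a hypothetical stand-in for the ClowdApp fields that matter here.
type App struct {
	Generation       int64  // models metadata.generation
	StatusGeneration int64  // models status.generation
	Image            string // image the resulting Job would run
}

// Reader is a simplified, hypothetical analogue of a read-only client.
type Reader interface {
	Get(name string) (App, bool)
}

// mapReader serves objects from an in-memory map.
type mapReader map[string]App

func (m mapReader) Get(name string) (App, bool) {
	app, ok := m[name]
	return app, ok
}

// Reconciler holds both read paths: a cache that may lag behind,
// and a direct reader that always reflects the latest write.
type Reconciler struct {
	Cache     Reader
	APIReader Reader
}

// generationsMatch models the generation check added in #1742.
func generationsMatch(app App) bool {
	return app.Generation == app.StatusGeneration
}

// newExample builds the stale-cache scenario: the cache still holds the
// old object, while the direct reader already holds the updated spec.
func newExample() Reconciler {
	return Reconciler{
		Cache:     mapReader{"app": {Generation: 1, StatusGeneration: 1, Image: "1f8810c"}},
		APIReader: mapReader{"app": {Generation: 2, StatusGeneration: 1, Image: "90cb9b7"}},
	}
}

func main() {
	r := newExample()
	stale, _ := r.Cache.Get("app")
	fresh, _ := r.APIReader.Get("app")
	fmt.Printf("cache:     image=%s check passes=%t\n", stale.Image, generationsMatch(stale))
	fmt.Printf("APIReader: image=%s check passes=%t\n", fresh.Image, generationsMatch(fresh))
}
```

In this model the cached object passes the check while carrying the old image, whereas the direct read exposes the generation mismatch, which is the case the existing check was meant to catch.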

Combined with the existing generation check, the flow is now:

  1. APIReader reads ClowdApp from etcd → gets new spec (generation=N, status.generation=N-1)
  2. Generation check detects mismatch → requeue
  3. ClowdApp controller reconciles → status.generation=N
  4. CJI retries → generation matches → job created with correct image
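
The four steps above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the controller code: App, reconcileCJI, reconcileApp and runFlow are hypothetical names, the requeue is modelled as a plain loop, and the ClowdApp controller is assumed to catch up between CJI attempts.

```go
package main

import "fmt"

// App is a hypothetical stand-in for the ClowdApp fields that matter here.
type App struct {
	Generation       int64  // models metadata.generation
	StatusGeneration int64  // models status.generation
	Image            string // image the Job should run
}

// reconcileCJI models one CJI reconcile pass over a freshly read App.
// It asks for a requeue when the spec has not been reconciled yet.
func reconcileCJI(app App) (jobImage string, requeue bool) {
	if app.Generation != app.StatusGeneration {
		return "", true
	}
	return app.Image, false
}

// reconcileApp models the ClowdApp controller catching up with the spec.
func reconcileApp(app *App) {
	app.StatusGeneration = app.Generation
}

// runFlow retries the CJI reconcile up to maxAttempts times and returns
// the image used for the Job together with the number of requeues.
func runFlow(app *App, maxAttempts int) (string, int) {
	requeues := 0
	for attempt := 0; attempt < maxAttempts; attempt++ {
		image, requeue := reconcileCJI(*app)
		if !requeue {
			return image, requeues
		}
		requeues++
		// Assumption: the ClowdApp controller reconciles before the retry.
		reconcileApp(app)
	}
	return "", requeues
}

func main() {
	// Step 1: the direct read returns the new spec with a lagging status.
	app := &App{Generation: 2, StatusGeneration: 1, Image: "90cb9b7"}
	image, requeues := runFlow(app, 3)
	fmt.Printf("image=%s requeues=%d\n", image, requeues)
}
```

With these inputs the first attempt requeues once, the status catches up, and the second attempt creates the job with the new image.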

Changes

  • clowdjobinvocation_controller.go: Add APIReader client.Reader field; use r.APIReader.Get() for the ClowdApp fetch
  • run.go: Wire mgr.GetAPIReader() into the CJI reconciler
  • clowdjobinvocation_controller_test.go: New unit tests covering generation check (match, mismatch, backward compat), APIReader vs cache behavior, missing app, and early-exit paths
  • 02-json-asserts.sh: Fix namespace and path bugs introduced in #1742 (fix: prevent CJI with runOnNotReady from using stale job image)

Test plan

  • go vet ./controllers/cloud.redhat.com/... passes
  • go test -run TestCJI ./controllers/cloud.redhat.com/ — all 5 tests pass
  • make test — full suite passes (pre-existing covdata tool error only)

The CJI controller was reading the ClowdApp from the informer cache,
which could serve a stale but internally consistent object (where both
metadata.generation and status.generation matched at old values). This
caused runOnNotReady CJI jobs to be created with an old image when a
ClowdApp update and CJI creation were applied simultaneously.

Switch to using an APIReader (direct etcd read) for the ClowdApp fetch
so the controller always sees the latest spec. Combined with the
existing generation check, this ensures the CJI waits for the ClowdApp
controller to reconcile the new spec before creating jobs.

Also fixes namespace and path bugs in the KUTTL test script
02-json-asserts.sh introduced in the previous PR.
@rodrigonull rodrigonull force-pushed the fix/cji-bypass-informer-cache branch from 65f71b7 to 72842c0 Compare April 9, 2026 19:18