test: add openshift e2e smoke test#202

Draft
GrigoryPervakov wants to merge 15 commits into main from okd-test

Why

Need to verify OpenShift compatibility

What

Add a dedicated OpenShift e2e smoke test that runs against the
pre-provisioned OKD cluster.

Bounce the catalog-operator and olm-operator deployments in the
workflow's health-check step. The libvirt snapshot revert leaves the OLM
informers in a state where new CatalogSources are never reconciled
(status stays empty for 5+ minutes). Forcing fresh pod startups before
the test runs clears that stale state.
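One way to express the bounce is a restart followed by a rollout wait for each deployment. The sketch below only builds the `oc` argument lists; the namespace `openshift-operator-lifecycle-manager` and the 180s timeout are assumptions, not values taken from the workflow.

```python
import subprocess

# Assumed OLM namespace on OKD; adjust if the cluster differs.
OLM_NAMESPACE = "openshift-operator-lifecycle-manager"
OLM_DEPLOYMENTS = ("catalog-operator", "olm-operator")


def bounce_commands(namespace=OLM_NAMESPACE, timeout="180s"):
    """Return the oc invocations that restart each OLM deployment and wait."""
    commands = []
    for name in OLM_DEPLOYMENTS:
        target = f"deployment/{name}"
        commands.append(["oc", "-n", namespace, "rollout", "restart", target])
        commands.append(
            ["oc", "-n", namespace, "rollout", "status", target, f"--timeout={timeout}"]
        )
    return commands


def bounce_olm(run=subprocess.run):
    """Run the commands in order; check=True stops at the first failure."""
    for argv in bounce_commands():
        run(argv, check=True)
```

Passing `run` as a parameter keeps the command list inspectable without a cluster.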

OpenShift's namespace controller auto-fills the uid-range,
supplemental-groups and sa.scc.mcs annotations together on namespace
creation. Setting only two of the three skips the backfill, and the
catalog registry pod is then rejected by SCC admission with 'unable to
find annotation openshift.io/sa.scc.mcs'.

Let the controller assign all three.
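The all-or-nothing rule can be checked before a namespace manifest is applied. This is a minimal sketch; the helper name `missing_scc_annotations` is hypothetical and the check is not part of the test harness.

```python
SCC_ANNOTATIONS = (
    "openshift.io/sa.scc.uid-range",
    "openshift.io/sa.scc.supplemental-groups",
    "openshift.io/sa.scc.mcs",
)


def missing_scc_annotations(annotations):
    """Return the SCC annotations a partially annotated namespace lacks.

    An empty list means the manifest sets none of them (the controller
    fills all three) or all of them (nothing is left to backfill).
    """
    present = [key for key in SCC_ANNOTATIONS if key in annotations]
    if not present:
        return []
    return [key for key in SCC_ANNOTATIONS if key not in annotations]
```

A manifest that sets only uid-range and supplemental-groups yields `["openshift.io/sa.scc.mcs"]`, which is the annotation named in the admission error.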

The FBC catalog at ghcr.io/clickhouse/clickhouse-operator-catalog:latest
is built with GenerateMajorChannels=true, so its only channel is
'stable-v0', not 'stable'. A Subscription with channel: stable failed
resolution with 'no operators found in channel stable of package
clickhouse-operator'. Use channel: stable-v0 instead.
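For illustration, a Subscription that resolves against the major-version channel might look like the sketch below. The namespace `example-operators` and the CatalogSource name `clickhouse-catalog` are hypothetical placeholders; only the package name and channel come from the text above.

```python
import json


def subscription(channel, namespace, source, source_namespace):
    """Build an OLM Subscription manifest for the clickhouse-operator package."""
    return {
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "Subscription",
        "metadata": {"name": "clickhouse-operator", "namespace": namespace},
        "spec": {
            "name": "clickhouse-operator",  # package name, as in the error message
            "channel": channel,             # must match a channel the catalog ships
            "source": source,               # CatalogSource name (placeholder)
            "sourceNamespace": source_namespace,
        },
    }


manifest = subscription(
    channel="stable-v0",
    namespace="example-operators",
    source="clickhouse-catalog",
    source_namespace="example-operators",
)
print(json.dumps(manifest, indent=2))
```

The JSON output is a valid manifest as-is, since the API server accepts JSON as well as YAML.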

The operator now installs into clickhouse-operator-olm and watches a
separate clickhouse-operator-test namespace. Workload CRs (KeeperCluster,
ClickHouseCluster) land in the test namespace via testDeployment.

Also fan diagnostic dumps out over both namespaces.
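Under OLM, an install namespace that differs from the watched namespace is usually expressed with an OperatorGroup. The sketch below assumes that mechanism and assumes the operator's bundle supports a single-namespace install mode; the PR text does not confirm either, and both helper names are hypothetical.

```python
def operator_group(install_namespace, watch_namespaces):
    """Build an OperatorGroup that scopes the operator to watch_namespaces."""
    return {
        "apiVersion": "operators.coreos.com/v1",
        "kind": "OperatorGroup",
        "metadata": {"name": "clickhouse-operator", "namespace": install_namespace},
        "spec": {"targetNamespaces": list(watch_namespaces)},
    }


def diagnostic_namespaces(install_namespace, watch_namespaces):
    """Namespaces to dump on failure: install namespace first, no duplicates."""
    ordered = [install_namespace, *watch_namespaces]
    return list(dict.fromkeys(ordered))  # dict keys keep insertion order


group = operator_group("clickhouse-operator-olm", ["clickhouse-operator-test"])
dump_targets = diagnostic_namespaces(
    "clickhouse-operator-olm", ["clickhouse-operator-test"]
)
```

Deriving the dump targets from the same two inputs keeps the diagnostics in step with the namespace split.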

The public catalog ships v0.0.5, which lacks the operator-side emptyDir
fallback (introduced in dab3ace, awaiting release). Inject an emptyDir at
the data path via PodTemplate.Volumes and ContainerTemplate.VolumeMounts
so the keeper and clickhouse pods can write under the restricted SCC.
Drop the override once the catalog ships >= v0.0.6.
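The override boils down to one volume plus one mount per container. The sketch below works on a plain core/v1 PodSpec-shaped dict rather than the operator's own template types, and the volume name `data` and mount path `/var/lib/clickhouse` are assumptions.

```python
import copy


def inject_empty_dir(pod_spec, volume_name, mount_path):
    """Return a copy of a PodSpec dict with an emptyDir mounted in every container."""
    patched = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    patched.setdefault("volumes", []).append({"name": volume_name, "emptyDir": {}})
    for container in patched.get("containers", []):
        container.setdefault("volumeMounts", []).append(
            {"name": volume_name, "mountPath": mount_path}
        )
    return patched


spec = {"containers": [{"name": "clickhouse"}]}
patched = inject_empty_dir(spec, "data", "/var/lib/clickhouse")
```

An emptyDir is writable by the pod's assigned UID, which is why it works under the restricted SCC where an image-owned directory may not.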

The runner image no longer ships okd-revert.sh. The long-lived OKD
cluster is wiped per run via okd-cleanup.sh (namespace + OLM-object
delete; the cluster stays running and keeps kubelet certs fresh) and
destroyed and reinstalled weekly via okd-rebuild.sh.

- openshift-compatibility{,-pr}.yaml: drop the snapshot-revert and
  control-plane-probe steps (the probe was a post-revert KCM-recovery
  check and no longer applies). Call okd-cleanup.sh instead; it runs
  its own CO health check at the end, so the inline 'Verify cluster
  health' step is redundant.
- openshift-runner-rebuild.yaml: new weekly cron + workflow_dispatch
  workflow on the self-hosted runner; it shares the
  'openshift-compatibility' concurrency group so it can't race a test
  job. It calls okd-rebuild.sh with --preserve-env=GITHUB_ACTIONS so
  the script's runner-service-stop guard kicks in.

The catalog at :latest now carries both stable-v0 (release bundles) and
fast-v0 (per-commit main builds) since PR #211 landed. Pointing the
OpenShift e2e at fast-v0 means the test exercises current main rather
than the last release, which was the original intent of the OpenShift
compatibility check.

okd-cleanup.sh and okd-rebuild.sh are committed in the runner repo but
aren't on the live runner image yet; they need a Packer build + roll.
Until then the snapshot-revert path is the only thing on the runner, so
restore that workflow call to let the fast-v0 channel switch actually
run. Switch back to okd-cleanup.sh after the runner image is refreshed.

The OpenShift compatibility e2e runs against a fresh OKD cluster where
docker.io/clickhouse/clickhouse-server:26.3 (~600MB) and
docker.io/clickhouse/clickhouse-keeper:26.3 are cold-pulled at
pod-creation time. A 5-minute kubectl wait for ClickHouseCluster Ready
expires before the version-probe Job's image pull completes, so the
wait timeout is bumped.

Kind-based e2e shards pre-load images via 'kind load docker-image', so
the cold-pull case doesn't apply there. The bumped timeout is just a
ceiling, not a slowdown: happy-path tests still finish in seconds.
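The "ceiling, not a slowdown" point follows from how a polling wait behaves: it returns as soon as the condition holds and only consumes the full timeout on failure. A minimal sketch, with a hypothetical `wait_until` helper and injectable clock so it can be exercised without real sleeping:

```python
import time


def wait_until(is_ready, timeout, interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready until it returns True or timeout seconds have elapsed.

    Returns the seconds waited on success; raises TimeoutError otherwise.
    """
    start = clock()
    while True:
        if is_ready():
            return clock() - start  # early exit: the timeout is never consumed
        if clock() - start >= timeout:
            raise TimeoutError(f"not ready after {timeout}s")
        sleep(interval)
```

With a condition that becomes true on the third poll at a 5-second interval, the call returns after 10 seconds whether the ceiling is 300 or 900; the larger value only matters when the image pull is slow.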

Previously the OpenShift compat e2e was a separate workflow that ran
after each main commit (workflow_run trigger). PRs got coverage via a
temporary openshift-compatibility-pr.yaml that mirrored the main-branch
shape, but a green main-branch compat run was no guarantee that a PR's
actual diff worked against the freshly published fast catalog.

New shape:
- ci.yaml gains an 'openshift-compat' job that needs all the regular
  Operator CI gates (lint, bundle, build_and_test, fuzz_specs,
  helm-test, compat-e2e-test, e2e-test, check-crd-compat) and runs only
  when they're green. It is same-repo only: fork PRs cannot reach the
  self-hosted runner.
- The job sets continue-on-error: true and is intentionally NOT a
  dependency of ci-success-check, so the OKD self-hosted runner can be
  down for cluster maintenance (weekly rebuild) without blocking PR
  merges.
- openshift-compatibility.yaml is now workflow_dispatch-only, for manual
  reruns against the runner without re-driving the full Operator CI
  matrix.
- openshift-compatibility-pr.yaml is deleted (its own comment said to
  drop it once the harness was verified).
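The gating rules in the list above can be modelled as two small predicates. This is a sketch of the decision logic only, not the workflow itself; the function names are hypothetical, and it assumes ci-success-check depends on exactly the listed gates.

```python
REQUIRED_GATES = (
    "lint", "bundle", "build_and_test", "fuzz_specs",
    "helm-test", "compat-e2e-test", "e2e-test", "check-crd-compat",
)


def gates_green(results):
    """True when every regular Operator CI gate reported success."""
    return all(results.get(gate) == "success" for gate in REQUIRED_GATES)


def should_run_openshift_compat(results, head_repo, base_repo):
    """Run only for same-repo changes whose regular gates are all green."""
    if head_repo != base_repo:
        return False  # fork PRs cannot reach the self-hosted runner
    return gates_green(results)


def ci_success(results):
    """Merge gate: the 'openshift-compat' result is deliberately ignored."""
    return gates_green(results)
```

Because `ci_success` never reads the `openshift-compat` entry, a failed or skipped OpenShift run leaves the merge gate unchanged, which is the behaviour the second bullet describes.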