feat: long-haul (canary) test driver for the operator (#220) #348
WentingWu666666 wants to merge 35 commits into documentdb:main
Pull request overview
Adds a long-haul (run-until-failure canary) testing framework to the DocumentDB Kubernetes Operator repo, including a design document and an initial Phase 1a skeleton that is CI-safe by default.
Changes:
- Introduces a long-haul Ginkgo suite with a `BeforeSuite` enablement gate (`LONGHAUL_ENABLED`) and a placeholder canary spec.
- Adds a `test/longhaul/config` package for env-based configuration loading/validation with unit tests.
- Documents the long-haul test architecture and usage via a new design doc and package README.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| operator/src/test/longhaul/suite_test.go | Adds the Ginkgo suite entry point for long-haul tests. |
| operator/src/test/longhaul/longhaul_test.go | Adds gated BeforeSuite config loading/validation and a placeholder canary spec. |
| operator/src/test/longhaul/config/suite_test.go | Adds the Ginkgo suite entry point for config unit tests. |
| operator/src/test/longhaul/config/config_test.go | Adds unit tests for config defaults, env parsing, validation, and enablement gating. |
| operator/src/test/longhaul/config/config.go | Implements long-haul test configuration (env vars, defaults, validation, enablement). |
| operator/src/test/longhaul/README.md | Adds usage docs, configuration reference, and CI-safety notes for long-haul tests. |
| docs/designs/long-haul-test-design.md | Adds the long-haul canary design document and phased implementation plan. |
```go
// Config holds all configuration for a long haul test run.
type Config struct {
	// MaxDuration is the maximum test duration. Zero means run until failure.
	// Requires explicit LONGHAUL_MAX_DURATION=0 to enable infinite runs.
```
This comment says infinite runs require LONGHAUL_MAX_DURATION=0, but time.ParseDuration requires a unit (and the README/tests use 0s). Update the comment to match the actual accepted value to avoid confusing users.
```diff
-	// Requires explicit LONGHAUL_MAX_DURATION=0 to enable infinite runs.
+	// Requires explicit LONGHAUL_MAX_DURATION=0s to enable infinite runs.
```
xgerman left a comment
Review
Verdict: request-changes — lightly. No blocking issues for the skeleton itself; the skip gate is airtight and the config tests are appropriate. But there are four structural items the Phase 1a framing bakes in that will be expensive to unwind in Phase 1b/1c if not settled now.
Verified first: test-unit.yml uses operator/src as working-dir and runs go list ./..., which enumerates the new packages. With LONGHAUL_ENABLED unset the BeforeSuite calls Skip() and the suite registers as a single "skipped" line. Zero CI overhead. ✅
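For reference, the gate shape being described is roughly the following; a minimal sketch, not a verbatim copy of the PR's longhaul_test.go:

```go
package longhaul_test

import (
	"os"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestLonghaul(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "LongHaul Suite")
}

var _ = BeforeSuite(func() {
	// With LONGHAUL_ENABLED unset (the CI default), the whole suite registers
	// as a single skipped entry and does no other work.
	if os.Getenv("LONGHAUL_ENABLED") != "true" {
		Skip("long-haul tests are disabled; set LONGHAUL_ENABLED=true to run them")
	}
	// Config loading/validation would happen here in the real suite.
})
```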
🟠 Major
M1. Module placement will pollute operator/src/go.mod — operator/src/test/longhaul/ (entire tree)
The existing test/e2e/ is a separate Go module at test/e2e/go.mod — a deliberate isolation so mongo-driver/v2, CNPG test utilities, and other test-only deps never enter the operator's dependency graph. This PR puts longhaul inside the operator module.
For Phase 1a (stdlib + Ginkgo/Gomega, already in the operator go.mod) this is harmless. For the phases this PR's design describes:
- Phase 1b adds `go.mongodb.org/mongo-driver` (design line 75)
- Phase 1c adds a file-backed journal
- Phase 1d adds OTel / Prometheus client libs
- Phase 2f adds k8s Job manifests + likely a controller-runtime client

…all would land in operator/src/go.mod and be forced onto every operator build. Resolve before the first non-stdlib dep is introduced. Preferred: move to test/longhaul/ with its own go.mod (mirrors e2e). Alternative: nest a go.mod under operator/src/test/longhaul/ and document the convention in CONTRIBUTING.md.
M2. Design does not engage with the existing e2e suite — docs/designs/long-haul-test-design.md throughout
Neither the PR description nor the design doc reference docs/designs/e2e-test-suite.md or test/e2e/. That suite already ships primitives long-haul will need:
- Ginkgo v2 harness with `SynchronizedBeforeSuite` for cluster bootstrapping (`test/e2e/pkg/testenv`)
- Label taxonomy (`test/e2e/labels.go`)
- Run-ID tagging for artifact separation (`test/e2e/runid.go`)
- envsubst-based manifest rendering
- A mongo-driver v2 wire-protocol client (same driver the design proposes)
- A `.github/actions/setup-test-environment` composite action for cluster provisioning
The "Directory Structure" section (lines 213–244) proposes reinventing monitor/, config/, suite bootstrap, and mongo client from scratch. The design should commit to either reuse — most naturally by making longhaul a sibling module that depends on test/e2e/pkg/ — or explicitly enumerate what it forks and why. Otherwise, six months from now we'll have two drifting implementations of "connect to a DocumentDB CR and open a mongo client" with different flag names, different retry policies, and different label conventions.
M3. Four design gaps that will cost rework, not captured in §Open Questions — long-haul-test-design.md:443-445
- Event journal durability across process/pod restarts. Line 239 says "In-memory + file-backed" but never says where the file lives. If the canary runs as a Kubernetes Job (line 305), the pod's ephemeral filesystem disappears on restart — and Phase 2e explicitly plans for operator restart / pod eviction chaos. The journal is the only post-mortem artifact and it will be destroyed by the very events it is supposed to record. Commit to a PVC mount or an external sink (object storage, Loki) in the design.
- Workload resumption idempotency. Writers use a unique index on `(writer_id, seq)` (line 80). If a writer or the Job pod restarts, where does `seq` restart from? Reset to 0 → writes collide. Bootstrap from `max(seq)` per `writer_id` → the oracle must tolerate a gap between crash and resume without flagging data loss. Neither path is described (one possible resume path is sketched after this list).
- Teardown on abort. "On failure: collect artifacts, preserve cluster state" (line 188) handles the fatal path. But Ctrl-C, CI cancellation, or Job preemption mid-run leaves writer data in the target DB, possibly an in-progress restore cluster (Phase 2a), possibly a half-scaled topology (Phase 1e). No cleanup hooks, resumable state files, or leftover-run detector.
- Latency-regression baseline policy. Failure tiers list "latency regression" as a warning (line 197), but no baseline/threshold policy is defined. List this as an Open Question at minimum so Phase 1b doesn't invent a threshold policy in a PR review thread.
These shape the Phase 1b/1c interfaces and should be addressed (or explicitly deferred in §Open Questions) before the design doc merges.
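To make the resumption-idempotency item concrete, one possible shape for the max(seq) bootstrap is sketched below. Collection and field names follow the design's (writer_id, seq) index; the import paths are the v1 driver layout (the design proposes v2, where paths gain a /v2/ segment), and the function name is illustrative.

```go
package workload

import (
	"context"
	"errors"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// resumeSeq returns the next sequence number a restarted writer should use:
// one past the highest seq already durably written, or 0 for a fresh writer.
// The verifier must then tolerate the crash/resume gap without calling it data loss.
func resumeSeq(ctx context.Context, coll *mongo.Collection, writerID string) (int64, error) {
	var last struct {
		Seq int64 `bson:"seq"`
	}
	err := coll.FindOne(ctx,
		bson.M{"writer_id": writerID},
		options.FindOne().SetSort(bson.D{{Key: "seq", Value: -1}}),
	).Decode(&last)
	if errors.Is(err, mongo.ErrNoDocuments) {
		return 0, nil // first run for this writer_id
	}
	if err != nil {
		return 0, err
	}
	return last.Seq + 1, nil
}
```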
M4. MaxDuration=30m default contradicts "run until failure" — config.go:594, long-haul-test-design.md:37,188
The design repeatedly commits to the Strimzi run-until-failure canary model. The Phase 1a default is 30m; the env doc (line 259) says 0s opts into infinite. Fine for local development — exactly wrong for the Phase 2f canary Job. Pick one: either (a) document that the canary Job manifest will set LONGHAUL_MAX_DURATION=0s explicitly and mention this in the README "CI Safety" section, or (b) flip the default to 0 and require local-dev to pass LONGHAUL_MAX_DURATION=10m.
🟡 Minor (cheap in-PR fixes)
- m1 `IsEnabled()` — config.go:637-640 — doesn't trim whitespace. `LONGHAUL_ENABLED=" true "` → false (silent skip). Add `strings.TrimSpace`. Silent-failure mode; worth fixing in-PR (see the sketch after this list).
- m2 `Validate()` — config.go:625-633 — accepts negative `MaxDuration`. `time.ParseDuration("-5m")` succeeds. Add `if c.MaxDuration < 0 { return fmt.Errorf(...) }`.
- m3 Error strings capitalized — config.go:627,630 — ST1005 / Go convention: lowercase. Matching tests use `ContainSubstring("Namespace")` / `"ClusterName"` and will need adjustment.
- m4 Path inconsistency — long-haul-test-design.md:340 says `kubectl apply -f test/longhaul/deploy/job.yaml` but the directory structure roots at `operator/src/test/longhaul/`. Fix path, or resolve M1 first.
- m5 README overstates CI safety — README.md:543-547 — phrase as "no CI workflow currently sets `LONGHAUL_ENABLED`; do not set it in any PR-gated workflow" rather than as a property of the code.
- m6 `config` is a very generic package name. If Phase 2 ever imports the operator's own config, it will shadow. Consider `longhaulcfg` or `cfg`. Not blocking.
- m7 Missing `IsEnabled` test cases — config_test.go:727-757 — given the silent-skip failure mode, lock it down: leading/trailing whitespace, mixed case (`"True"`, `"YES"`), `"0"` / `"no"` / `"false"` → false.
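For m1/m2 the intended fixes are small; a sketch assuming field and function names (the PR's actual config.go may differ):

```go
package config

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// Config fields here are assumed for illustration, not copied from the PR.
type Config struct {
	Namespace   string
	ClusterName string
	MaxDuration time.Duration
}

// m1: trim before comparing so LONGHAUL_ENABLED=" true " does not silently skip.
func IsEnabled() bool {
	return strings.TrimSpace(os.Getenv("LONGHAUL_ENABLED")) == "true"
}

// m2: reject negative durations, which time.ParseDuration("-5m") happily produces.
// Error strings lowercase per m3.
func (c *Config) Validate() error {
	if c.Namespace == "" {
		return fmt.Errorf("namespace must be set")
	}
	if c.ClusterName == "" {
		return fmt.Errorf("cluster name must be set")
	}
	if c.MaxDuration < 0 {
		return fmt.Errorf("max duration must not be negative, got %s", c.MaxDuration)
	}
	return nil
}
```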
🟢 Nit
- longhaul_test.go:799 — package-level mutable `testConfig`. Fine for a single-file suite; revisit if Phase 1b splits files.
- `DefaultConfig()` by value vs. `Validate()` pointer receiver — slight asymmetry. Leave it.
- README line 509's `-timeout 0` guidance is correct and rarely documented — nice touch.
Positive feedback
- Skip gate is correctly placed in `BeforeSuite`, not `BeforeEach` — single skipped line instead of N per spec. Exactly right.
- `LoadFromEnv` cleanly separates parsing from validation, so `Validate()` can be called on a programmatically-built `Config` in Phase 1b unit tests without fighting env vars.
- The design doc's survey of Strimzi / CloudNative-PG / CockroachDB / Vitess and the "adopt / skip" table (lines 354–361) is unusually high-quality framing for an OSS design doc.
- Phased breakdown (Phase 1a → 3) is appropriately granular — each phase is demoable and reviewable.
- Test coverage on the config package is thorough for the surface area.
Questions for the author
- (M1) Any reason longhaul can't live at `test/longhaul/` as a sibling module to `test/e2e/`? If there is one, please capture it in the design.
- (M2) Have you looked at `test/e2e/pkg/` for reusable primitives (typed clients, envsubst, runid, labels)? What's the plan for avoiding a second mongo-client implementation?
- (M3.1) For the journal: PVC, external object store, or stdout-scraped-by-Loki?
- Issue #220 text isn't quoted in the PR — are there SLO targets, acceptance criteria, or a required timeframe (e.g., 72h)? Restating them in the design's Problem Statement would make future reviews easier.
- Should CONTRIBUTING.md or AGENTS.md get a one-liner pointing at `operator/src/test/longhaul/README.md` so contributors aren't left wondering "what is this tree?"
Reviewed with the GitHub Copilot code-review agent.
Response to @xgerman's Review

Thank you for the thorough review! I've addressed all items. Here's a summary:

Major Items

M1 (Module placement): Agreed; extracted long-haul tests to test/longhaul/ as their own Go module.

M2 (E2E engagement): Designed a three-directory layout for test code reuse; each directory has its own go.mod.

M3 (Design gaps): Upgraded from Open Questions to Provisional Design Decisions in the design doc. These lock the approach for Phase 1b/1c interfaces while leaving implementation details to those PRs.

M4 (MaxDuration default): Added notes in both the design doc (Configuration section) and README (CI Safety section) that the persistent canary Job manifest explicitly sets LONGHAUL_MAX_DURATION=0s.

Minor Items: all fixed.

Questions

Test results: 23 config specs + 1 canary spec (skipped) = all passing.
The plan looks good. We should keep running parallel tests for different combinations, e.g. operator version N with gateway/extension version N-1, and operator version N-2 confirming it keeps working with the latest non-schema changes. At a minimum, we should test different operator versions, not only restart the tests on the latest release.
Thanks for the feedback @hossain-rayhan! Great point about version compatibility. Since the operator and database images follow independent version tracks, we agree cross-version testing matters, e.g. upgrading the operator while the database is still on the previous version, and vice versa. That said, we think this is better suited for the E2E test suite rather than long-haul. Cross-version compat bugs surface immediately on upgrade (deterministic, finite), while long-haul focuses on time-dependent issues (memory leaks, connection pool exhaustion, replication drift) that only appear after hours/days of sustained operation. We'll make sure the E2E suite covers both operator-first and database-first upgrade paths with version N/N-1 combinations.
Previous run data caused unique index conflicts on (writer_id, seq). Now the writes collection is dropped at test start for a clean slate. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
… binary) Adds the infrastructure to run the long-haul test as a Kubernetes Job: - cmd/longhaul/main.go: standalone binary (no Ginkgo dependency) - Dockerfile: multi-stage distroless image - deploy/job.yaml: Job + ConfigMap for environment configuration - deploy/rbac.yaml: ServiceAccount (Phase 2 adds RBAC roles) - Updated README with deployment instructions The standalone binary is validated against AKS (1490 writes, 100% success). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…rts, alerting - Replace placeholder ClusterClient with real k8s implementation using unstructured dynamic client (no operator module dependency) - Add metrics-server integration for memory/CPU sampling with graceful fallback when metrics API is unavailable - Add periodic checkpoint reporter (configurable interval, default 1h) that writes to stdout and persists to ConfigMap - Add GitHub Actions workflow annotations (error/warning/notice) - Update RBAC with pod/CR/configmap access and ClusterRole for metrics - Update Dockerfile to golang:1.26-alpine for k8s client-go compatibility Validated on AKS: pod running healthy, all components initialized. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
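A rough sketch of the "persists to ConfigMap" step described in this commit, using client-go. The `longhaul-report` name matches the ConfigMap mentioned elsewhere in the PR; the data key and helper name are illustrative.

```go
package report

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// PublishReport upserts the rendered checkpoint report into the longhaul-report
// ConfigMap so the verdict survives pod restarts. Sketch only.
func PublishReport(ctx context.Context, cs kubernetes.Interface, ns, markdown string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "longhaul-report", Namespace: ns},
		Data:       map[string]string{"report.md": markdown},
	}
	_, err := cs.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
	if apierrors.IsNotFound(err) {
		// First checkpoint of a run: the ConfigMap does not exist yet.
		_, err = cs.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
	}
	return err
}
```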
Polls AKS every 15 minutes. If the Job fails or the ConfigMap reports FAIL, the workflow exits non-zero and GitHub sends email to repo watchers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Add deploy/setup.yaml with namespace, credentials secret, and DocumentDB CR spec for reproducible long-haul test environments - Update job.yaml image to documentdblonghaul.azurecr.io registry - Update monitor workflow to target longhaul-aks cluster - Set max replicas to 1 (current webhook validation limit) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The Microsoft tenant blocks new service principal creation without a ServiceManagementReference. Switch to a base64-encoded kubeconfig stored as a LONGHAUL_KUBECONFIG repo secret, which is simpler and avoids the SP policy. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Change cron to hourly (matches LONGHAUL_REPORT_INTERVAL default 1h) - Add staleness check: fail if ConfigMap last-updated > 2h ago - Detects hung tests or bugs that stop updating the report Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Move staleness check before result check (stale data is untrustworthy) - Change 'Job not found' from notice to error (triggers notification) - Add instructions to disable workflow during maintenance Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The standalone binary (cmd/longhaul/main.go) is the primary entry point, deployed as a Kubernetes Job. The Ginkgo wrapper added no unique value. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Remove suite_test.go and longhaul_test.go from structure - Replace Ginkgo run instructions with 'go run ./cmd/longhaul/' - Remove LONGHAUL_ENABLED from config table - Add LONGHAUL_REPORT_INTERVAL to config table - Update CI safety section (no longer gated by LONGHAUL_ENABLED) - Update deploy file listing to match current state Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Add upgrade-if-needed job (runs only after health check passes) - Compare running versions against repo's Chart.yaml/values.yaml - Helm upgrade operator if appVersion drifts - Patch DocumentDB CR if documentDbVersion drifts - Ensures long-haul cluster always runs latest merged versions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Query ghcr.io for latest published operator and documentdb image tags - Fallback to Chart.yaml / values.yaml if GHCR API fails - Ensures cluster upgrades to actual released versions, not just repo HEAD Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Add schemaVersion: auto to setup.yaml so schema follows binary version - Upgrade step always sets schemaVersion=auto alongside documentDbVersion - Handle case where documentDbVersion is not set in CR (uses operator default) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ction - Operator: /orgs/documentdb/packages/container/documentdb-operator/versions - DocumentDB: /orgs/documentdb/packages/container/documentdb-kubernetes-operator%2Fdocumentdb/versions - jq filter only matches strict X.Y.Z tags (rejects rc, latest, sha prefixes) - Keeps Chart.yaml/values.yaml as fallback if GHCR query fails Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Operator: rollout status + image tag re-check after helm upgrade - DocumentDB: poll status.documentDBImage every 20s (10m timeout) - Use status.documentDBImage instead of spec for running version detection - Fail workflow with notification if verification times out Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…to operator upgrade
The DocumentDB version upgrade is structurally identical to a scale operation
(spec patch on the DocumentDB CR -> operator reconciles -> rolling restart),
so it belongs in the test binary alongside scale/failover. This lets the
continuous data-integrity verifiers detect any corruption introduced by the
upgrade, and treats upgrade as a real load-bearing event rather than a
maintenance window where load is paused.
Workflow now:
- Helm-upgrades the operator (still infra-level work) and applies CRDs
first, since 'helm upgrade' does not touch CRDs by default.
- Publishes the desired DocumentDB version to a 'longhaul-versions'
ConfigMap instead of patching the CR directly.
Test now has an UpgradeDocumentDB operation:
- Precondition: ConfigMap version differs from status.documentDBImage.
- Execute: patch spec.documentDbVersion + spec.schemaVersion='auto', poll
status until the new image tag is observed, then WaitForSteadyState.
- Low weight so it doesn't crowd out scale/failover.
- Generous OutagePolicy because rolling restarts touch every pod.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The driver is restart-safe (resumes from PVC journal/checkpoint), so a
Deployment(replicas=1) lets us roll new images with kubectl set image
instead of delete-and-recreate.
- deploy/deployment.yaml replaces deploy/job.yaml
- kind: Deployment, replicas: 1, strategy: Recreate
- image is templated: __OWNER__ and __IMAGE_TAG__ are substituted by
the deploy workflow (or sed manually)
- .github/workflows/longhaul-deploy.yml (manual workflow_dispatch)
- Auths via LONGHAUL_KUBECONFIG secret (SA-token kubeconfig)
- Inputs: image_tag (required), apply_manifests (optional)
- kubectl apply (optional) -> set image -> rollout restart -> rollout status
- README runbook updated for Deployment workflow
On critical failure the driver still exits non-zero; the Deployment then
auto-restarts. Pod restart count and the report ConfigMap are the
sources of truth for "did the test fail", not pod status alone.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Bootstrap (namespace, DocumentDB CR, credentials secret, driver SA + ClusterRole for metrics) is one-time admin work and lives in setup.yaml + rbac.yaml. The deployer SA used by the workflow is intentionally namespace-scoped and cannot create those, so trying to apply them in every deploy run failed with 'Forbidden'. Drop the apply_manifests input and unconditionally apply only deployment.yaml. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
* longhaul-deploy.yml: add workflow_run trigger so a successful build on main auto-deploys the matching :sha-<short> image. workflow_dispatch preserved for manual rolls/rollbacks. Resolve image tag from either trigger via a new steps.img.tag output. * longhaul-monitor.yaml: clarify in the header that the monitor only upgrades the operator directly; for DocumentDB it merely publishes the desired version to the longhaul-versions ConfigMap, and the long-haul driver performs the in-band upgrade as a load-aware operation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The monitor workflow had two pre-existing bugs that prevented it from ever passing after the long-haul auth + workload-shape changes landed: 1. Kubeconfig decode mismatch. The LONGHAUL_KUBECONFIG secret is stored as raw YAML (the deploy workflow writes it with printf), but the monitor decoded it with 'base64 -d', producing garbage and failing the very first kubectl call. 2. Job vs Deployment drift. The long-haul test was converted from a Job to a Deployment(replicas=1) so 'kubectl set image' can roll it. The monitor still ran 'kubectl get job longhaul-test' and tripped the 'No long-haul Job found' guard on every run. Switch the monitor to: - write the kubeconfig with printf (matches longhaul-deploy.yml), - export KUBECONFIG via GITHUB_ENV so subsequent steps find it, - check the Deployment's readyReplicas / Progressing condition instead of Job .status.active / Failed condition. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The golang:*-alpine base image already ships ca-certificates (it has to, otherwise 'go mod download' against proxy.golang.org would not work out of the box), and every dep in test/longhaul/go.mod resolves through the module proxy with no VCS fetches, so 'git' is unused. Removing the install shaves a layer and ~5-8s off cold builds. If we later add a 'replace' pointing at a Git URL or import a path that needs go-get HTML discovery, we'll need to add 'git' back. Verified locally with 'docker build .' build succeeds and produces a working binary in the distroless runtime stage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…de (documentdb#220) The README, design doc, and a couple of header comments in config.go were lagging behind the actual driver state shipped in this PR. This refresh aligns all three with what the code actually does today, so reviewers and future contributors do not have to cross-reference stale phase markers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…b#220) Address Copilot review feedback on PR documentdb#348: the "uses defaults when no env vars set" spec previously only cleared three variables, so a developer with any other LONGHAUL_* var exported in their shell could see false failures. Move the clearing into a BeforeEach that wipes every recognized LONGHAUL_* env var. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…rade on HA (documentdb#220) Address xgerman's blockers on PR documentdb#348: [#1] ScaleUp/ScaleDown were patching spec.nodeCount, which the CRD pins to 1 (minimum=maximum=1). The API server rejected every scale call -- no real scale event was ever exercised and OpsExecuted over-counted. Switch all scale paths to spec.instancesPerNode (CRD range 1-3), the actual HA scale dimension. Add GetInstancesPerNode reader; remove the now-unused GetCurrentReplicas. Clamp Max to 3 / Min to 1 in the operation constructors. [#2] deploy/deployment.yaml exported LONGHAUL_MIN_INSTANCES / LONGHAUL_MAX_INSTANCES but config.LoadFromEnv only read LONGHAUL_MIN_REPLICAS / LONGHAUL_MAX_REPLICAS, so the deployment env vars were silently dropped and the driver fell back to defaults (which were also wrong: 2..5 vs the CRD's 1..3). Rename Env constants, struct fields, and config defaults (1..3) to match the deployment manifest. Add a Validate() check rejecting MaxInstances > 3. [#4] UpgradeDocumentDB.Precondition now skips when spec.instancesPerNode<2 with a clear message. A single-instance cluster has no standby, so a rolling upgrade WILL produce real (unavoidable) downtime; reporting that as a policy violation is a true positive but not actionable. The 10s scheduler tick re-evaluates as soon as the cluster scales up. Skipped operations do not consume the global cooldown (scheduler.tryExecute only updates lastOpTime after successful executeOp), so this guard is free. Bonus fixes found while refactoring: * spec.documentDbVersion -> spec.documentDBVersion (capital DB) -- the CRD's actual JSON tag. The old field was being silently ignored by the API server, so the upgrade operation never actually patched the version. (Caught by writing the GetInstancesPerNode unit-style read against the same CR object.) * deploy/setup.yaml: bump cluster to instancesPerNode: 2 (HA topology) so the upgrade operation has a standby to fail over to and actually fires on the canary. Fix the stale 'kubectl apply -f deploy/job.yaml' header comment to point at deployment.yaml. (xgerman nits documentdb#13 + documentdb#14.) * docs/designs/long-haul-test-design.md: rename MinReplicas/MaxReplicas references to MinInstances/MaxInstances and add the CRD bounds context so future readers understand why the range is 1-3, not arbitrary. * test/longhaul/config/config_test.go: update the BeforeEach env-clear list to the new constant names. Verified: go vet ./... clean; go test ./... -> ok config (19 specs). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
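A hedged sketch of the instancesPerNode read/patch path this commit describes, using the dynamic client and unstructured helpers. The GroupVersionResource and function names are placeholders (use the CRD's actual group/version); the 1..3 clamp comes from the commit message.

```go
package operations

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// Placeholder GVR: substitute the DocumentDB CRD's real group/version/resource.
var documentDBGVR = schema.GroupVersionResource{
	Group: "example.documentdb.io", Version: "v1", Resource: "documentdbs",
}

// GetInstancesPerNode reads spec.instancesPerNode from the DocumentDB CR.
func GetInstancesPerNode(ctx context.Context, dyn dynamic.Interface, ns, name string) (int64, error) {
	obj, err := dyn.Resource(documentDBGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	n, found, err := unstructured.NestedInt64(obj.Object, "spec", "instancesPerNode")
	if err != nil || !found {
		return 0, fmt.Errorf("spec.instancesPerNode not readable: found=%v err=%v", found, err)
	}
	return n, nil
}

// SetInstancesPerNode patches the HA scale dimension, clamped to the CRD's 1..3 range.
func SetInstancesPerNode(ctx context.Context, dyn dynamic.Interface, ns, name string, n int64) error {
	if n < 1 {
		n = 1
	}
	if n > 3 {
		n = 3
	}
	patch := []byte(fmt.Sprintf(`{"spec":{"instancesPerNode":%d}}`, n))
	_, err := dyn.Resource(documentDBGVR).Namespace(ns).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```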
…dDowntime TODO (documentdb#220) Sweep of xgerman's lower-severity findings on PR documentdb#348: [documentdb#8] journal.DisruptionWindow.ExceededPolicy() ignored Policy.AllowedDowntime even though every operation's OutagePolicy() sets it. Document the gap with a TODO(longhaul, documentdb#220) explaining what enforcement would require (per-write timestamps fed into the active window from writer.go) so the dead field is not mistaken for actual policy enforcement. [documentdb#9] UpgradeDocumentDB.waitForImage was 'continue'-ing on every status read error during the rolling restart, which made a sustained apiserver outage indistinguishable from a slow upgrade until the recovery timeout fired. Plumb the journal into the upgrade operation and emit j.Warn on each transient read error so a real outage shows up in the report immediately while one-off blips still retry on the next 10s tick. Constructor signature changes; main.go updated. [nit] report.EmitAnnotation called Duration.Round(1) which rounds to 1ns -- a no-op. Use Round(time.Second) to match GenerateMarkdown so annotations read '2h17m' instead of '2h16m59.834561287s'. [nit] monitor.ClusterHealth.WritesHealthy declared but never set or read. Removed; can be added back alongside its populator if/when the writer's healthy/unhealthy signal is wired in. [documentdb#7 deferred] Pinning go.mod to 1.25.x is non-trivial: indirect k8s.io/* deps were tidied against pre-release versions (dated 20260120) that require go 1.26 features. Downgrading them needs its own change (release-pinned k8s.io/* + transitive resolution) rather than just editing the directive. Tracking separately. Verified: go vet ./..., go test ./... (config: 19 specs) clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
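The DisruptionWindow policy check described here can be pictured roughly as follows; field names are taken from the commit message, everything else is illustrative rather than the PR's actual journal package.

```go
package journal

import "time"

// OutagePolicy field names follow the commit message; the PR's struct may differ.
type OutagePolicy struct {
	AllowedWriteFailures int
	MustRecoverWithin    time.Duration
	AllowedDowntime      time.Duration // reserved: not yet enforced, see TODO below
}

type DisruptionWindow struct {
	Start, End    time.Time // End is zero while the window is still open
	WriteFailures int
	Policy        OutagePolicy
}

// ExceededPolicy reports whether the window blew its budget. Today that means
// too many write failures, or an open window that has outlived MustRecoverWithin.
// TODO(longhaul, #220): enforcing AllowedDowntime needs per-write timestamps from
// the writer so the window can measure contiguous unavailability, not just counts.
func (w DisruptionWindow) ExceededPolicy(now time.Time) bool {
	if w.WriteFailures > w.Policy.AllowedWriteFailures {
		return true
	}
	if w.End.IsZero() && now.Sub(w.Start) > w.Policy.MustRecoverWithin {
		return true
	}
	return false
}
```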
…ocumentdb#220) [documentdb#6] writer.writeOne now treats mongo.IsDuplicateKeyError as a successful, idempotent ACK. With retryable writes on by default in the v2 driver, a network blip during a disruption window can produce: server commits insert, ACK is dropped, driver auto-retries the same _id, server returns code 11000. The data is durably committed in that case, so counting it as a write failure (which feeds AllowedWriteFailures and the policy gate) was turning successful writes into spurious FAIL verdicts during the very disruption windows the harness exists to measure. [documentdb#5] verifier.verifyWriter now resumes from a per-writer nextSeq map instead of re-scanning the entire writer history every 10s. A 100ms-per-write writer accumulates ~864k docs/day, so the original full scan grew to ~75M doc-reads/hour per writer over a 3-day run, both saturating the cluster and turning the verifier's own load into a confounding signal in the report. Filter is now {writer_id, seq: {$gte: nextSeq}}; nextSeq is advanced to lastSeen+1 each cycle. Gaps are still detected because we always start the scan at the first unverified seq -- a future cycle that finds a gap-fill below nextSeq is intentionally not re-checked (an in-flight delayed write that lands below a previously-seen seq would already have produced a gap on the cycle that observed the higher seq). Verified: go vet ./..., go test ./... clean. Mongo-dependent unit tests for both behaviors land in the next commit alongside the rest of the unit-test sweep. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
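Roughly what those two behaviors look like in code. Names and import paths are assumptions (v1-style paths shown; the repo uses the v2 driver, whose paths gain a /v2/ segment), not the PR's exact workload package.

```go
package workload

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// writeOne treats a duplicate-key error as a successful, idempotent ACK: with
// retryable writes, a dropped ACK plus an automatic retry of the same _id
// yields code 11000 even though the document is durably committed.
func writeOne(ctx context.Context, coll *mongo.Collection, doc bson.M) error {
	_, err := coll.InsertOne(ctx, doc)
	if err != nil && mongo.IsDuplicateKeyError(err) {
		return nil
	}
	return err
}

// verifyWriter resumes from nextSeq instead of re-scanning the full history,
// keeping verifier read load flat over a multi-day run. Returns the new nextSeq.
func verifyWriter(ctx context.Context, coll *mongo.Collection, writerID string, nextSeq int64) (int64, error) {
	filter := bson.M{"writer_id": writerID, "seq": bson.M{"$gte": nextSeq}}
	cur, err := coll.Find(ctx, filter, options.Find().SetSort(bson.D{{Key: "seq", Value: 1}}))
	if err != nil {
		return nextSeq, err
	}
	defer cur.Close(ctx)
	for cur.Next(ctx) {
		var d struct {
			Seq int64 `bson:"seq"`
		}
		if err := cur.Decode(&d); err != nil {
			return nextSeq, err
		}
		// A gap (d.Seq > nextSeq) would be journaled as potential data loss here.
		nextSeq = d.Seq + 1
	}
	return nextSeq, cur.Err()
}
```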
…ocumentdb#220) Adds the per-package unit-test sweep xgerman flagged in PR documentdb#348 [#3]. Workload writer/verifier remain integration-tested only because their meaningful behavior (mongo retryable writes, dup-key handling, incremental scan filters) requires a real or mocked mongo collection that is out of scope for this PR. journal/policy_test.go covers DisruptionWindow.{IsActive, Duration, ExceededPolicy} including boundary cases (write failures equal to budget allowed; active window evaluated against MustRecoverWithin). journal/journal_test.go covers Record/Info/Warn/Error append semantics, EventsSince cutoff, the OpenDisruptionWindow auto-close-previous behavior, RecordWriteFailure no-op when no active window, HasPolicyViolation across active and closed windows, and a concurrent-append correctness check. monitor/health_test.go uses a fakeClusterClient stub to exercise IsSteadyState transitions: not-yet-healthy, becoming-healthy after the configured wait, loss-of-health resets steadySince, poll-error resets steadySince, and WaitForSteadyState's reached/cancelled paths. monitor/leakdetect_test.go covers the linear-regression slope: below min-samples returns zeroed result, flat memory produces ~0 slope, growing memory above threshold flags HasLeak with the expected slope (~3600 MB/h for 1 MB/s growth), sub-threshold growth does not flag, and the minSamples<3 floor is enforced. operations/scheduler_test.go uses a fakeOp stub to validate selectOperation: nil when no precondition passes, nil when total weight is zero, only-eligible filtering, weighted-distribution within tolerance over 2000 trials, and executeOp's window-open/close + ERROR-event-on-failure behavior. report/report_test.go snapshots GenerateMarkdown invariants: PASS vs FAIL header, duration rounded to seconds, data-plane metrics table always present, disruption-windows and leak-analysis sections gated on data presence, and the recent-events tail truncated to 20. Verified: go test ./... clean. -race not run locally (CGO unavailable on this Windows env); the concurrent-append test still validates final-count correctness without it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
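The leak-detector slope those tests exercise is an ordinary least-squares fit; an illustrative version, not the PR's exact monitor package:

```go
package monitor

import "time"

type Sample struct {
	At    time.Time
	MemMB float64
}

// slopeMBPerHour fits memory usage against elapsed time with ordinary least
// squares and returns the growth rate in MB/hour. A steady 1 MB/s leak comes
// out at roughly 3600 MB/h, which is what the unit tests assert against.
func slopeMBPerHour(samples []Sample) float64 {
	const minSamples = 3
	if len(samples) < minSamples {
		return 0
	}
	t0 := samples[0].At
	var sumX, sumY, sumXY, sumXX float64
	for _, s := range samples {
		x := s.At.Sub(t0).Hours()
		y := s.MemMB
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	n := float64(len(samples))
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return 0 // all samples at the same instant; no slope to report
	}
	return (n*sumXY - sumX*sumY) / denom
}
```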
Adds a section to test/longhaul/README.md explaining how the long haul harness relates to the Ginkgo-based e2e suite landed in PR documentdb#346: - Different shapes (daemon vs test binary) and lifetimes (days vs minutes) - Different operator-API access patterns (dynamic vs typed client) - Concrete e2e helpers (mongo, operatorhealth, clusterprobe) that long haul will likely consume once it grows beyond raw URI / no-TLS Explains why a shared test/shared/ module is deliberately not introduced yet: today's only duplicated surface (raw mongo connect + ping) is too small to justify a third Go module given the version drift between the two test modules. Revisit when the driver adopts e2e's connection model. No code changes. Signed-off-by: Wenting Wu <wentingwu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes #220.
What this PR adds
A standalone long-haul (canary) test driver for the DocumentDB Kubernetes Operator, plus the GitHub Actions plumbing to keep it running on a long-lived AKS cluster.
The driver lives under `test/longhaul/` and is its own Go module. It is not a Ginkgo suite — it is a long-running binary (`cmd/longhaul`) deployed as a `Deployment` into a namespace alongside a real DocumentDB CR. It exercises the operator with a continuous, realistic workload while a weighted-random scheduler injects disruptive operations, and it journals every event so failures can be reproduced.

Pieces that ship in this PR
- `workload/`: writer + verifier goroutines, sequence / checksum gap detection, lag-aware reads — the durability oracle for FM1/FM3/FM7.
- `operations/`: weighted random selection, per-category cooldowns, steady-state gating, and an outage-policy verdict per operation (a minimal weighted-selection sketch follows this list). Operations implemented today: `ScaleUp`, `ScaleDown`, `UpgradeDocumentDB` (auto-skips when `instancesPerNode < 2` because a rolling restart on a single instance is unavoidable downtime).
- `monitor/k8sclient.go`: reads the DocumentDB CR, watches pods, pulls metrics-server samples — no fakes.
- `monitor/health.go`, `monitor/leakdetect.go`: steady-state detection plus linear-regression slope on memory/CPU samples.
- `journal/`: every disruption window, every workload event, every verdict. The disruption-window oracle is what makes verdicts reproducible.
- `report/`: writes the `longhaul-report` ConfigMap so a human can `kubectl get cm longhaul-report -o yaml` for a single source of truth on pass/fail. Alerts surface in the monitor workflow.
- `Dockerfile`, `deploy/longhaul.yaml`: runs as a `Deployment` (not a `Job`) — if the process exits, it restarts; the ConfigMap is the durable verdict, not the pod's exit code.
- GitHub Actions workflows (`workflow_run`):
  - `LONGHAUL - Build Test Driver Image` — builds and pushes to GHCR.
  - `LONGHAUL - Deploy Test Driver to AKS` — uses a long-lived ServiceAccount-token kubeconfig (`LONGHAUL_KUBECONFIG` repo secret) to roll the Deployment.
  - `LONGHAUL - Monitor` — hourly cron that pulls the report ConfigMap and posts a workflow summary; alerts on stale or failing reports.
- Unit tests for `config`, `journal`, `monitor`, `report`, `operations` — no cluster needed; run with `go test ./...` from `test/longhaul/`.
- Docs: `docs/designs/long-haul-test-design.md` (problem statement, eight failure modes, design principles, architecture, operations catalog with status, env-var table, phase tracker) and `test/longhaul/README.md` (how to run, what every env var does, why `Deployment` over `Job`).
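A minimal illustration of the weighted-selection step referenced in the scheduler bullet above; the interface and names are simplified stand-ins, not the exact scheduler.go types.

```go
package operations

import "math/rand"

// Operation is the minimal surface the scheduler needs; the real interface
// also carries cooldowns, outage policies, and an Execute method.
type Operation interface {
	Name() string
	Weight() int
	Precondition() bool // e.g. UpgradeDocumentDB skips when instancesPerNode < 2
}

// selectOperation picks one eligible operation with probability proportional
// to its weight, or returns nil when nothing is currently eligible.
func selectOperation(rng *rand.Rand, ops []Operation) Operation {
	eligible := make([]Operation, 0, len(ops))
	total := 0
	for _, op := range ops {
		if op.Precondition() && op.Weight() > 0 {
			eligible = append(eligible, op)
			total += op.Weight()
		}
	}
	if total == 0 {
		return nil
	}
	pick := rng.Intn(total)
	for _, op := range eligible {
		pick -= op.Weight()
		if pick < 0 {
			return op
		}
	}
	return nil // unreachable: pick is always consumed within total weight
}
```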
Why this is structured the way it is

- `Deployment`, not `Job`. A `Job` that finishes is a `Job` no one looks at. The `longhaul-report` ConfigMap is the durable verdict, and the monitor cron makes sure a human sees stale runs.

Live canary
I've stood up a long-running cluster — the driver Deployment lives there and gets re-rolled by the deploy workflow on every successful image build:
- The driver (and the `longhaul-report` ConfigMap it writes) runs in `documentdb-test-ns`.
- Logs: `kubectl logs deploy/longhaul-test -n documentdb-test-ns`
- Current verdict: `kubectl get cm longhaul-report -n documentdb-test-ns -o yaml`

What's intentionally out of scope
Tracked in the design doc's phase table:
- `OutagePolicy.AllowedDowntime` enforcement: the field is reserved on the struct; current verdicts use `AllowedWriteFailures` + `MustRecoverWithin`. Called out explicitly in the design doc.

How to review
If you only want to skim, three files give you the shape of the PR:
- `docs/designs/long-haul-test-design.md` — the why.
- `test/longhaul/cmd/longhaul/main.go` — the wiring.
- `test/longhaul/operations/scheduler.go` — the heart of the disruption loop.

If you want to run it locally:
```bash
cd test/longhaul
LONGHAUL_CLUSTER_NAME=documentdb-sample LONGHAUL_NAMESPACE=documentdb-test-ns \
  LONGHAUL_MAX_DURATION=10m \
  go run ./cmd/longhaul
```

Happy to walk through any area on a call.