
e2e: backup spec matrix is too thin — only 4 specs covering minimal happy path #368

Context

PR #346 lands a unified Go/Ginkgo E2E suite. Auditing the backup matrix row, I found the suite contains only 4 specs total, all of them happy-path. CI shows tests=4, disabled=0, skipped=0 once ginkgo is scoped to ./tests/backup (the larger tests=60 disabled=52 figure visible on earlier runs came from junit aggregating the other packages compiled by -r ./tests/... and filtered out by label, not from backup specs being skipped).

Current backup specs

In test/e2e/tests/backup/:

  1. backup_ondemand_test.go — create on-demand Backup, wait for completion
  2. backup_scheduled_test.go — create ScheduledBackup, observe one child Backup
  3. restore_from_backup_test.go — restore from a completed Backup
  4. restore_from_pv_test.go — restore from PV snapshot

All four are gated with MediumLevelLabel + SkipUnlessLevel(Medium), and CI runs the full set at TEST_DEPTH=Medium.
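
For reference, the gating looks roughly like the sketch below. MediumLevelLabel and SkipUnlessLevel are the suite's own helpers; the minimal stand-in definitions here are assumptions for illustration only.

```go
package backup_test

import (
	"os"

	. "github.com/onsi/ginkgo/v2"
)

// Stand-ins for the suite's real helpers; the actual implementations live in
// the shared e2e framework and may differ (e.g. hierarchical depth comparison
// rather than an exact string match).
var MediumLevelLabel = Label("Medium")

func SkipUnlessLevel(level string) {
	if os.Getenv("TEST_DEPTH") != level {
		Skip("TEST_DEPTH is below " + level + "; skipping")
	}
}

var _ = Describe("on-demand backup", MediumLevelLabel, func() {
	BeforeEach(func() { SkipUnlessLevel("Medium") })

	It("creates a Backup and waits for it to complete", func() {
		// ... existing spec body ...
	})
})
```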

Coverage gaps

Behaviours not covered today, in roughly increasing implementation cost:

  • Multi-replica source cluster — every existing backup spec uses 1 instance. Snapshotting a 3-instance cluster has different timing semantics (which pod's PV gets snapshotted, what happens to standbys mid-snapshot).
  • Concurrent writes during snapshot — establish a writer, take a snapshot, restore, verify the restore boundary is consistent (no torn writes).
  • Retention / cleanup of old backups under the ScheduledBackup retention policy: create N backups, advance time, verify the oldest are pruned (sketched after this list).
  • Restore to a differently-sized cluster — restore a 3-instance backup into a 1-instance target (or vice-versa).
  • Restore-to-different-PVC-size — backup at 1Gi, restore into a CR requesting 5Gi; verify the new PVC honors the request.
  • Failed-snapshot recovery — what happens when the CSI snapshot itself errors mid-flight? Does the Backup CR settle on Failed and is it cleaned up?
  • Concurrent on-demand Backups — two Backup CRs for the same DocumentDB created simultaneously.
  • Empty-cluster backup/restore — backup a freshly-provisioned cluster with no DBs created, restore, verify the restored cluster is usable.
  • CRD validation — invalid Backup / ScheduledBackup specs (missing cluster.name, malformed schedule, retention < 0) get rejected by the API server.
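
To make the retention gap concrete, a spec could look roughly like the sketch below. It assumes the suite's shared controller-runtime client (k8sClient here); the group/version/kind, namespace, and retention limit are placeholders, not the operator's real API.

```go
package backup_test

import (
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var _ = Describe("scheduled backup retention", func() {
	// Hypothetical spec: once the ScheduledBackup has produced more children
	// than its retention limit allows, the controller should prune the oldest.
	// k8sClient is assumed to be the suite's shared controller-runtime client.
	It("prunes child Backups beyond the retention limit", func(ctx SpecContext) {
		const retentionLimit = 3 // placeholder value

		Eventually(func(g Gomega) {
			var backups unstructured.UnstructuredList
			backups.SetGroupVersionKind(schema.GroupVersionKind{
				Group:   "documentdb.example.com", // placeholder group
				Version: "v1",
				Kind:    "BackupList",
			})
			g.Expect(k8sClient.List(ctx, &backups, client.InNamespace("backup-e2e"))).To(Succeed())
			g.Expect(len(backups.Items)).To(BeNumerically("<=", retentionLimit))
		}).WithTimeout(10 * time.Minute).WithPolling(15 * time.Second).Should(Succeed())
	})
})
```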

Why it matters

Backup is one of the few features in this operator where a regression silently corrupts user data instead of failing loudly. Four happy-path specs is a thin gate for the only feature where "passes E2E" needs to mean "we trust your data round-trips". As tests/backup is also one of the slowest matrix rows (CSI snapshot + restore + healthcheck per spec), expanding it should be done with fixture sharing in mind — e.g., a per-suite shared writable cluster + a dedicated PV-snapshot fixture rather than a fresh CR per spec.
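
One way to get that sharing with Ginkgo is a suite-level fixture along these lines; createSourceCluster / deleteSourceCluster are hypothetical helpers standing in for whatever the framework actually exposes.

```go
package backup_test

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// Name of the shared writable source cluster, provisioned once per suite run.
var sharedCluster string

var _ = SynchronizedBeforeSuite(
	// Runs once, on process 1: provision a single writable DocumentDB cluster.
	func(ctx SpecContext) []byte {
		name, err := createSourceCluster(ctx) // hypothetical framework helper
		Expect(err).NotTo(HaveOccurred())
		return []byte(name)
	},
	// Runs on every parallel process: record the shared cluster's name.
	func(ctx SpecContext, data []byte) {
		sharedCluster = string(data)
	},
)

var _ = SynchronizedAfterSuite(func() {}, func(ctx SpecContext) {
	// Runs once, on process 1, after all specs: tear the shared cluster down.
	Expect(deleteSourceCluster(ctx, sharedCluster)).To(Succeed()) // hypothetical helper
})
```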

Suggested next steps

  1. Add the multi-replica + concurrent-writes spec first (highest signal per spec).
  2. Add retention/cleanup next (currently zero coverage of the controller's pruning loop).
  3. Add the negative/CRD-validation specs in a tests/backup/invalid_test.go (cheap, fast, no fixtures; sketched below).
  4. Defer cross-size restore + failed-snapshot until the simpler ones are in.
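
For step 3, the invalid_test.go specs can be written against unstructured objects so they don't depend on the operator's generated types. The group/version and field names below are assumptions to be swapped for the real CRD, and the spec only passes if the CRD schema or a validating webhook actually rejects the malformed value.

```go
package backup_test

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

var _ = Describe("invalid Backup/ScheduledBackup specs", func() {
	It("rejects a ScheduledBackup with a malformed schedule", func(ctx SpecContext) {
		sb := &unstructured.Unstructured{}
		sb.SetGroupVersionKind(schema.GroupVersionKind{
			Group:   "documentdb.example.com", // placeholder; use the operator's real group
			Version: "v1",
			Kind:    "ScheduledBackup",
		})
		sb.SetName("invalid-schedule")
		sb.SetNamespace("backup-e2e") // placeholder namespace
		Expect(unstructured.SetNestedField(sb.Object, "not-a-cron-expr", "spec", "schedule")).To(Succeed())

		// k8sClient is assumed to be the suite's shared controller-runtime client.
		err := k8sClient.Create(ctx, sb)
		Expect(err).To(HaveOccurred())
		Expect(apierrors.IsInvalid(err) || apierrors.IsBadRequest(err)).To(BeTrue())
	})
})
```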
