diff --git a/docs/reservations/committed-resource-reservations.md b/docs/reservations/committed-resource-reservations.md index fe4b17cc0..e94b922a5 100644 --- a/docs/reservations/committed-resource-reservations.md +++ b/docs/reservations/committed-resource-reservations.md @@ -1,54 +1,48 @@ # Committed Resource Reservation System -The committed resource reservation system manages capacity commitments, i.e. strict reservation guarantees usable by projects. -When customers pre-commit to resource usage, Cortex reserves capacity on hypervisors to guarantee availability. -The system integrates with Limes (via the LIQUID protocol) to receive commitments, expose usage and capacity data, and provides acceptance/rejection feedback. - -## File Structure - -```text -internal/scheduling/reservations/commitments/ -├── config.go # Configuration (intervals, API flags, secrets) -├── controller.go # Reconciliation of reservations -├── syncer.go # Periodic sync task with Limes, ensures local state matches Limes' commitments -├── reservation_manager.go # Reservation CRUD operations -├── api.go # HTTP API initialization -├── api_change_commitments.go # Handle commitment changes from Limes and updates local reservations accordingly -├── api_report_usage.go # Report VM usage per project, accounting to commitments or PAYG -├── api_report_capacity.go # Report capacity per AZ -├── api_info.go # Readiness endpoint with versioning (of underlying flavor group configuration) -├── capacity.go # Capacity calculation from Hypervisor CRDs -├── usage.go # VM-to-commitment assignment logic -├── flavor_group_eligibility.go # Validates VMs belong to correct flavor groups -└── state.go # Commitment state helper functions -``` +Cortex reserves hypervisor capacity for customers who pre-commit resources (committed resources, CRs), and exposes usage and capacity data via APIs. 
+ + +- [Committed Resource Reservation System](#committed-resource-reservation-system) + - [Configuration and Observability](#configuration-and-observability) + - [Lifecycle Management](#lifecycle-management) + - [State (CRDs)](#state-crds) + - [CR Reservation Lifecycle](#cr-reservation-lifecycle) + - [VM Lifecycle](#vm-lifecycle) + - [Capacity Blocking](#capacity-blocking) + - [Change-Commitments API](#change-commitments-api) + - [Syncer Task](#syncer-task) + - [Controller (Reconciliation)](#controller-reconciliation) + - [Usage API](#usage-api) -## Operations +The CR reservation implementation is located in `internal/scheduling/reservations/commitments/`. Key components include: +- Controller logic (`controller.go`) +- API endpoints (`api_*.go`) +- Capacity and usage calculation logic (`capacity.go`, `usage.go`) +- Syncer for periodic state sync (`syncer.go`) -### Configuration +## Configuration and Observability -| Helm Value | Description | -|------------|-------------| -| `committedResourceEnableChangeCommitmentsAPI` | Enable/disable the change-commitments endpoint | -| `committedResourceEnableReportUsageAPI` | Enable/disable the usage reporting endpoint | -| `committedResourceEnableReportCapacityAPI` | Enable/disable the capacity reporting endpoint | -| `committedResourceRequeueIntervalActive` | How often to revalidate active reservations | -| `committedResourceRequeueIntervalRetry` | Retry interval when knowledge not ready | -| `committedResourceChangeAPIWatchReservationsTimeout` | Timeout waiting for reservations to become ready while processing commitment changes via API | -| `committedResourcePipelineDefault` | Default scheduling pipeline | -| `committedResourceFlavorGroupPipelines` | Map of flavor group to pipeline name | -| `committedResourceSyncInterval` | How often the syncer reconciles Limes commitments to Reservation CRDs | +**Configuration**: Helm values for intervals, API flags, and pipeline configuration are defined in 
`helm/bundles/cortex-nova/values.yaml`. Key configuration includes: +- API endpoint toggles (change-commitments, report-usage, report-capacity) — each endpoint can be disabled independently +- Reconciliation intervals (grace period, active monitoring) +- Scheduling pipeline selection per flavor group -Each API endpoint can be disabled independently. The periodic sync task can be disabled by removing it (`commitments-sync-task`) from the list of enabled tasks in the `cortex-nova` Helm chart. +**Metrics and Alerts**: Defined in `helm/bundles/cortex-nova/alerts/nova.alerts.yaml` with prefixes: +- `cortex_committed_resource_change_api_*` +- `cortex_committed_resource_usage_api_*` +- `cortex_committed_resource_capacity_api_*` -### Observability +## Lifecycle Management -Alerts and metrics are defined in `helm/bundles/cortex-nova/alerts/nova.alerts.yaml`. Key metric prefixes: -- `cortex_committed_resource_change_api_*` - Change API metrics -- `cortex_committed_resource_usage_api_*` - Usage API metrics -- `cortex_committed_resource_capacity_api_*` - Capacity API metrics +### State (CRDs) +Defined in `api/v1alpha1/reservation_types.go`, which contains definitions for CR reservations and failover reservations (see [./failover-reservations.md](./failover-reservations.md)). -## Architecture Overview +A reservation CRD represents a single reservation slot on a hypervisor, which holds multiple VMs. +A single CR entry typically refers to multiple reservation CRDs (slots). 
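As a toy illustration of that fan-out — entirely hypothetical, since the document does not specify how slot sizes are chosen — a CR entry can be thought of as ceil-divided into fixed-size slots, one Reservation CRD per slot:

```go
package main

import "fmt"

// slotsForCommitment is a hypothetical helper: it assumes a CR entry of
// `committed` resource units is split into fixed-size slots of `slotSize`
// units each, one Reservation CRD per slot. Neither the name nor the
// fixed-size assumption comes from the actual codebase.
func slotsForCommitment(committed, slotSize int64) int64 {
	if slotSize <= 0 {
		return 0
	}
	// Round up: a partially filled slot still needs its own reservation CRD.
	return (committed + slotSize - 1) / slotSize
}

func main() {
	fmt.Println(slotsForCommitment(300, 128)) // 3 reservation CRDs for 300 units
}
```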
+ + +### CR Reservation Lifecycle ```mermaid flowchart LR @@ -56,49 +50,151 @@ flowchart LR Res[(Reservation CRDs)] end - ChangeAPI[Change API] - UsageAPI[Usage API] Syncer[Syncer Task] + ChangeAPI[Change API] + CapacityAPI[Capacity API] Controller[Controller] + UsageAPI[Usage API] Scheduler[Scheduler API] ChangeAPI -->|CRUD| Res Syncer -->|CRUD| Res UsageAPI -->|read| Res + CapacityAPI -->|read| Res + CapacityAPI -->|capacity request| Scheduler Res -->|watch| Controller Controller -->|update spec/status| Res - Controller -->|placement request| Scheduler + Controller -->|reservation placement request| Scheduler ``` -Reservations are managed through the Change API, Syncer Task, and Controller reconciliation. The Usage API provides read-only access to report usage data back to Limes. +Reservations are managed through the Change API, Syncer Task, and Controller reconciliation. + +| Component | Event | Timing | Action | +|-----------|-------|--------|--------| +| **Change API / Syncer** | CR Create, Resize, Delete | Immediate/Hourly | Create/update/delete Reservation CRDs | +| **Controller** | Placement | On creation | Find host via scheduler API, set `TargetHost` | +| **Controller** | Optimize unused slots | >> minutes | Assign PAYG VMs or re-place reservations | + +### VM Lifecycle + +VM allocations are tracked within reservations: + +```mermaid +flowchart LR + subgraph State + Res[(Reservation CRDs)] + end + A[Nova Scheduler] -->|VM Create/Migrate/Resize| B[Scheduling Pipeline] + B -->|update Spec.Allocations| Res + Res -->|watch| Controller + Res -->|periodic reconcile| Controller + Controller -->|update Spec/Status.Allocations| Res +``` + +| Component | Event | Timing | Action | +|-----------|-------|--------|--------| +| **Scheduling Pipeline** | VM Create, Migrate, Resize | Immediate | Add VM to `Spec.Allocations` | +| **Controller** | Reservation CRD updated | `committedResourceRequeueIntervalGracePeriod` (default: 1 min) | Verify new VMs via Nova API; update 
`Status.Allocations` | +| **Controller** | Periodic check | `committedResourceRequeueIntervalActive` (default: 5 min) | Verify established VMs via Hypervisor CRD; remove gone VMs from `Spec.Allocations` | + +**Allocation fields**: +- `Spec.Allocations` — Expected VMs (written by the scheduling pipeline on placement) +- `Status.Allocations` — Confirmed VMs (written by the controller after verifying the VM is on the expected host) + +**VM allocation state diagram**: + +The controller uses two sources to verify VM allocations, depending on how recently the VM was placed: +- **Nova API** — used during the grace period (`committedResourceAllocationGracePeriod`, default: 15 min) where the VM may still be starting up; provides real-time host assignment +- **Hypervisor CRD** — used for established allocations; reflects the set of instances the hypervisor operator observes on the host + +```mermaid +stateDiagram-v2 + direction LR + [*] --> SpecOnly : placement (create, migrate, resize) + SpecOnly --> Confirmed : on expected host + SpecOnly --> WrongHost : on different host + SpecOnly --> [*] : not confirmed after grace period + Confirmed --> WrongHost : not on HV CRD, found elsewhere + Confirmed --> [*] : not on HV CRD, Nova 404 + WrongHost --> Confirmed : back on expected host + WrongHost --> [*] : VM gone (404) + WrongHost --> [*] : on wrong host > grace period + + state "Spec only (grace period)" as SpecOnly + state "Spec + Status (on expected host)" as Confirmed + state "Spec + Status (host mismatch)" as WrongHost +``` + +**Note**: VM allocations may not consume all resources of a reservation slot. A reservation with 128 GB may have VMs totaling only 96 GB if that fits the project's needs. Allocations may exceed reservation capacity (e.g., after VM resize). + +### Capacity Blocking + +**Blocking rules by allocation state:** + +| State | In HV Allocation? | Reservation must block? 
| +|---|---|---| +| No allocations | — | Full `Spec.Resources` | +| Confirmed (Spec + Status) | Yes — already subtracted | No — subtract from reservation block | +| Spec only (not yet running) | No — not yet on host | Yes — must remain in reservation block | + +**Formal calculation (stable state, `Spec.TargetHost == Status.Host`):** + +``` +confirmed = sum of resources for VMs in both Spec.Allocations and Status.Allocations +spec_only_unblocked = sum of resources for VMs in Spec.Allocations only, NOT having an active pessimistic blocking reservation on this host +remaining = max(0, Spec.Resources - confirmed) +block = max(remaining, spec_only_unblocked) +``` + +**Interaction with pessimistic blocking reservations:** + +When a VM is in flight (Nova choosing between candidates), a pessimistic blocking reservation exists on each candidate host. For any SpecOnly VM that has such a reservation on the same host, the pessimistic blocking reservation is the authority — the CR reservation must not double-count it. The `spec_only_unblocked` term excludes those VMs. + +See [pessimistic-blocking-reservations.md](./pessimistic-blocking-reservations.md) for the full interaction semantics. + +**Migration state (`Spec.TargetHost != Status.Host`):** + +When a reservation is being migrated to a new host, block the full `max(Spec.Resources, spec_only_unblocked)` on **both** hosts — no subtraction of confirmed VMs. VMs may be split across hosts mid-migration and the split is not reliably known from reservation data alone; conservatively blocking both hosts prevents overcommit during the transition. The over-blocking resolves once migration completes and `Spec.TargetHost == Status.Host` again. + +**Corner cases:** + +- **Confirmed VMs exceed reservation size** (e.g., after VM resize): `Spec.Resources - confirmed` goes negative. Clamp to `0` — otherwise the filter would add capacity back to the host. 
+ +- **Spec-only VM larger than remaining reservation** (e.g., confirmed VMs have consumed most of the slot, and a new VM awaiting startup is larger than what remains): `remaining < spec_only_unblocked`. Block `spec_only_unblocked` — the VM will consume those resources when it starts, and they are not yet in HV Allocation. + +- **VM live migration within a reservation** (VM moves away from the reservation's host): handled implicitly by `hv.Status.Allocation`. Libvirt reports resource consumption on both source and target during live migration, so both hosts' `hv.Status.Allocation` already reflects the in-flight state. No special filter logic needed. The reservation controller will eventually remove the VM from the reservation once it's confirmed on the wrong host past the grace period. ### Change-Commitments API -The change-commitments API receives batched commitment changes from Limes. A request can contain multiple commitment changes across different projects and flavor groups. The semantic is **all-or-nothing**: if any commitment in the batch cannot be fulfilled (e.g., insufficient capacity), the entire request is rejected and rolled back. +The change-commitments API receives batched commitment changes from Limes and manages reservations accordingly. + +**Request Semantics**: A request can contain multiple commitment changes across different projects and flavor groups. The semantic is **all-or-nothing** — if any commitment in the batch cannot be fulfilled (e.g., insufficient capacity), the entire request is rejected and rolled back. 
-Cortex performs CRUD operations on local Reservation CRDs to match the new desired state: +**Operations**: Cortex performs CRUD operations on local Reservation CRDs to match the new desired state: - Creates new reservations for increased commitment amounts -- Deletes existing reservations -- Cortex preserves existing reservations that already have VMs allocated when possible +- Deletes existing reservations for decreased commitments +- Preserves existing reservations that already have VMs allocated when possible ### Syncer Task -The syncer task runs periodically and fetches all commitments from Limes. It syncs the local Reservation CRD state to match Limes' view of commitments. +The syncer task runs periodically and syncs local Reservation CRD state to match Limes' view of commitments, correcting drift from missed API calls or restarts. ### Controller (Reconciliation) -The controller watches Reservation CRDs and performs reconciliation: +The controller watches Reservation CRDs and performs two types of reconciliation: -1. **For new reservations** (no target host assigned): - - Calls Cortex for scheduling to find a suitable host - - Assigns the target host and marks the reservation as Ready +**Placement** - Finds hosts for new reservations (calls scheduler API) -2. **For existing reservations** (already have a target host): - - Validates that allocated VMs are still on the expected host - - Updates allocations if VMs have migrated or been deleted - - Requeues for periodic revalidation +**Allocation Verification** - Tracks VM lifecycle on reservations. 
VMs take time to appear on a host after scheduling, so new allocations are verified more frequently via the Nova API for real-time status, while established allocations are verified via the Hypervisor CRD: +- New VMs (within `committedResourceAllocationGracePeriod`, default: 15 min): checked via Nova API every `committedResourceRequeueIntervalGracePeriod` (default: 1 min) +- Established VMs: checked via Hypervisor CRD every `committedResourceRequeueIntervalActive` (default: 5 min) +- Missing VMs: removed from `Spec.Allocations` after Nova API confirms 404 ### Usage API -This API reports for a given project the total committed resources and usage per flavor group. For each VM, it reports whether the VM accounts to a specific commitment or PAYG. This assignment is deterministic and may differ from the actual Cortex internal assignment used for scheduling. +For each flavor group `X` that accepts commitments, Cortex exposes three resource types: +- `hw_version_X_ram` — RAM in units of the smallest flavor in the group (`HandlesCommitments=true`) +- `hw_version_X_cores` — CPU cores derived from RAM via fixed ratio (`HandlesCommitments=false`) +- `hw_version_X_instances` — instance count (`HandlesCommitments=false`) +For each VM, the API reports whether it accounts to a specific commitment or PAYG. This assignment is deterministic and may differ from the actual Cortex internal assignment used for scheduling. \ No newline at end of file diff --git a/internal/scheduling/reservations/commitments/config.go b/internal/scheduling/reservations/commitments/config.go index 7a6c9005f..17d89c8c0 100644 --- a/internal/scheduling/reservations/commitments/config.go +++ b/internal/scheduling/reservations/commitments/config.go @@ -11,10 +11,17 @@ import ( type Config struct { - // RequeueIntervalActive is the interval for requeueing active reservations for verification. + // RequeueIntervalActive is the interval for requeueing active reservations for periodic verification. 
RequeueIntervalActive time.Duration `json:"committedResourceRequeueIntervalActive"` // RequeueIntervalRetry is the interval for requeueing when retrying after knowledge is not ready. RequeueIntervalRetry time.Duration `json:"committedResourceRequeueIntervalRetry"` + // AllocationGracePeriod is the time window after a VM is allocated to a reservation + // during which it's expected to appear on the target host. VMs not confirmed within + // this period are considered stale and removed from the reservation. + AllocationGracePeriod time.Duration `json:"committedResourceAllocationGracePeriod"` + // RequeueIntervalGracePeriod is the interval for requeueing when VMs are in grace period. + // Shorter than RequeueIntervalActive for faster verification of new allocations. + RequeueIntervalGracePeriod time.Duration `json:"committedResourceRequeueIntervalGracePeriod"` // PipelineDefault is the default pipeline used for scheduling committed resource reservations. PipelineDefault string `json:"committedResourcePipelineDefault"` @@ -68,6 +75,12 @@ func (c *Config) ApplyDefaults() { if c.RequeueIntervalRetry == 0 { c.RequeueIntervalRetry = defaults.RequeueIntervalRetry } + if c.RequeueIntervalGracePeriod == 0 { + c.RequeueIntervalGracePeriod = defaults.RequeueIntervalGracePeriod + } + if c.AllocationGracePeriod == 0 { + c.AllocationGracePeriod = defaults.AllocationGracePeriod + } if c.PipelineDefault == "" { c.PipelineDefault = defaults.PipelineDefault } @@ -88,6 +101,8 @@ func DefaultConfig() Config { return Config{ RequeueIntervalActive: 5 * time.Minute, RequeueIntervalRetry: 1 * time.Minute, + RequeueIntervalGracePeriod: 1 * time.Minute, + AllocationGracePeriod: 15 * time.Minute, PipelineDefault: "kvm-general-purpose-load-balancing", SchedulerURL: "http://localhost:8080/scheduler/nova/external", ChangeAPIWatchReservationsTimeout: 10 * time.Second, diff --git a/internal/scheduling/reservations/commitments/controller.go 
b/internal/scheduling/reservations/commitments/controller.go index d38c6e1d8..c32c6873d 100644 --- a/internal/scheduling/reservations/commitments/controller.go +++ b/internal/scheduling/reservations/commitments/controller.go @@ -21,11 +21,13 @@ import ( "sigs.k8s.io/controller-runtime/pkg/manager" "sigs.k8s.io/controller-runtime/pkg/predicate" + novaservers "github.com/gophercloud/gophercloud/v2/openstack/compute/v2/servers" + schedulerdelegationapi "github.com/cobaltcore-dev/cortex/api/external/nova" "github.com/cobaltcore-dev/cortex/api/v1alpha1" - "github.com/cobaltcore-dev/cortex/internal/knowledge/datasources/plugins/openstack/nova" "github.com/cobaltcore-dev/cortex/internal/knowledge/db" "github.com/cobaltcore-dev/cortex/internal/knowledge/extractor/plugins/compute" + schedulingnova "github.com/cobaltcore-dev/cortex/internal/scheduling/nova" "github.com/cobaltcore-dev/cortex/internal/scheduling/reservations" "github.com/cobaltcore-dev/cortex/pkg/multicluster" hv1 "github.com/cobaltcore-dev/openstack-hypervisor-operator/api/v1" @@ -44,6 +46,8 @@ type CommitmentReservationController struct { DB *db.DB // SchedulerClient for making scheduler API calls. SchedulerClient *reservations.SchedulerClient + // NovaClient for direct Nova API calls (real-time VM status). 
+ NovaClient schedulingnova.NovaClient } // Reconcile is part of the main kubernetes reconciliation loop which aims to @@ -96,13 +100,18 @@ func (r *CommitmentReservationController) Reconcile(ctx context.Context, req ctr if meta.IsStatusConditionTrue(res.Status.Conditions, v1alpha1.ReservationConditionReady) { logger.V(1).Info("reservation is active, verifying allocations") - // Verify all allocations in Spec against actual VM state from database - if err := r.reconcileAllocations(ctx, &res); err != nil { + // Verify all allocations in Spec against actual VM state + result, err := r.reconcileAllocations(ctx, &res) + if err != nil { logger.Error(err, "failed to reconcile allocations") return ctrl.Result{}, err } - // Requeue periodically to keep verifying allocations + // Requeue with appropriate interval based on allocation state + // Use shorter interval if there are allocations in grace period for faster verification + if result.HasAllocationsInGracePeriod { + return ctrl.Result{RequeueAfter: r.Conf.RequeueIntervalGracePeriod}, nil + } return ctrl.Result{RequeueAfter: r.Conf.RequeueIntervalActive}, nil } @@ -322,82 +331,211 @@ func (r *CommitmentReservationController) Reconcile(ctx context.Context, req ctr return ctrl.Result{}, nil } -// reconcileAllocations verifies all allocations in Spec against actual Nova VM state. +// reconcileAllocationsResult holds the outcome of allocation reconciliation. +type reconcileAllocationsResult struct { + // HasAllocationsInGracePeriod is true if any allocations are still in grace period. + HasAllocationsInGracePeriod bool +} + +// reconcileAllocations verifies all allocations in Spec against actual VM state. // It updates Status.Allocations based on the actual host location of each VM. -func (r *CommitmentReservationController) reconcileAllocations(ctx context.Context, res *v1alpha1.Reservation) error { +// For new allocations (within grace period), it uses the Nova API for real-time status. 
+// For older allocations, it uses the Hypervisor CRD to check if VM is on the expected host. +func (r *CommitmentReservationController) reconcileAllocations(ctx context.Context, res *v1alpha1.Reservation) (*reconcileAllocationsResult, error) { logger := LoggerFromContext(ctx).WithValues("component", "controller") + result := &reconcileAllocationsResult{} + now := time.Now() // Skip if no CommittedResourceReservation if res.Spec.CommittedResourceReservation == nil { - return nil + return result, nil } - // TODO trigger migrations of unused reservations (to PAYG VMs) - // Skip if no allocations to verify if len(res.Spec.CommittedResourceReservation.Allocations) == 0 { logger.V(1).Info("no allocations to verify", "reservation", res.Name) - return nil + return result, nil } - // Query all VMs for this project from the database - projectID := res.Spec.CommittedResourceReservation.ProjectID - serverMap, err := r.listServersByProjectID(ctx, projectID) - if err != nil { - return fmt.Errorf("failed to list servers for project %s: %w", projectID, err) + expectedHost := res.Status.Host + + // Fetch the Hypervisor CRD for the expected host (for older allocations) + var hypervisor hv1.Hypervisor + hvInstanceSet := make(map[string]bool) + if expectedHost != "" { + if err := r.Get(ctx, client.ObjectKey{Name: expectedHost}, &hypervisor); err != nil { + if client.IgnoreNotFound(err) != nil { + return nil, fmt.Errorf("failed to get hypervisor %s: %w", expectedHost, err) + } + // Hypervisor not found - all older allocations will be checked via Nova API fallback + logger.Info("hypervisor CRD not found", "host", expectedHost) + } else { + // Build set of all VM UUIDs on this hypervisor for O(1) lookup + // Include both active and inactive VMs - stopped/shelved VMs still consume the reservation slot + for _, inst := range hypervisor.Status.Instances { + hvInstanceSet[inst.ID] = true + } + logger.V(1).Info("fetched hypervisor instances", "host", expectedHost, "instanceCount", 
len(hvInstanceSet)) + } } - // initialize + // Initialize status if res.Status.CommittedResourceReservation == nil { res.Status.CommittedResourceReservation = &v1alpha1.CommittedResourceReservationStatus{} } // Build new Status.Allocations map based on actual VM locations newStatusAllocations := make(map[string]string) + // Track allocations to remove from Spec (stale/leaving VMs) + var allocationsToRemove []string + + for vmUUID, allocation := range res.Spec.CommittedResourceReservation.Allocations { + allocationAge := now.Sub(allocation.CreationTimestamp.Time) + isInGracePeriod := allocationAge < r.Conf.AllocationGracePeriod + + if isInGracePeriod { + // New allocation: use Nova API for real-time status + result.HasAllocationsInGracePeriod = true + + if r.NovaClient == nil { + // No Nova client - skip verification for now, retry later + logger.V(1).Info("Nova client not available, skipping new allocation verification", + "vm", vmUUID, + "allocationAge", allocationAge) + continue + } - for vmUUID := range res.Spec.CommittedResourceReservation.Allocations { - server, exists := serverMap[vmUUID] - if exists { - // VM found - record its actual host location - actualHost := server.OSEXTSRVATTRHost - newStatusAllocations[vmUUID] = actualHost - - logger.V(1).Info("verified VM allocation", - "vm", vmUUID, - "actualHost", actualHost, - "expectedHost", res.Status.Host) - } else { - // VM not found in database - logger.Info("VM not found in database", - "vm", vmUUID, - "reservation", res.Name, - "projectID", projectID) + server, err := r.NovaClient.Get(ctx, vmUUID) + if err != nil { + // VM not yet available in Nova (still spawning) - retry on next reconcile + logger.V(1).Info("VM not yet available in Nova API", + "vm", vmUUID, + "error", err.Error(), + "allocationAge", allocationAge) + // Keep in Spec, don't add to Status - will retry on next reconcile + continue + } - // TODO handle entering and leave event + actualHost := server.ComputeHost + switch { + case actualHost 
== expectedHost: + // VM is on expected host - confirmed running + newStatusAllocations[vmUUID] = actualHost + logger.V(1).Info("verified new VM allocation via Nova API", + "vm", vmUUID, + "actualHost", actualHost, + "allocationAge", allocationAge) + case actualHost != "": + // VM is on different host - migration scenario (log for now) + newStatusAllocations[vmUUID] = actualHost + logger.Info("VM on different host than expected (migration?)", + "vm", vmUUID, + "actualHost", actualHost, + "expectedHost", expectedHost, + "allocationAge", allocationAge) + default: + // VM not yet on any host - still spawning + logger.V(1).Info("VM not yet on host (spawning)", + "vm", vmUUID, + "status", server.Status, + "allocationAge", allocationAge) + // Keep in Spec, don't add to Status - will retry on next reconcile + } + } else { + // Older allocation: use Hypervisor CRD for verification + if hvInstanceSet[vmUUID] { + // VM found on expected hypervisor - confirmed running + newStatusAllocations[vmUUID] = expectedHost + logger.V(1).Info("verified VM allocation via Hypervisor CRD", + "vm", vmUUID, + "host", expectedHost) + } else { + // VM not found on expected hypervisor - check Nova API as fallback + if r.NovaClient != nil { + novaServer, err := r.NovaClient.Get(ctx, vmUUID) + if err == nil && novaServer.ComputeHost != "" { + // VM is on a different host past the grace period - remove from this reservation. + // The grace period is the window for VMs in transit; past that, a VM on the + // wrong host is no longer consuming this slot. 
+ logger.Info("VM on different host past grace period, removing from reservation", + "vm", vmUUID, + "actualHost", novaServer.ComputeHost, + "expectedHost", expectedHost, + "allocationAge", allocationAge) + // fall through to removal + } else { + var notFound *novaservers.ErrServerNotFound + if err != nil && !errors.As(err, ¬Found) { + // Transient Nova API error - skip removal to avoid evicting live VMs + return nil, fmt.Errorf("nova API error checking VM %s: %w", vmUUID, err) + } + // err is nil (VM has no host yet) or 404 (VM gone) - fall through to removal + logger.V(1).Info("Nova API confirmed VM not found or has no host", + "vm", vmUUID, + "error", err) + } + } + // VM not found on hypervisor and not in Nova - mark for removal (leaving VM) + allocationsToRemove = append(allocationsToRemove, vmUUID) + logger.Info("removing stale allocation (VM not found on hypervisor or Nova)", + "vm", vmUUID, + "reservation", res.Name, + "expectedHost", expectedHost, + "allocationAge", allocationAge, + "gracePeriod", r.Conf.AllocationGracePeriod) + } } } - // Patch the reservation status + // Patch the reservation old := res.DeepCopy() + specChanged := false + + // Remove stale allocations from Spec + if len(allocationsToRemove) > 0 { + for _, vmUUID := range allocationsToRemove { + delete(res.Spec.CommittedResourceReservation.Allocations, vmUUID) + } + specChanged = true + } // Update Status.Allocations res.Status.CommittedResourceReservation.Allocations = newStatusAllocations + // Patch Spec if changed (stale allocations removed) + if specChanged { + if err := r.Patch(ctx, res, client.MergeFrom(old)); err != nil { + if client.IgnoreNotFound(err) == nil { + return result, nil + } + return nil, fmt.Errorf("failed to patch reservation spec: %w", err) + } + // Re-fetch to get the updated resource version for status patch + if err := r.Get(ctx, client.ObjectKeyFromObject(res), res); err != nil { + if client.IgnoreNotFound(err) == nil { + return result, nil + } + return nil, 
fmt.Errorf("failed to re-fetch reservation: %w", err) + } + old = res.DeepCopy() + } + + // Patch Status patch := client.MergeFrom(old) if err := r.Status().Patch(ctx, res, patch); err != nil { - // Ignore not-found errors during background deletion if client.IgnoreNotFound(err) == nil { - // Object was deleted, no need to continue - return nil + return result, nil } - return fmt.Errorf("failed to patch reservation status: %w", err) + return nil, fmt.Errorf("failed to patch reservation status: %w", err) } logger.V(1).Info("reconciled allocations", "specAllocations", len(res.Spec.CommittedResourceReservation.Allocations), - "statusAllocations", len(newStatusAllocations)) + "statusAllocations", len(newStatusAllocations), + "removedAllocations", len(allocationsToRemove), + "hasAllocationsInGracePeriod", result.HasAllocationsInGracePeriod) - return nil + return result, nil } // getPipelineForFlavorGroup returns the pipeline name for a given flavor group. @@ -432,36 +570,20 @@ func (r *CommitmentReservationController) Init(ctx context.Context, client clien r.SchedulerClient = reservations.NewSchedulerClient(conf.SchedulerURL) logf.FromContext(ctx).Info("scheduler client initialized for commitment reservation controller", "url", conf.SchedulerURL) - return nil -} - -func (r *CommitmentReservationController) listServersByProjectID(ctx context.Context, projectID string) (map[string]*nova.Server, error) { - if r.DB == nil { - return nil, errors.New("database connection not initialized") - } - - logger := LoggerFromContext(ctx).WithValues("component", "controller") - - // Query servers from the database cache. 
- var servers []nova.Server - _, err := r.DB.Select(&servers, - "SELECT * FROM "+nova.Server{}.TableName()+" WHERE tenant_id = $1", - projectID) - if err != nil { - return nil, fmt.Errorf("failed to query servers from database: %w", err) - } - - logger.V(1).Info("queried servers from database", - "projectID", projectID, - "serverCount", len(servers)) - - // Build lookup map for O(1) access by VM UUID. - serverMap := make(map[string]*nova.Server, len(servers)) - for i := range servers { - serverMap[servers[i].ID] = &servers[i] + // Initialize Nova client for real-time VM status checks (optional). + // Skip if NovaClient is already set (e.g., injected for testing) or if keystone not configured. + if r.NovaClient == nil && conf.KeystoneSecretRef.Name != "" { + r.NovaClient = schedulingnova.NewNovaClient() + if err := r.NovaClient.Init(ctx, client, schedulingnova.NovaClientConfig{ + KeystoneSecretRef: conf.KeystoneSecretRef, + SSOSecretRef: conf.SSOSecretRef, + }); err != nil { + return fmt.Errorf("failed to initialize Nova client: %w", err) + } + logf.FromContext(ctx).Info("Nova client initialized for commitment reservation controller") } - return serverMap, nil + return nil } // commitmentReservationPredicate filters to only watch commitment reservations. 
diff --git a/internal/scheduling/reservations/commitments/controller_test.go b/internal/scheduling/reservations/commitments/controller_test.go index 5af5dfca9..68d13faf3 100644 --- a/internal/scheduling/reservations/commitments/controller_test.go +++ b/internal/scheduling/reservations/commitments/controller_test.go @@ -18,6 +18,7 @@ import ( "k8s.io/apimachinery/pkg/runtime" "k8s.io/apimachinery/pkg/types" ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client" "sigs.k8s.io/controller-runtime/pkg/client/fake" schedulerdelegationapi "github.com/cobaltcore-dev/cortex/api/external/nova" @@ -140,6 +141,225 @@ func TestCommitmentReservationController_Reconcile(t *testing.T) { } } +// ============================================================================ +// Test: reconcileAllocations +// ============================================================================ + +// Note: Full reconcileAllocations tests require mocking NovaClient, which uses +// unexported types (nova.server, nova.migration). Tests for the Nova API path +// would need to be placed in the nova package or the types would need to be exported. +// For now, we test only the Hypervisor CRD path (when NovaClient is nil). 
+
+func TestReconcileAllocations_HypervisorCRDPath(t *testing.T) {
+	scheme := runtime.NewScheme()
+	if err := v1alpha1.AddToScheme(scheme); err != nil {
+		t.Fatalf("Failed to add scheme: %v", err)
+	}
+	if err := hv1.AddToScheme(scheme); err != nil {
+		t.Fatalf("Failed to add hypervisor scheme: %v", err)
+	}
+
+	now := time.Now()
+	recentTime := metav1.NewTime(now.Add(-5 * time.Minute)) // 5 minutes ago (within grace period)
+	oldTime := metav1.NewTime(now.Add(-30 * time.Minute))   // 30 minutes ago (past grace period)
+
+	tests := []struct {
+		name                         string
+		reservation                  *v1alpha1.Reservation
+		hypervisor                   *hv1.Hypervisor
+		config                       Config
+		expectedStatusAllocations    map[string]string
+		expectedHasGracePeriodAllocs bool
+	}{
+		{
+			name: "old allocation - VM found on hypervisor CRD",
+			reservation: newTestCRReservation(map[string]metav1.Time{
+				"vm-1": oldTime,
+			}),
+			hypervisor: newTestHypervisorCRD("host-1", []hv1.Instance{
+				{ID: "vm-1", Name: "vm-1", Active: true},
+			}),
+			config:                       Config{AllocationGracePeriod: 15 * time.Minute},
+			expectedStatusAllocations:    map[string]string{"vm-1": "host-1"},
+			expectedHasGracePeriodAllocs: false,
+		},
+		{
+			name: "old allocation - inactive VM still counted (stopped/shelved)",
+			reservation: newTestCRReservation(map[string]metav1.Time{
+				"vm-stopped": oldTime,
+			}),
+			hypervisor: newTestHypervisorCRD("host-1", []hv1.Instance{
+				{ID: "vm-stopped", Name: "vm-stopped", Active: false}, // Inactive VM should still be found
+			}),
+			config:                       Config{AllocationGracePeriod: 15 * time.Minute},
+			expectedStatusAllocations:    map[string]string{"vm-stopped": "host-1"},
+			expectedHasGracePeriodAllocs: false,
+		},
+		{
+			name: "old allocation - VM not on hypervisor CRD (no NovaClient fallback)",
+			reservation: newTestCRReservation(map[string]metav1.Time{
+				"vm-1": oldTime,
+			}),
+			hypervisor:                   newTestHypervisorCRD("host-1", []hv1.Instance{}), // Empty
+			config:                       Config{AllocationGracePeriod: 15 * time.Minute},
+			expectedStatusAllocations:    map[string]string{}, // Not confirmed
+			expectedHasGracePeriodAllocs: false,
+		},
+		{
+			name: "new allocation within grace period - no Nova client",
+			reservation: newTestCRReservation(map[string]metav1.Time{
+				"vm-1": recentTime,
+			}),
+			hypervisor:                   nil,
+			config:                       Config{AllocationGracePeriod: 15 * time.Minute},
+			expectedStatusAllocations:    map[string]string{}, // Can't verify without Nova
+			expectedHasGracePeriodAllocs: true,
+		},
+		{
+			name: "mixed allocations - old verified via CRD, new in grace period",
+			reservation: newTestCRReservation(map[string]metav1.Time{
+				"vm-new": recentTime, // In grace period
+				"vm-old": oldTime,    // Past grace period
+			}),
+			hypervisor: newTestHypervisorCRD("host-1", []hv1.Instance{
+				{ID: "vm-old", Name: "vm-old", Active: true},
+			}),
+			config:                       Config{AllocationGracePeriod: 15 * time.Minute},
+			expectedStatusAllocations:    map[string]string{"vm-old": "host-1"}, // Only old one confirmed via CRD
+			expectedHasGracePeriodAllocs: true,
+		},
+		{
+			name:                         "empty allocations - no work to do",
+			reservation:                  newTestCRReservation(map[string]metav1.Time{}),
+			hypervisor:                   nil,
+			config:                       Config{AllocationGracePeriod: 15 * time.Minute},
+			expectedStatusAllocations:    map[string]string{},
+			expectedHasGracePeriodAllocs: false,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			// Build fake client with objects
+			objects := []client.Object{tt.reservation}
+			if tt.hypervisor != nil {
+				objects = append(objects, tt.hypervisor)
+			}
+
+			k8sClient := fake.NewClientBuilder().
+				WithScheme(scheme).
+				WithObjects(objects...).
+				WithStatusSubresource(&v1alpha1.Reservation{}).
+				Build()
+
+			controller := &CommitmentReservationController{
+				Client:     k8sClient,
+				Scheme:     scheme,
+				Conf:       tt.config,
+				NovaClient: nil, // No NovaClient - testing Hypervisor CRD path only
+			}
+
+			ctx := WithNewGlobalRequestID(context.Background())
+			result, err := controller.reconcileAllocations(ctx, tt.reservation)
+			if err != nil {
+				t.Fatalf("reconcileAllocations() error = %v", err)
+			}
+
+			// Check grace period result
+			if result.HasAllocationsInGracePeriod != tt.expectedHasGracePeriodAllocs {
+				t.Errorf("expected HasAllocationsInGracePeriod=%v, got %v",
+					tt.expectedHasGracePeriodAllocs, result.HasAllocationsInGracePeriod)
+			}
+
+			// Re-fetch reservation to check updates
+			var updated v1alpha1.Reservation
+			if err := k8sClient.Get(ctx, client.ObjectKeyFromObject(tt.reservation), &updated); err != nil {
+				t.Fatalf("failed to get updated reservation: %v", err)
+			}
+
+			// Check status allocations
+			actualStatusAllocs := map[string]string{}
+			if updated.Status.CommittedResourceReservation != nil {
+				actualStatusAllocs = updated.Status.CommittedResourceReservation.Allocations
+			}
+
+			if len(actualStatusAllocs) != len(tt.expectedStatusAllocations) {
+				t.Errorf("expected %d status allocations, got %d: %v",
+					len(tt.expectedStatusAllocations), len(actualStatusAllocs), actualStatusAllocs)
+			}
+
+			for vmUUID, expectedHost := range tt.expectedStatusAllocations {
+				if actualHost, ok := actualStatusAllocs[vmUUID]; !ok {
+					t.Errorf("expected VM %s in status allocations", vmUUID)
+				} else if actualHost != expectedHost {
+					t.Errorf("VM %s: expected host %s, got %s", vmUUID, expectedHost, actualHost)
+				}
+			}
+		})
+	}
+}
+
+// newTestCRReservation creates a test CR reservation with allocations on "host-1".
+func newTestCRReservation(allocations map[string]metav1.Time) *v1alpha1.Reservation {
+	const host = "host-1"
+	specAllocs := make(map[string]v1alpha1.CommittedResourceAllocation)
+	for vmUUID, timestamp := range allocations {
+		specAllocs[vmUUID] = v1alpha1.CommittedResourceAllocation{
+			CreationTimestamp: timestamp,
+			Resources: map[hv1.ResourceName]resource.Quantity{
+				"memory": resource.MustParse("4Gi"),
+				"cpu":    resource.MustParse("2"),
+			},
+		}
+	}
+
+	return &v1alpha1.Reservation{
+		ObjectMeta: metav1.ObjectMeta{
+			Name: "test-reservation",
+		},
+		Spec: v1alpha1.ReservationSpec{
+			Type:       v1alpha1.ReservationTypeCommittedResource,
+			TargetHost: host,
+			CommittedResourceReservation: &v1alpha1.CommittedResourceReservationSpec{
+				ProjectID:    "test-project",
+				ResourceName: "test-flavor",
+				Allocations:  specAllocs,
+			},
+		},
+		Status: v1alpha1.ReservationStatus{
+			Host: host,
+			Conditions: []metav1.Condition{
+				{
+					Type:   v1alpha1.ReservationConditionReady,
+					Status: metav1.ConditionTrue,
+					Reason: "ReservationActive",
+				},
+			},
+			CommittedResourceReservation: &v1alpha1.CommittedResourceReservationStatus{
+				Allocations: make(map[string]string),
+			},
+		},
+	}
+}
+
+// newTestHypervisorCRD creates a test Hypervisor CRD with instances.
+//
+//nolint:unparam // name parameter allows future test flexibility
+func newTestHypervisorCRD(name string, instances []hv1.Instance) *hv1.Hypervisor {
+	return &hv1.Hypervisor{
+		ObjectMeta: metav1.ObjectMeta{
+			Name: name,
+		},
+		Status: hv1.HypervisorStatus{
+			Instances: instances,
+		},
+	}
+}
+
+// ============================================================================
+// Test: reconcileInstanceReservation_Success (existing test)
+// ============================================================================
+
 func TestCommitmentReservationController_reconcileInstanceReservation_Success(t *testing.T) {
 	scheme := runtime.NewScheme()
 	if err := v1alpha1.AddToScheme(scheme); err != nil {