cobaltcore-dev · mblos · Mar 26, 2026
@@ -1,104 +1,200 @@
 # Committed Resource Reservation System
 
-The committed resource reservation system manages capacity commitments, i.e. strict reservation guarantees usable by projects. 
-When customers pre-commit to resource usage, Cortex reserves capacity on hypervisors to guarantee availability.
-The system integrates with Limes (via the LIQUID protocol) to receive commitments, expose usage and capacity data, and provides acceptance/rejection feedback.
-
-## File Structure
-
-```text
-internal/scheduling/reservations/commitments/
-├── config.go                          # Configuration (intervals, API flags, secrets)
-├── controller.go                      # Reconciliation of reservations
-├── syncer.go                          # Periodic sync task with Limes, ensures local state matches Limes' commitments
-├── reservation_manager.go             # Reservation CRUD operations
-├── api.go                             # HTTP API initialization
-├── api_change_commitments.go          # Handle commitment changes from Limes and updates local reservations accordingly
-├── api_report_usage.go                # Report VM usage per project, accounting to commitments or PAYG
-├── api_report_capacity.go             # Report capacity per AZ
-├── api_info.go                        # Readiness endpoint with versioning (of underlying flavor group configuration)
-├── capacity.go                        # Capacity calculation from Hypervisor CRDs
-├── usage.go                           # VM-to-commitment assignment logic
-├── flavor_group_eligibility.go        # Validates VMs belong to correct flavor groups
-└── state.go                           # Commitment state helper functions
-```
+Cortex reserves hypervisor capacity for customers who pre-commit resources (committed resources, CRs), and exposes usage and capacity data via APIs.
+
+
+- [Committed Resource Reservation System](#committed-resource-reservation-system)
+  - [Configuration and Observability](#configuration-and-observability)
+  - [Lifecycle Management](#lifecycle-management)
+    - [State (CRDs)](#state-crds)
+    - [CR Reservation Lifecycle](#cr-reservation-lifecycle)
+    - [VM Lifecycle](#vm-lifecycle)
+    - [Capacity Blocking](#capacity-blocking)
+    - [Change-Commitments API](#change-commitments-api)
+    - [Syncer Task](#syncer-task)
+    - [Controller (Reconciliation)](#controller-reconciliation)
+    - [Usage API](#usage-api)
 
-## Operations
+The CR reservation implementation is located in `internal/scheduling/reservations/commitments/`. Key components include:
+- Controller logic (`controller.go`)
+- API endpoints (`api_*.go`)
+- Capacity and usage calculation logic (`capacity.go`, `usage.go`)
+- Syncer for periodic state sync (`syncer.go`)
 
-### Configuration
+## Configuration and Observability
 
-| Helm Value | Description |
-|------------|-------------|
-| `committedResourceEnableChangeCommitmentsAPI` | Enable/disable the change-commitments endpoint |
-| `committedResourceEnableReportUsageAPI` | Enable/disable the usage reporting endpoint |
-| `committedResourceEnableReportCapacityAPI` | Enable/disable the capacity reporting endpoint |
-| `committedResourceRequeueIntervalActive` | How often to revalidate active reservations |
-| `committedResourceRequeueIntervalRetry` | Retry interval when knowledge not ready |
-| `committedResourceChangeAPIWatchReservationsTimeout` | Timeout waiting for reservations to become ready while processing commitment changes via API |
-| `committedResourcePipelineDefault` | Default scheduling pipeline |
-| `committedResourceFlavorGroupPipelines` | Map of flavor group to pipeline name |
-| `committedResourceSyncInterval` | How often the syncer reconciles Limes commitments to Reservation CRDs |
+**Configuration**: Helm values for intervals, API flags, and pipeline configuration are defined in `helm/bundles/cortex-nova/values.yaml`. Key configuration includes:
+- API endpoint toggles (change-commitments, report-usage, report-capacity) — each endpoint can be disabled independently
+- Reconciliation intervals (grace period, active monitoring)
+- Scheduling pipeline selection per flavor group
 
-Each API endpoint can be disabled independently. The periodic sync task can be disabled by removing it (`commitments-sync-task`) from the list of enabled tasks in the `cortex-nova` Helm chart.
+**Metrics and Alerts**: Defined in `helm/bundles/cortex-nova/alerts/nova.alerts.yaml` with prefixes:
+- `cortex_committed_resource_change_api_*`
+- `cortex_committed_resource_usage_api_*`
+- `cortex_committed_resource_capacity_api_*`
 
-### Observability
+## Lifecycle Management
 
-Alerts and metrics are defined in `helm/bundles/cortex-nova/alerts/nova.alerts.yaml`. Key metric prefixes:
-- `cortex_committed_resource_change_api_*` - Change API metrics
-- `cortex_committed_resource_usage_api_*` - Usage API metrics
-- `cortex_committed_resource_capacity_api_*` - Capacity API metrics
+### State (CRDs)
+The Reservation CRD is defined in `api/v1alpha1/reservation_types.go`, which contains definitions for both CR reservations and failover reservations (see [./failover-reservations.md](./failover-reservations.md)).
 
-## Architecture Overview
+A Reservation CRD represents a single reservation slot on a hypervisor. One slot can hold multiple VMs.
+A single CR entry typically maps to multiple Reservation CRDs (slots).
+
+
+### CR Reservation Lifecycle
 
 ```mermaid
 flowchart LR
     subgraph State
         Res[(Reservation CRDs)]
     end
 
-    ChangeAPI[Change API]
-    UsageAPI[Usage API]
     Syncer[Syncer Task]
+    ChangeAPI[Change API]
+    CapacityAPI[Capacity API]
     Controller[Controller]
+    UsageAPI[Usage API]
     Scheduler[Scheduler API]
 
     ChangeAPI -->|CRUD| Res
     Syncer -->|CRUD| Res
     UsageAPI -->|read| Res
+    CapacityAPI -->|read| Res
+    CapacityAPI -->|capacity request| Scheduler
     Res -->|watch| Controller
     Controller -->|update spec/status| Res
-    Controller -->|placement request| Scheduler
+    Controller -->|reservation placement request| Scheduler
 ```
 
-Reservations are managed through the Change API, Syncer Task, and Controller reconciliation. The Usage API provides read-only access to report usage data back to Limes.
+Reservations are managed through the Change API, Syncer Task, and Controller reconciliation.
+
+| Component | Event | Timing | Action |
+|-----------|-------|--------|--------|
+| **Change API / Syncer** | CR Create, Resize, Delete | Immediate/Hourly | Create/update/delete Reservation CRDs |
+| **Controller** | Placement | On creation | Find host via scheduler API, set `TargetHost` |
+| **Controller** | Optimize unused slots | >> minutes | Assign PAYG VMs or re-place reservations |
+
+### VM Lifecycle
+
+VM allocations are tracked within reservations:
+
+```mermaid
+flowchart LR
+    subgraph State
+        Res[(Reservation CRDs)]
+    end
+    A[Nova Scheduler] -->|VM Create/Migrate/Resize| B[Scheduling Pipeline]
+    B -->|update Spec.Allocations| Res
+    Res -->|watch| Controller
+    Res -->|periodic reconcile| Controller
+    Controller -->|update Spec/Status.Allocations| Res
+```
+
+| Component | Event | Timing | Action |
+|-----------|-------|--------|--------|
+| **Scheduling Pipeline** | VM Create, Migrate, Resize | Immediate | Add VM to `Spec.Allocations` |
+| **Controller** | Reservation CRD updated | `committedResourceRequeueIntervalGracePeriod` (default: 1 min) | Verify new VMs via Nova API; update `Status.Allocations` |
+| **Controller** | Periodic check | `committedResourceRequeueIntervalActive` (default: 5 min) | Verify established VMs via Hypervisor CRD; remove gone VMs from `Spec.Allocations` |
+
+**Allocation fields**:
+- `Spec.Allocations` — Expected VMs (written by the scheduling pipeline on placement)
+- `Status.Allocations` — Confirmed VMs (written by the controller after verifying the VM is on the expected host)
+
+**VM allocation state diagram**:
+
+The controller uses two sources to verify VM allocations, depending on how recently the VM was placed:
+- **Nova API** - used while the VM is within the grace period (`committedResourceAllocationGracePeriod`, default: 15 min) and may still be starting up; provides real-time host assignment
+- **Hypervisor CRD** - used for established allocations; reflects the set of instances the hypervisor operator observes on the host
+
+```mermaid
+stateDiagram-v2
+    direction LR
+    [*] --> SpecOnly : placement (create, migrate, resize)
+    SpecOnly --> Confirmed : on expected host
+    SpecOnly --> WrongHost : on different host
+    SpecOnly --> [*] : not confirmed after grace period
+    Confirmed --> WrongHost : not on HV CRD, found elsewhere
+    Confirmed --> [*] : not on HV CRD, Nova 404
+    WrongHost --> Confirmed : back on expected host
+    WrongHost --> [*] : VM gone (404)
+    WrongHost --> [*] : on wrong host > grace period
+
+    state "Spec only (grace period)" as SpecOnly
+    state "Spec + Status (on expected host)" as Confirmed
+    state "Spec + Status (host mismatch)" as WrongHost
+```
+
+**Note**: VM allocations do not necessarily consume all resources of a reservation slot. For example, a reservation with 128 GB may hold VMs totaling only 96 GB if that fits the project's needs. Conversely, allocations can exceed the reservation's capacity (e.g., after a VM resize).
+
+### Capacity Blocking
+
+**Blocking rules by allocation state:**
+
+| State | In HV Allocation? | Reservation must block? |
+|---|---|---|
+| No allocations | — | Full `Spec.Resources` |
+| Confirmed (Spec + Status) | Yes — already subtracted | No — subtract from reservation block |
+| Spec only (not yet running) | No — not yet on host | Yes — must remain in reservation block |
+
+**Formal calculation (stable state, `Spec.TargetHost == Status.Host`):**
+
+```
+confirmed            = sum of resources for VMs in both Spec.Allocations and Status.Allocations
+spec_only_unblocked  = sum of resources for VMs in Spec.Allocations only, NOT having an active pessimistic blocking reservation on this host
+remaining            = max(0, Spec.Resources - confirmed)
+block                = max(remaining, spec_only_unblocked)
+```
+
+**Interaction with pessimistic blocking reservations:**
+
+When a VM is in flight (Nova choosing between candidates), a pessimistic blocking reservation exists on each candidate host. For any SpecOnly VM that has such a reservation on the same host, the pessimistic blocking reservation is the authority — the CR reservation must not double-count it. The `spec_only_unblocked` term excludes those VMs.
+
+See [pessimistic-blocking-reservations.md](./pessimistic-blocking-reservations.md) for the full interaction semantics.
+
+**Migration state (`Spec.TargetHost != Status.Host`):**
+
+When a reservation is being migrated to a new host, block the full `max(Spec.Resources, spec_only_unblocked)` on **both** hosts — no subtraction of confirmed VMs. VMs may be split across hosts mid-migration and the split is not reliably known from reservation data alone; conservatively blocking both hosts prevents overcommit during the transition. The over-blocking resolves once migration completes and `Spec.TargetHost == Status.Host` again.
+
+**Corner cases:**
+
+- **Confirmed VMs exceed reservation size** (e.g., after VM resize): `Spec.Resources - confirmed` goes negative. Clamp to `0` — otherwise the filter would add capacity back to the host.
+
+- **Spec-only VM larger than remaining reservation** (e.g., confirmed VMs have consumed most of the slot, and a new VM awaiting startup is larger than what remains): `remaining < spec_only_unblocked`. Block `spec_only_unblocked` — the VM will consume those resources when it starts, and they are not yet in HV Allocation.
+
+- **VM live migration within a reservation** (VM moves away from the reservation's host): handled implicitly by `hv.Status.Allocation`. Libvirt reports resource consumption on both source and target during live migration, so both hosts' `hv.Status.Allocation` already reflects the in-flight state. No special filter logic needed. The reservation controller will eventually remove the VM from the reservation once it's confirmed on the wrong host past the grace period.
 
 ### Change-Commitments API
 
-The change-commitments API receives batched commitment changes from Limes. A request can contain multiple commitment changes across different projects and flavor groups. The semantic is **all-or-nothing**: if any commitment in the batch cannot be fulfilled (e.g., insufficient capacity), the entire request is rejected and rolled back.
+The change-commitments API receives batched commitment changes from Limes and manages reservations accordingly.
+
+**Request Semantics**: A request can contain multiple commitment changes across different projects and flavor groups. The semantics are **all-or-nothing**: if any commitment in the batch cannot be fulfilled (e.g., insufficient capacity), the entire request is rejected and rolled back.
 
-Cortex performs CRUD operations on local Reservation CRDs to match the new desired state:
+**Operations**: Cortex performs CRUD operations on local Reservation CRDs to match the new desired state:
 - Creates new reservations for increased commitment amounts
-- Deletes existing reservations
-- Cortex preserves existing reservations that already have VMs allocated when possible
+- Deletes existing reservations for decreased commitments
+- Preserves existing reservations that already have VMs allocated when possible
 
 ### Syncer Task
 
-The syncer task runs periodically and fetches all commitments from Limes. It syncs the local Reservation CRD state to match Limes' view of commitments.
+The syncer task runs periodically and syncs local Reservation CRD state to match Limes' view of commitments, correcting drift from missed API calls or restarts.
 
 ### Controller (Reconciliation)
 
-The controller watches Reservation CRDs and performs reconciliation:
+The controller watches Reservation CRDs and performs two types of reconciliation:
 
-1. **For new reservations** (no target host assigned):
-   - Calls Cortex for scheduling to find a suitable host
-   - Assigns the target host and marks the reservation as Ready
+**Placement** - Finds hosts for new reservations (calls scheduler API)
 
-2. **For existing reservations** (already have a target host):
-   - Validates that allocated VMs are still on the expected host
-   - Updates allocations if VMs have migrated or been deleted
-   - Requeues for periodic revalidation
+**Allocation Verification** - Tracks VM lifecycle on reservations. VMs take time to appear on a host after scheduling, so new allocations are verified more frequently via the Nova API for real-time status, while established allocations are verified via the Hypervisor CRD:
+- New VMs (within `committedResourceAllocationGracePeriod`, default: 15 min): checked via Nova API every `committedResourceRequeueIntervalGracePeriod` (default: 1 min)
+- Established VMs: checked via Hypervisor CRD every `committedResourceRequeueIntervalActive` (default: 5 min)
+- Missing VMs: removed from `Spec.Allocations` after Nova API confirms 404
 
 ### Usage API
 
-This API reports for a given project the total committed resources and usage per flavor group. For each VM, it reports whether the VM accounts to a specific commitment or PAYG. This assignment is deterministic and may differ from the actual Cortex internal assignment used for scheduling.
+For each flavor group `X` that accepts commitments, Cortex exposes three resource types:
+- `hw_version_X_ram` — RAM in units of the smallest flavor in the group (`HandlesCommitments=true`)
+- `hw_version_X_cores` — CPU cores derived from RAM via fixed ratio (`HandlesCommitments=false`)
+- `hw_version_X_instances` — instance count (`HandlesCommitments=false`)
 
+For each VM, the API reports whether it is accounted to a specific commitment or to PAYG. This assignment is deterministic and may differ from the internal assignment Cortex uses for scheduling.
@@ -11,10 +11,17 @@ import (
 
 type Config struct {
 
-	// RequeueIntervalActive is the interval for requeueing active reservations for verification.
+	// RequeueIntervalActive is the interval for requeueing active reservations for periodic verification.
 	RequeueIntervalActive time.Duration `json:"committedResourceRequeueIntervalActive"`
 	// RequeueIntervalRetry is the interval for requeueing when retrying after knowledge is not ready.
 	RequeueIntervalRetry time.Duration `json:"committedResourceRequeueIntervalRetry"`
+	// AllocationGracePeriod is the time window after a VM is allocated to a reservation
+	// during which it's expected to appear on the target host. VMs not confirmed within
+	// this period are considered stale and removed from the reservation.
+	AllocationGracePeriod time.Duration `json:"committedResourceAllocationGracePeriod"`
+	// RequeueIntervalGracePeriod is the interval for requeueing when VMs are in grace period.
+	// Shorter than RequeueIntervalActive for faster verification of new allocations.
+	RequeueIntervalGracePeriod time.Duration `json:"committedResourceRequeueIntervalGracePeriod"`
 	// PipelineDefault is the default pipeline used for scheduling committed resource reservations.
 	PipelineDefault string `json:"committedResourcePipelineDefault"`
 
@@ -68,6 +75,12 @@ func (c *Config) ApplyDefaults() {
 	if c.RequeueIntervalRetry == 0 {
 		c.RequeueIntervalRetry = defaults.RequeueIntervalRetry
 	}
+	if c.RequeueIntervalGracePeriod == 0 {
+		c.RequeueIntervalGracePeriod = defaults.RequeueIntervalGracePeriod
+	}
+	if c.AllocationGracePeriod == 0 {
+		c.AllocationGracePeriod = defaults.AllocationGracePeriod
+	}
 	if c.PipelineDefault == "" {
 		c.PipelineDefault = defaults.PipelineDefault
 	}
@@ -88,6 +101,8 @@ func DefaultConfig() Config {
 	return Config{
 		RequeueIntervalActive:                  5 * time.Minute,
 		RequeueIntervalRetry:                   1 * time.Minute,
+		RequeueIntervalGracePeriod:             1 * time.Minute,
+		AllocationGracePeriod:                  15 * time.Minute,
 		PipelineDefault:                        "kvm-general-purpose-load-balancing",
 		SchedulerURL:                           "http://localhost:8080/scheduler/nova/external",
 		ChangeAPIWatchReservationsTimeout:      10 * time.Second,