
Add clustered migration sync for shared disks (SYNCING barrier) #403

Open
fabi200123 wants to merge 1 commit into cloudbase:master from fabi200123:cor-707

Conversation

@fabi200123 (Contributor):

Introduce TASK_STATUS_SYNCING and TASK_TYPES_REQUIRING_CLUSTER_SYNC (DEPLOY_TRANSFER_DISKS, SHUTDOWN_INSTANCE) so multi-instance transfers with base_transfer_action.clustered=True wait for all peer tasks before marking COMPLETED and advancing dependents.

  • Add clustered boolean on base_transfer_action (DB migration 024)
  • Plumb clustered through create_instances_transfer, REST transfers API, deployment creation (inherits from parent transfer)
  • On task_completed: set SYNCING when barrier applies; when all peers are SYNCING, finalize (for deploy: dedupe volumes_info by disk_id, leader gets replicate_disk_data=True, followers False)
  • ReplicateDisksTask: skip provider replicate_disks for volumes with replicate_disk_data=False and merge back in export disk order
  • On set_task_error: abort peer tasks stuck in SYNCING for the same type

Volumes schema already allows extra properties; replicate_disk_data is consumed by replication only (default True preserves behavior).
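The barrier described above can be sketched roughly as follows. This is a minimal illustration only, not the actual Coriolis implementation: the `Task` class and the `on_task_completed` signature are hypothetical, and only the status/type constants come from the PR description.

```python
from dataclasses import dataclass

TASK_STATUS_SYNCING = "SYNCING"
TASK_STATUS_COMPLETED = "COMPLETED"
TASK_TYPES_REQUIRING_CLUSTER_SYNC = (
    "DEPLOY_TRANSFER_DISKS", "SHUTDOWN_INSTANCE")


@dataclass
class Task:
    # Hypothetical stand-in for a Coriolis task record.
    task_type: str
    status: str = "RUNNING"


def on_task_completed(task, clustered, peer_tasks):
    """Barrier sketch: hold a clustered task in SYNCING until every peer
    task of the same type has also reached SYNCING, then finalize the
    whole group as COMPLETED."""
    if not clustered or task.task_type not in TASK_TYPES_REQUIRING_CLUSTER_SYNC:
        task.status = TASK_STATUS_COMPLETED
        return task.status

    task.status = TASK_STATUS_SYNCING
    group = [t for t in peer_tasks if t.task_type == task.task_type] + [task]
    if all(t.status == TASK_STATUS_SYNCING for t in group):
        # Every peer reached the barrier: release them all together.
        for t in group:
            t.status = TASK_STATUS_COMPLETED
    return task.status
```

The first instance to finish parks in SYNCING; the last one flips the whole group to COMPLETED, which is when dependents are advanced.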

```diff
         notes=None, user_scripts=None,
-        clone_disks=True, skip_os_morphing=False):
+        clone_disks=True, skip_os_morphing=False,
+        clustered=None):
```
Contributor:

Is this arg really needed here? You can just set it server-side.
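Defaulting the flag server-side, as suggested, could look something like this; the function name and inference rule here are hypothetical, just to illustrate the idea:

```python
def resolve_clustered_flag(clustered, instances):
    """Hypothetical server-side default: if the client did not pass the
    flag explicitly, infer it from whether the transfer covers more
    than one instance."""
    if clustered is None:
        return len(instances) > 1
    return bool(clustered)
```

With this, the REST layer would not need to expose `clustered` at all unless a caller wants to override the inferred value.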


```python
def _get_disk_identity(disk_info):
    disk_info = disk_info or {}
    for key in ("native_id", "uuid", "path", "name", "disk_id", "id"):
```
Contributor:

I think we should much rather standardize how volumes_info looks (via schema), if it's not defined already. disk_id should always be the source disk identifier, and shareable should always be the property you're looking for; the provider should make these readily available to core, and core shouldn't carry extra logic on top just to work out which disks are shareable and how to identify them.

All providers should already return disk_id as the source disk ID, but please double-check. Either way, it'd be best to standardize/enforce this.

```python
    for key in ("native_id", "uuid", "path", "name", "disk_id", "id"):
        ident = _normalize_identity(disk_info.get(key))
        if ident:
            if key in ("disk_id", "id") and ident.isdigit():
```
Contributor:

Is this check done because VMware returns device keys instead of actual IDs? If that's the case, please handle this provider-side, so it returns first-class disk IDs instead.

```python
            continue
        owner_task = task_by_instance.get(owner)
        if (
            owner_task
```
Contributor:

NIT: Can you please fix this ugliness? :D There's an extra newline above that bothers me.

```python
    shared disks. Once the owner replicate task completes, any
    waiting (SYNCING) replicate tasks are moved back to SCHEDULED
    so they can continue their normal flow.
    """
```
Contributor:

So as far as I understand this, this will basically block the rest of the cluster while the main one gets transferred. That's fine for shared disks, but what I proposed initially was to have them all running in parallel once the syncing tasks (which were not meant to be REPLICATE_DISKS, btw) are done. You can add SYNCING on this task alone while waiting for the rest of the tasks to complete (tasks like DEPLOY_REPLICA_DISKS, SHUTDOWN_INSTANCE); that's also feasible, but please set the volumes_info accordingly instead of blocking all of them.

What I originally envisioned was something like this:
instance1 has: root_disk1, shared_disk1;
instance2 has: root_disk2, shared_disk1.

What's the point of blocking instance2 while instance1 is replicating? When you can have the following:
instance1 replicates root_disk1, shared_disk1;
instance2 replicates root_disk2, skips shared_disk1, based on the volumes_info that you can set up before launching the replicate_disks task.
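The parallel scheme proposed here can be sketched as a simple pre-pass over the per-instance volumes before the replicate_disks tasks launch. The function name is hypothetical; only the `disk_id` and `replicate_disk_data` fields come from this PR:

```python
def assign_shared_disks(volumes_by_instance):
    """Sketch of the proposed pre-pass: for each disk_id seen more than
    once across instances, let the first instance that lists it (the
    leader) replicate it, and mark it skippable for all others."""
    seen = set()
    for instance, volumes in volumes_by_instance.items():
        for vol in volumes:
            disk_id = vol["disk_id"]
            # Leader replicates; followers skip the shared disk.
            vol["replicate_disk_data"] = disk_id not in seen
            seen.add(disk_id)
    return volumes_by_instance
```

With volumes_info set up this way, both instances' replicate tasks can run in parallel: instance1 transfers root_disk1 and shared_disk1, while instance2 transfers only root_disk2.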

Contributor:

Please let me know if there's anything preventing us from doing a parallel sync, with shared_disks being transferred by only one of the instances.

```python
            if key in ("disk_id", "id") and ident.isdigit():
                continue
            return ident
    return None
```
Contributor:

Did we really have to have these duplicated?
