Integrate deployment metadata service for locking and state #4856

Open
shreyas-goenka wants to merge 28 commits into main from shreyas-goenka/deployment-metadata-service

@shreyas-goenka shreyas-goenka commented Mar 26, 2026

Summary

Integrates the Deployment Metadata Service (DMS) as an alternative backend for deployment locking and resource state management. Gated behind DATABRICKS_BUNDLE_MANAGED_STATE=true.

When enabled:

  • Locking: Uses server-side versioned locks (with heartbeats) instead of workspace filesystem lock files
  • State: Reads/writes resource state via the DMS API (ListResources / CreateOperation) instead of local state files
  • Operations: Reports each resource operation (create, update, delete) inline to the server with resource state
  • Git provenance: Records git_info (origin_url, branch, commit) on the deployment version — the same values the CLI writes to metadata.json. Server support added in databricks-eng/universe#2009991.

Key implementation details

  • DeploymentLock interface (lock.go) with two implementations: workspaceFilesystemLock (existing behavior) and metadataServiceLock (DMS)
  • resolveDeploymentID reads deployment ID from workspace resources.json, or generates a new UUID for fresh deployments (written only after CreateDeployment succeeds)
  • LoadStateFromDMS populates the state DB from ListResources instead of reading local files
  • PushResourcesState is a no-op with DMS (state is persisted per-operation to the server)
  • --plan flag and bind/unbind are not supported with DMS
  • Heartbeat goroutine keeps the lock alive during long deployments
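
The lock interface and heartbeat described above could look roughly like the following. This is a minimal sketch, not the code in lock.go: only the `DeploymentLock` name comes from this PR, while the method set, the `heartbeatLock` type, the `beat` callback, and `countHeartbeats` are hypothetical names chosen for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// DeploymentLock uses the interface name from the PR description; the method
// set shown here is an assumption for illustration.
type DeploymentLock interface {
	Acquire(ctx context.Context) error
	Release(ctx context.Context) error
}

// heartbeatLock is a hypothetical stand-in for a server-side lock. The beat
// callback represents whatever request renews the lock on the server.
type heartbeatLock struct {
	interval time.Duration
	beat     func(ctx context.Context) error
	cancel   context.CancelFunc
	done     chan struct{}
}

// Acquire starts a goroutine that calls beat on every tick until Release.
func (l *heartbeatLock) Acquire(ctx context.Context) error {
	hbCtx, cancel := context.WithCancel(ctx)
	l.cancel = cancel
	l.done = make(chan struct{})
	go func() {
		defer close(l.done)
		ticker := time.NewTicker(l.interval)
		defer ticker.Stop()
		for {
			select {
			case <-hbCtx.Done():
				return
			case <-ticker.C:
				if err := l.beat(hbCtx); err != nil {
					// A failed heartbeat is only logged in this sketch.
					fmt.Println("heartbeat failed:", err)
				}
			}
		}
	}()
	return nil
}

// Release stops the heartbeat goroutine and waits for it to exit.
func (l *heartbeatLock) Release(ctx context.Context) error {
	if l.cancel == nil {
		return nil // never acquired, or already released
	}
	l.cancel()
	<-l.done
	l.cancel = nil
	return nil
}

// countHeartbeats holds the lock for holdMs milliseconds and returns how many
// heartbeats were sent in that time. intervalMs must be positive.
func countHeartbeats(intervalMs, holdMs int) int64 {
	var beats atomic.Int64
	var lock DeploymentLock = &heartbeatLock{
		interval: time.Duration(intervalMs) * time.Millisecond,
		beat: func(ctx context.Context) error {
			beats.Add(1)
			return nil
		},
	}
	ctx := context.Background()
	if err := lock.Acquire(ctx); err != nil {
		return 0
	}
	time.Sleep(time.Duration(holdMs) * time.Millisecond) // simulated deploy
	if err := lock.Release(ctx); err != nil {
		return 0
	}
	return beats.Load()
}

func main() {
	fmt.Println("heartbeats sent:", countHeartbeats(10, 100))
}
```

Because `Release` waits for the goroutine to exit, no heartbeat can be sent after the lock is released. How the real implementation reacts to a failed heartbeat is not covered here.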

Test plan

  • Acceptance tests under acceptance/bundle/dms/ covering: deploy with resource creation, sequential deploys with create/delete, plan + summary, deploy errors, and lock release errors
  • Unit test for planActionToOperationAction mapping
  • E2E testing against staging workspace (32/32 passing)
  • E2E on e2-dogfood: deployed a git-backed bundle and confirmed the version's git_info round-trips through the DMS service:
"git_info": {
    "branch": "my-test-branch",
    "commit": "3bae783bc0dc303bc37a2cdfd0b2bebeeaf11e65",
    "origin_url": "https://github.com/databricks/cli-gitinfo-e2e-test.git"
}

Update: provenance + main merge

  • Merged latest main (SDK v0.141.0, Go 1.26 toolchain). Reconciled the lock-package refactor and fixed the DMS state round-trip against main's WAL-based StateDB: LoadStateFromDMS now uses OpenWithData (populates the resource-key→ID index), and the inline operation reporter reads the resource ID via GetResourceID and the state from the just-applied value, so operations carry the real resource_id and state.
  • Record git_info (origin_url, branch, commit) on the deployment version — same provenance as metadata.json (server support: databricks-eng/universe#2009991).
  • Record deployment_id/version_id on each job and pipeline's deployment block. A deploy-phase mutator (AnnotateDeploymentVersion) stamps them after the lock is acquired. version_id changes every deploy, so an ignore_local_changes rule keeps it from triggering an update on its own; real updates still send the current value via the full-config Reset/Edit.

Verified end-to-end on e2-dogfood (git_info + deployment_id/version_id round-trip) and in acceptance/bundle/dms (sequential-deploys shows an unchanged job is skipped when only its version_id bumps).

eng-dev-ecosystem-bot commented Mar 26, 2026

Commit: 06061ae

Run: 26959039821

Comment thread bundle/direct/bundle_apply.go Outdated

// Report skip actions to the metadata service. On initial registration,
// these are recorded as INITIAL_REGISTER operations.
if action == deployplan.Skip && b.OperationReporter != nil {
Contributor Author

move the initial registration up

@@ -0,0 +1,6 @@
Local = true
Cloud = false
Contributor Author

The service needs to roll out to prod before we enable this on cloud.

@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/deployment-metadata-service branch 11 times, most recently from 4bbbe9c to 7b26260 Compare April 14, 2026 21:15
assert.True(t, ok)
assert.Equal(t, tmpdms.VersionTypeDestroy, vt)

_, ok = goalToVersionType(GoalBind)
Contributor Author

support can be added as a followup.

@shreyas-goenka shreyas-goenka marked this pull request as ready for review April 15, 2026 00:08
@shreyas-goenka shreyas-goenka requested review from andrewnester and pietern and removed request for andrewnester and pietern April 15, 2026 00:09
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/deployment-metadata-service branch from 79b930a to 62719b0 Compare May 7, 2026 06:59
Keep resources.json maintained alongside the DMS deployment so users
have a backward path if they hit issues with the DMS-backed flow. Move
DMS-specific bookkeeping (the deployment_id that ties the bundle to a
server-side deployment record) into a sibling managed_service.json so
the two concerns stay cleanly separated.
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/deployment-metadata-service branch 2 times, most recently from cbdb0f0 to b25b325 Compare May 8, 2026 08:10
A single async sender goroutine drains a buffered channel of operation
events; CRUD workers push onto the channel and continue. When the buffer
fills (capacity matches the worker pool), workers block on the send and
naturally back off — this is the only intended source of backpressure
on the worker pool.

Reporting is best-effort: a DMS API failure is logged and the sender
keeps draining. The deploy is no longer aborted when the audit-log
write fails. On a hard process crash, at most ~10 buffered events can
be lost (channel capacity).

Release() drains the reporter before completing the version so the audit
trail is as complete as possible on a clean shutdown.
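
The reporting design in this commit message (one sender goroutine, a buffered channel sized to the worker pool, best-effort sends, and a drain on release) can be sketched as follows. This is an illustration under assumptions, not the PR's code: `Event`, `Reporter`, `NewReporter`, `Report`, `Drain`, and `simulateDeploy` are hypothetical names, and the `send` callback stands in for the DMS API call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Event is a hypothetical operation record pushed by a CRUD worker.
type Event struct {
	ResourceKey string
	Action      string
}

// Reporter forwards events to send from a single goroutine.
type Reporter struct {
	events chan Event
	send   func(Event) error
	done   chan struct{}
	sent   int // written only by the sender goroutine
	failed int // written only by the sender goroutine
}

// NewReporter starts the sender. capacity bounds how many events can be
// queued before Report blocks, which is the backpressure on workers.
func NewReporter(capacity int, send func(Event) error) *Reporter {
	r := &Reporter{
		events: make(chan Event, capacity),
		send:   send,
		done:   make(chan struct{}),
	}
	go func() {
		defer close(r.done)
		for ev := range r.events {
			if err := r.send(ev); err != nil {
				// Best-effort: log the failure and keep draining.
				r.failed++
				fmt.Printf("report %s %s failed: %v\n", ev.Action, ev.ResourceKey, err)
				continue
			}
			r.sent++
		}
	}()
	return r
}

// Report enqueues an event and blocks only when the buffer is full.
func (r *Reporter) Report(ev Event) { r.events <- ev }

// Drain closes the queue and waits until every queued event was attempted.
// Report must not be called after Drain.
func (r *Reporter) Drain() (sent, failed int) {
	close(r.events)
	<-r.done
	return r.sent, r.failed
}

// simulateDeploy runs workers goroutines that each report perWorker events.
// Every failEvery-th send fails (0 disables failures).
func simulateDeploy(workers, perWorker, failEvery int) (sent, failed int) {
	calls := 0 // touched only by the sender goroutine
	r := NewReporter(workers, func(ev Event) error {
		calls++
		if failEvery > 0 && calls%failEvery == 0 {
			return errors.New("simulated API failure")
		}
		return nil
	})
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				key := fmt.Sprintf("resources.jobs.job_%d_%d", w, i)
				r.Report(Event{ResourceKey: key, Action: "CREATE"})
			}
		}(w)
	}
	wg.Wait()
	return r.Drain()
}

func main() {
	sent, failed := simulateDeploy(4, 5, 5)
	fmt.Printf("sent=%d failed=%d\n", sent, failed)
}
```

In this sketch a failed send never stops the loop, so a deploy would continue even if every report failed. Events still in the buffer are lost only if the process exits before `Drain` returns.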
pavloKozlov and others added 7 commits June 2, 2026 11:33
…5406)

## Why

DMS-backed bundle deployments (run with
`DATABRICKS_BUNDLE_MANAGED_STATE=true DATABRICKS_BUNDLE_ENGINE=direct`)
never set `display_name` when creating the deployment record, so the
field is stored as `null`.

## What

Populate `DisplayName` from `bundle.Config.Bundle.Name` (i.e. the
`bundle.name` from `databricks.yml`) when issuing `CreateDeployment`.
This matches the human-readable label users already see in `databricks
bundle validate`.

## Tests

Existing `acceptance/bundle/dms/*` tests record the `CreateDeployment`
request body via `print_requests.py`; their `output.txt` files
regenerate to assert the new `display_name` field.

This pull request and its description were written by Isaac.
The deployment metadata service now accepts git provenance on a version
(origin_url, branch, commit) per databricks-eng/universe#2009991. Record
it on CreateVersion using the same values the CLI writes to metadata.json.
# Conflicts:
#	bundle/deploy/lock/acquire.go
#	bundle/statemgmt/state_push.go
#	cmd/bundle/utils/process.go
#	libs/testserver/fake_workspace.go
#	libs/testserver/server.go
… for determinism

Main's direct engine applies resources concurrently, so the order of recorded
CreateOperation requests varied between runs. Add --sort to print_requests.py
in the multi-resource DMS tests to make the recorded output deterministic.
Merging main changed several APIs the DMS code predates:
- WorkspaceClient now takes a ctx (workspace_filesystem.go).
- StateDB keeps a separate resource-key->ID index (stateIDs) that is
  authoritative during writes; Data.State is only reconstructed when the WAL
  is merged. LoadStateFromDMS wrote Data.State directly, leaving the index
  empty, so deletes failed with "missing in state". It now builds the
  database and calls OpenWithData, which populates the index.
- The inline operation reporter read the freshly-created resource ID and
  state from Data.State (stale during a deploy). It now reads the ID from
  GetResourceID and the state from the value just applied, so operations
  carry the real resource_id and state and the server round-trips them.
The SDK's JobDeployment/PipelineDeployment now carry deployment_id and
version_id (used to look up deployment metadata in the DMS). Stamp them onto
each job and pipeline so every resource records the deployment and the version
that produced it.

The IDs are only known after the deployment lock is acquired, so a new
deploy-phase mutator (AnnotateDeploymentVersion) sets them, running after the
lock and before the plan. The version is plumbed onto the bundle alongside the
deployment ID.

version_id changes on every deploy, so an ignore_local_changes rule keeps it
from triggering an update on its own; a real update still sends the current
version_id via the full-config Reset/EditPipeline. (Also adjusts isAborted to
errors.AsType for the Go 1.26 linter pulled in by the merge.)
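
The effect of ignoring `version_id` when deciding whether a resource changed can be illustrated with a small comparison. This is a sketch of the semantics only: the CLI's actual `ignore_local_changes` rule is not reproduced here, and `Deployment`, `Job`, and `needsUpdate` are hypothetical types and names.

```go
package main

import "fmt"

// Deployment is a hypothetical, reduced form of a resource's deployment block.
type Deployment struct {
	DeploymentID string
	VersionID    string
}

// Job is a hypothetical, reduced job configuration.
type Job struct {
	Name       string
	Deployment Deployment
}

// needsUpdate reports whether local differs from remote in any field other
// than version_id. Both arguments are copies, so clearing the field here
// does not change the caller's values.
func needsUpdate(remote, local Job) bool {
	remote.Deployment.VersionID = ""
	local.Deployment.VersionID = ""
	return remote != local
}

func main() {
	remote := Job{Name: "test_job", Deployment: Deployment{DeploymentID: "d-1", VersionID: "1"}}

	// Second deploy: only version_id moved, so no update is planned.
	bumped := remote
	bumped.Deployment.VersionID = "2"
	fmt.Println("version bump only:", needsUpdate(remote, bumped))

	// A real change still triggers an update, which then carries version 2.
	renamed := bumped
	renamed.Name = "test_job_v2"
	fmt.Println("name changed:", needsUpdate(remote, renamed))
}
```

When an update does happen, the full local configuration is sent, so the resource ends up with the current `version_id` as described above.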
…ion_id

Operations now carry the resource_id and full state (including the deployment
block with deployment_id/version_id), and the out.test.toml dump format changed
on main. sequential-deploys now shows the version_id rule working: deploy 2
bumps the version but the unchanged test_job records no operation.
## Changes
Set `display_name` on the DMS deployment version, using the bundle name
— the same value already recorded on the deployment.

The `Version` proto has a `display_name` field, but the `CreateVersion`
request never populated it, so every version came back with a null
`display_name` even though the deployment had one. This stamps it for
parity.

## Why
`display_name` is set on the deployment (from the bundle name) but was
missing on each version, leaving version records without a
human-readable label. Filling it in keeps deployment and version
metadata consistent.

## Tests
Updated the `bundle/dms` acceptance outputs and confirmed they pass.

This pull request and its description were written by Isaac, an AI
coding agent.
## Changes
Record the bundle target deployment mode on each DMS version. Adds a
`deployment_mode` field (and the `DEPLOYMENT_MODE_DEVELOPMENT` /
`DEPLOYMENT_MODE_PRODUCTION` enum) to `tmpdms.Version`, and sets it in
the `CreateVersion` request from `bundle.mode`.

Not set on the deployment: `Deployment.deployment_mode` is derived
server-side from the most recent version's mode (output-only), so the
CLI only sets it on the version. A target with no `mode` maps to an
empty value, which is omitted (the server treats it as unspecified) — we
don't fabricate a default.
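
The mapping described above could be sketched like this. The enum values come from this description; the Go type and function names (`DeploymentMode`, `Version`, `modeToDeploymentMode`, `versionJSON`) and the JSON tags are assumptions for illustration rather than the actual `tmpdms` definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeploymentMode holds the enum values named in the PR description.
type DeploymentMode string

const (
	DeploymentModeDevelopment DeploymentMode = "DEPLOYMENT_MODE_DEVELOPMENT"
	DeploymentModeProduction  DeploymentMode = "DEPLOYMENT_MODE_PRODUCTION"
)

// Version is a hypothetical, reduced request body. omitempty drops an unset
// mode from the JSON so that no default is fabricated.
type Version struct {
	DisplayName    string         `json:"display_name,omitempty"`
	DeploymentMode DeploymentMode `json:"deployment_mode,omitempty"`
}

// modeToDeploymentMode maps a bundle target mode to the enum. Any other
// value, including an unset mode, maps to the empty value.
func modeToDeploymentMode(mode string) DeploymentMode {
	switch mode {
	case "development":
		return DeploymentModeDevelopment
	case "production":
		return DeploymentModeProduction
	default:
		return ""
	}
}

// versionJSON renders the request body for a bundle name and target mode.
func versionJSON(name, mode string) string {
	body, err := json.Marshal(Version{
		DisplayName:    name,
		DeploymentMode: modeToDeploymentMode(mode),
	})
	if err != nil {
		return ""
	}
	return string(body)
}

func main() {
	fmt.Println(versionJSON("my-bundle", "development"))
	fmt.Println(versionJSON("my-bundle", ""))
}
```

With a target that sets no mode, the `deployment_mode` key is absent from the serialized body, which matches the omitted-when-empty behavior described above.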

## Why
The SDK's `bundle.Version` already carries `deployment_mode` ("captured
at the time of this version"), but the CLI never populated it, so every
version recorded a null mode. This stamps it so each version records
whether it was a development or production deployment.

## Tests
Added a unit test for the mode mapping (development / production /
unset). The `bundle/dms` acceptance outputs are unchanged because those
targets don't set a mode. Verified live against a workspace: a `mode:
development` target now records `deployment_mode:
DEPLOYMENT_MODE_DEVELOPMENT` on the created version.

This pull request and its description were written by Isaac, an AI
coding agent.