Skip to content

[2/n] [reconfigurator] test that Reconfigurator updates are in server-side topological order#10099

Open
sunshowers wants to merge 7 commits intomainfrom
sunshowers/spr/reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order
Open

[2/n] [reconfigurator] test that Reconfigurator updates are in server-side topological order#10099
sunshowers wants to merge 7 commits intomainfrom
sunshowers/spr/reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order

Conversation

@sunshowers
Copy link
Copy Markdown
Contributor

@sunshowers sunshowers commented Mar 19, 2026

Add a test which makes sure that when Reconfigurator runs an update, it is in topological order as defined by omicron-ls-apis (only considering server-side versioned APIs): for a particular producer/consumer pair, all instances of producer must be updated before the first instance of consumer.

The point of coordination between the two is a new file, dev-tools/ls-apis/tests/output/deployment-unit-dag.toml.

  • The ls-apis tool generates this file as a condensed form of api-manifest.toml with just the edges (and there's an expectorate test for it).
  • The test in nexus-reconfigurator-planning reads this file, simulates an update and builds a trace, then asserts that the update happened in topo order (and various other related properties).

This exercise found three violations:

  1. Host OS (including omicron-sled-agent) is updated before NTP, but omicron-sled-agent is a consumer of NTP. Addressed this by turning NTP into a client-side-versioned API (as discussed in an update watercooler).
  2. omicron-sled-agent depends on cockroach-admin-client during RSS. At RSS time, the system is known to be at a single, consistent version, and we mark this fact through a new rss-only dependency filter rule. (rss-only is similar to non-dag, except it doesn't check that versioned_how is client; there is a larger discussion to be had about how we may wish to define client-side versioning as applying to edges rather than nodes, but that is outside the scope of this work.)
  3. omicron-sled-agent depends on dns-service-client during RSS. We treat this the same as 2.

Depends on:

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the base branch from main to sunshowers/spr/main.reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order March 31, 2026 00:54
@sunshowers sunshowers changed the title [reconfigurator] test that Reconfigurator updates are in server-side topological order [2/n] [reconfigurator] test that Reconfigurator updates are in server-side topological order Mar 31, 2026
sunshowers added a commit that referenced this pull request Mar 31, 2026
Previously, the DNS server depended on its own client
(dns-service-client) for two things:

* `TransientServer` for a transient in-memory server.
* The `dnsadm` binary.

This isn't a real dependency in the sense that the DNS server doesn't
use the client to call other instances of itself in normal use, so there
was an exclusion for this in `api-manifest.toml`. This exclusion was
causing issues for #10099.

Move both of these out into their own crates, and drop the exclusion
from `api-manifest.toml`. There is a small, benign change to the output
of `ls-apis apis` (captured by
`dev-tools/ls-apis/tests/api_dependencies.out`). With `--show-deps`:

```
DNS Server (client: dns-service-client)
    [...]
    consumed by: omicron-sled-agent (omicron/sled-agent) via 2 paths
        via path 1: path+file:///home/rain/dev/oxide/omicron/sled-agent#omicron-sled-agent@0.1.0
        via path 2: path+file:///home/rain/dev/oxide/omicron/dns-server/transient#transient-dns-server@0.1.0
        via path 2: path+file:///home/rain/dev/oxide/omicron/sled-agent#omicron-sled-agent@0.1.0
```
@sunshowers sunshowers changed the base branch from sunshowers/spr/main.reconfigurator-test-that-reconfigurator-updates-are-in-server-side-topological-order to main March 31, 2026 17:14
Created using spr 1.3.6-beta.1
@sunshowers sunshowers marked this pull request as ready for review March 31, 2026 19:44
@davepacheco
Copy link
Copy Markdown
Collaborator

The NTP issue is covered by #8769, which I believe will be resolved with this change.

Copy link
Copy Markdown
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I've reviewed everything except the new test itself here.

// The NTP Admin API is client-side-versioned and currently frozen. We
// allow trivial changes to go through. If we did not, we would need to
// unfreeze the API and bump the version number for trivial changes.
.allow_trivial_changes_for_latest(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that:

  • we allow the user to update the OpenAPI document for a blessed version as long as it's trivial, or
  • we allow the generated OpenAPI document to diverge from the corresponding blessed one as long as the changes are trivial?

The first one seems better for the reasons we've previously discussed not wanting to accumulate delta (it makes it harder for future people to see what their changes were) but I guess either is okay right now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the latter, matching the other client-side API versions. I would really like to preserve the property that blessed versions are immutable if at all possible.

One possibility here for future work is introducing something like a minor version (1.1, 1.2, etc), where the minor versions are always wire compatible with the corresponding major version (1.0, etc). Then, we clients can be instructed to send 1.0 in their api-version header rather than 1.1, etc. I haven't fully thought this through so it's possible there are issues with this approach.

Comment thread ntp-admin/api/src/lib.rs
Comment on lines +10 to +11
// Do not create new versions of this client-side versioned API.
// https://github.com/oxidecomputer/omicron/issues/9290
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do feel like we discussed this comment in the context of some other API and settled on the text you've got here. Is that right?

I don't personally love it because it's not quite true that you can't create new versions. If we need a new endpoint for example we do have options like adding a new version and not using it until the next release or tolerate the endpoint failing until the next release, etc.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches the language elsewhere, yeah. I think the goal is mostly to deter people who aren't us from adding new versions.

Comment thread nexus/inventory/src/builder.rs Outdated
Comment thread dev-tools/ls-apis-shared/src/lib.rs Outdated
Comment thread dev-tools/ls-apis/src/bin/ls-apis.rs
Comment thread dev-tools/ls-apis/api-manifest.toml
Comment on lines +498 to +514
[[dependency_filter_rules]]
ancestor = "omicron-sled-agent"
client = "cockroach-admin-client"
evaluation = "rss-only"
note = """
Sled Agent only uses the Cockroach Admin client during RSS, which we don't care
about for the purpose of upgrade.
"""

[[dependency_filter_rules]]
ancestor = "omicron-sled-agent"
client = "dns-service-client"
evaluation = "rss-only"
note = """
Sled Agent only uses the DNS service client during RSS, which we don't care
about for the purpose of upgrade.
"""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also discussed putting RSS into its own package so that we could eliminate the risk that some other part of sled agent grows these dependencies. Do you have any idea how much work that is? Seems at least worth filing an issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked too deeply into this. I'll file an issue.

Comment thread dev-tools/ls-apis/api-manifest.toml Outdated
# The host OS includes Sled Agent, Propolis, and all the components that get
# bundled into the switch zone.
[[deployment_units]]
id = "host_os"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a second field here makes it pretty likely people will copy/paste and forget to change either the id or label.

Out of curiosity, why not use the label for this (or use this as the label)?

If we want to preserve both as distinct things:

  • Should we make deployment_units a map instead of an array? (I wonder if we could make it an iddqd map with id as the key?)
  • Should we add a check that the ids and labels are both unique?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it (for now) to make DeploymentUnitName the primary ID, though it does mean that one of the IDs is the string "Clickhouse (single-node) / Clickhouse Server (multi-node) / Clickhouse Keeper (multi-node)".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shortened this name to just ClickHouse.

Comment thread dev-tools/ls-apis/src/api_metadata.rs Outdated
}

/// Maps a `ZoneKind` to its omicron-ls-apis deployment unit ID.
fn zone_kind_to_deployment_unit(kind: ZoneKind) -> DeploymentUnitId {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This feels brittle but I'm not sure what to do about it.

I guess we could make an enum of deployment unit ids in ls-apis-shared and expect that the API manifest uses one of those?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this over video -- the test ensures that all zone kinds and deployment unit kinds are covered. We could also turn DeploymentUnitId or DeploymentUnitName into an enum which would ensure that the deployment unit is part of an allowlisted set, but that has the tradeoff that adding a new deployment unit would require editing Rust code.

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants