Skip to content

Clean up and remove stale code#10518

Open
andrewjstone wants to merge 4 commits into
wicket-multirack-join-supportfrom
bootstrap-agent-client-cleanup
Open

Clean up and remove stale code#10518
andrewjstone wants to merge 4 commits into
wicket-multirack-join-supportfrom
bootstrap-agent-client-cleanup

Conversation

@andrewjstone
Copy link
Copy Markdown
Contributor

@andrewjstone andrewjstone commented May 29, 2026

Clean up the bootstrap agent in preparation for multirack support. This
ended up requiring a few other significant changes in other crates.

As a first step, we should recognize that the bootstrap agent dropshot
API is never going to support TLS. This is why we have sprockets
running over the bootstrap network in the first place. All cross sled
interactions should use sprockets for security. However, we need to
bootstrap the system and so we allow wicketd to talk to the local
bootstrap agent on it's own sled. This occurs over the lockstep API.
With some further changes, we were able to remove the bootstrap agent
API altogether. Additionally, some sprockets related names were changed
to enhance clarity.

For historical reasons, we used Baseboard throughout wicket, but with
the changes made to remove the bootstrap agent dropshot API, we only
have access to BaseboardIds, as that's what the majority of the system
works with. Therefore, we migrated many uses of Baseboard to use
BaseboardId instead. This is a much more straightforward type to use.

As part of the change to using BaseboardIds in wicketd, we wanted to
remove optionality for it. The switch zone should always know its own
Baseboard through the injected file, and so now we fail immediately if
the file does not exist.

In order to get a non-optional BaseboardId we wanted to make
conversion from Baseboard to BaseboardId infallible. I had started
doing this by removing the Unknown variant from Baseboard and
reworking all users. When I had finished I had discovered that this
changes a few APIs in a non-backwards compatible manner. As this PR is
already very large, I didn't want to make that change. So I went back
and re-added the Unknown variant, panicking if it ever gets used.
I will clean this up in a follow up that also removes the bootstore
which is a primary user. The bootstore itself cannot ever have an
Unknown variant in production, so this is fine for now.

To fix a chicken and egg problem where the now gone bootstrap agent
api was needed for wicketd to learn its bootstrap-agent-lockstep-api
address, we now plumb that address through SMF directly, since the
bootstrap agent already knows it when creating the switch zone.

Since rack reset doesn't actually work, I also went ahead and removed
all the related code. This simplifies things immensely. We can add back
working reset functionality in the future, that will work a lot more
like the clean slate script.

There was also an old, unused UpdateManager, that was only called from
an unused bootstrap agent API. That has also been removed.

Clean up the bootstrap agent in preparation for multirack support. This
ended up requiring a few other significant changes in other crates.

As a first step, we should recognize that the bootstrap agent dropshot
API is never going to support TLS. This is why we have sprockets
running over the bootstrap network in the first place. All cross sled
interactions should use sprockets for security. However, we need to
bootstrap the system and so we allow wicketd to talk to the local
bootstrap agent on it's own sled. This occurs over the  lockstep API.
With some further changes, we were able to remove the bootstrap agent
API altogether. Additionally, some sprockets related names were changed
to enhance clarity.

For historical reasons, we used `Baseboard` throughout wicket, but with
the changes made to remove the bootstrap agent dropshot API, we only
have access to `BaseboardId`s, as that's what the majority of the system
works with. Therefore, we migrated many uses of `Baseboard` to use
`BaseboardId` instead. This is a much more straightforward type to use.

As part of the change to using `BaseboardId`s in wicketd, we wanted to
remove optionality for it. The switch zone should always know its own
Baseboard through the injected file, and so now we fail immediately if
the file does not exist.

In order to get a non-optional `BaseboardId` we wanted to make
conversion from `Baseboard` to `BaseboardId` infallible. I had started
doing this by removing the `Unknown` variant from `Baseboard` and
reworking all users. When I had finished I had discovered that this
changes a few APIs in a non-backwards compatible manner. As this PR is
already very large, I didn't want to make that change. So I went back
and re-added the `Unknown` variant, panicking if it ever gets used.
I will clean this up in a follow up that also removes the bootstore
which is a primary user. The bootstore itself cannot ever have an
`Unknown` variant in production, so this is fine for now.

To fix a chicken and egg problem where the now gone bootstrap agent
api was needed for wicketd to learn its bootstrap-agent-lockstep-api
address, we now plumb that address through SMF directly, since the
bootstrap agent already knows it when creating the switch zone.

Since rack reset doesn't actually work, I also went ahead and removed
all the related code. This simplifies things immensely. We can add back
working reset functionality in the future, that will work a lot more
like the clean slate script.

There was also an old, unused `UpdateManager`, that was only called from
an unused bootstrap agent API. That has also been removed.
@andrewjstone andrewjstone requested a review from jgallagher May 29, 2026 21:48
@andrewjstone
Copy link
Copy Markdown
Contributor Author

This needs to be tested on a racklette and on a4x2.

@andrewjstone andrewjstone force-pushed the bootstrap-agent-client-cleanup branch from 3dc634a to 2723254 Compare May 29, 2026 21:53
@andrewjstone andrewjstone self-assigned this May 30, 2026
Copy link
Copy Markdown
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic. It was cathartic just reading over some of these deletions.

Comment thread sled-hardware/types/src/lib.rs Outdated
Gimlet { identifier: String, model: String, revision: u32 },

// XXX: WARNING: All uses of this variant have been removed from the
// code base.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might word this slightly more strongly: "This variant must never be instantiated. All remaining uses of it are in match statements that WILL PANIC if a value of this variant is encountered." or something like that.

}
// Query the local bootstrap agent for all known peers
let client = bootstrap_agent_lockstep_client::Client::new(
&format!("http://[{bootstrap_agent_lockstep_address}]"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right; this is a SocketAddr now so it'll include its own brackets and port:

Suggested change
&format!("http://[{bootstrap_agent_lockstep_address}]"),
&format!("http://{bootstrap_agent_lockstep_address}"),

log, "Failed to get baseboard IDs";
"err" => #%err,
);
continue;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this continue is right: it means we'll skip the sleep below and immediately retry on failure (turning this into a hot loop).

Separately: what should we do with sleds if we fail here? The older code had the nice-ish property that if you failed to talk to some sleds, that would be reflected by updating sleds to only include the ones we successfully contacted.

// We don't know our own baseboard, which is a very questionable
// state to be in! For now, we will hard-code the possibly
// locations where we could be running: scrimlets can only be in
// cubbies 14 or 16, so we refuse to update either of those.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot to love in all the removals of this PR, but this might be at the top of my list. Thank you for getting rid of this thing I was embarrassed about even as I wrote it.

serde_json::from_str(&baseboard_file)
.map_err(|e| CmdError::Failure(anyhow!(e)))?;

// TODO-correctness `Baseboard::unknown()` is slated for removal
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"slated for removal" as of May 2023 😢

Glad it's finally happening!

"baseboard-identifier",
"baseboard-model",
"baseboard-revision",
"boot-storage-unit",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we use this one here

"bootstrap-agent-lockstep-address",
"astring",
&format!(
"[{}]:{BOOTSTRAP_AGENT_LOCKSTEP_PORT}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only because I just got bit by this somewhere else... this is correct but would be More Obviously Correct as

&SocketAddr::new(
    self.inner.global_zone_bootstrap_ip,
    BOOTSTRAP_AGENT_LOCKSTEP_PORT,
).to_string()

pub struct ServiceManagerInner {
log: Logger,
global_zone_bootstrap_ip: Ipv6Addr,
global_zone_bootstrap_link_local_address: Ipv6Addr,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should know this but if I did I've forgotten - what's the difference between these two IPs?

&SwitchService::Lldpd { baseboard: Baseboard::Unknown },
&SwitchService::Lldpd {
// TODO: Why is this unknown here?
// This method seems to only get called from an HTTP handler.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and also: that HTTP handler and this method will be going away under #10167, so I'd vote for doing no work here

"baseboard_id" => %baseboard_id
);
found = Some((ip, stream));
break;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will abort any remaining tasks in set, which introduces some async cancellation points. I think that's probably fine since these are only trying to establish connections? But wanted to note it in case it seems not fine or if we want to document that this is an intentional cancellation choice.

"err" => #%err,
let mut found = None;
while let Some(res) = set.join_next().await {
if let Ok((ip, Ok(stream))) = res {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This discards errors - can we at least log them?

if let Ok((ip, Ok(stream))) = res {
// Safety: We guarantee the format of the platform id at manufacturing time.
let baseboard_id =
platform_id_to_baseboard_id(stream.peer_platform_id().as_str());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should platform_id_to_baseboard_id() be fallible? It's got some unwrap()s inside and used to only be reachable from within the trust-quorum crate, but now we can call it with an arbitrary string. (I don't know whether at this point we can assume steam.peer_platform_id() is guaranteed to be well-formed?)


let baseboard = learn_baseboard(&log).map_err(|err| {
format!("Failed to learn baseboard via device tree: {err}")
})?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We .unwrap() the result of HardwareManager::new() in long_running_tasks. This is adding another way that could fail and panic - maybe worth filing an issue that we shouldn't unwrap there or should change this function to retry and not fail?

DatasetInitialization { errors: Vec<String> },

#[error("Error resetting sled: {0}")]
SledReset(String),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove this variant too now I think

This was actually smaller than expected, and didn't require pulling out
the bootstore code yet. The old type used in public APIs was just moved
into sled-agent-types-versions as the "initial" version. Then Inventory
was updated to use `BaseboardId` instead of `Baseboard`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants