Clean up and remove stale code#10518
Conversation
Clean up the bootstrap agent in preparation for multirack support. This ended up requiring a few other significant changes in other crates. As a first step, we should recognize that the bootstrap agent dropshot API is never going to support TLS. This is why we have sprockets running over the bootstrap network in the first place. All cross sled interactions should use sprockets for security. However, we need to bootstrap the system and so we allow wicketd to talk to the local bootstrap agent on it's own sled. This occurs over the lockstep API. With some further changes, we were able to remove the bootstrap agent API altogether. Additionally, some sprockets related names were changed to enhance clarity. For historical reasons, we used `Baseboard` throughout wicket, but with the changes made to remove the bootstrap agent dropshot API, we only have access to `BaseboardId`s, as that's what the majority of the system works with. Therefore, we migrated many uses of `Baseboard` to use `BaseboardId` instead. This is a much more straightforward type to use. As part of the change to using `BaseboardId`s in wicketd, we wanted to remove optionality for it. The switch zone should always know its own Baseboard through the injected file, and so now we fail immediately if the file does not exist. In order to get a non-optional `BaseboardId` we wanted to make conversion from `Baseboard` to `BaseboardId` infallible. I had started doing this by removing the `Unknown` variant from `Baseboard` and reworking all users. When I had finished I had discovered that this changes a few APIs in a non-backwards compatible manner. As this PR is already very large, I didn't want to make that change. So I went back and re-added the `Unknown` variant, panicking if it ever gets used. I will clean this up in a follow up that also removes the bootstore which is a primary user. The bootstore itself cannot ever have an `Unknown` variant in production, so this is fine for now. To fix a chicken and egg problem where the now gone bootstrap agent api was needed for wicketd to learn its bootstrap-agent-lockstep-api address, we now plumb that address through SMF directly, since the bootstrap agent already knows it when creating the switch zone. Since rack reset doesn't actually work, I also went ahead and removed all the related code. This simplifies things immensely. We can add back working reset functionality in the future, that will work a lot more like the clean slate script. There was also an old, unused `UpdateManager`, that was only called from an unused bootstrap agent API. That has also been removed.
|
This needs to be tested on a racklette and on a4x2. |
3dc634a to
2723254
Compare
jgallagher
left a comment
There was a problem hiding this comment.
This is fantastic. It was cathartic just reading over some of these deletions.
| Gimlet { identifier: String, model: String, revision: u32 }, | ||
|
|
||
| // XXX: WARNING: All uses of this variant have been removed from the | ||
| // code base. |
There was a problem hiding this comment.
I might word this slightly more strongly: "This variant must never be instantiated. All remaining uses of it are in match statements that WILL PANIC if a value of this variant is encountered." or something like that.
| } | ||
| // Query the local bootstrap agent for all known peers | ||
| let client = bootstrap_agent_lockstep_client::Client::new( | ||
| &format!("http://[{bootstrap_agent_lockstep_address}]"), |
There was a problem hiding this comment.
I don't think this is right; this is a SocketAddr now so it'll include its own brackets and port:
| &format!("http://[{bootstrap_agent_lockstep_address}]"), | |
| &format!("http://{bootstrap_agent_lockstep_address}"), |
| log, "Failed to get baseboard IDs"; | ||
| "err" => #%err, | ||
| ); | ||
| continue; |
There was a problem hiding this comment.
I don't think this continue is right: it means we'll skip the sleep below and immediately retry on failure (turning this into a hot loop).
Separately: what should we do with sleds if we fail here? The older code had the nice-ish property that if you failed to talk to some sleds, that would be reflected by updating sleds to only include the ones we successfully contacted.
| // We don't know our own baseboard, which is a very questionable | ||
| // state to be in! For now, we will hard-code the possibly | ||
| // locations where we could be running: scrimlets can only be in | ||
| // cubbies 14 or 16, so we refuse to update either of those. |
There was a problem hiding this comment.
There's a lot to love in all the removals of this PR, but this might be at the top of my list. Thank you for getting rid of this thing I was embarrassed about even as I wrote it.
| serde_json::from_str(&baseboard_file) | ||
| .map_err(|e| CmdError::Failure(anyhow!(e)))?; | ||
|
|
||
| // TODO-correctness `Baseboard::unknown()` is slated for removal |
There was a problem hiding this comment.
"slated for removal" as of May 2023 😢
Glad it's finally happening!
| "baseboard-identifier", | ||
| "baseboard-model", | ||
| "baseboard-revision", | ||
| "boot-storage-unit", |
There was a problem hiding this comment.
I don't think we use this one here
| "bootstrap-agent-lockstep-address", | ||
| "astring", | ||
| &format!( | ||
| "[{}]:{BOOTSTRAP_AGENT_LOCKSTEP_PORT}", |
There was a problem hiding this comment.
Only because I just got bit by this somewhere else... this is correct but would be More Obviously Correct as
&SocketAddr::new(
self.inner.global_zone_bootstrap_ip,
BOOTSTRAP_AGENT_LOCKSTEP_PORT,
).to_string()| pub struct ServiceManagerInner { | ||
| log: Logger, | ||
| global_zone_bootstrap_ip: Ipv6Addr, | ||
| global_zone_bootstrap_link_local_address: Ipv6Addr, |
There was a problem hiding this comment.
I should know this but if I did I've forgotten - what's the difference between these two IPs?
| &SwitchService::Lldpd { baseboard: Baseboard::Unknown }, | ||
| &SwitchService::Lldpd { | ||
| // TODO: Why is this unknown here? | ||
| // This method seems to only get called from an HTTP handler. |
There was a problem hiding this comment.
Agreed, and also: that HTTP handler and this method will be going away under #10167, so I'd vote for doing no work here
| "baseboard_id" => %baseboard_id | ||
| ); | ||
| found = Some((ip, stream)); | ||
| break; |
There was a problem hiding this comment.
This will abort any remaining tasks in set, which introduces some async cancellation points. I think that's probably fine since these are only trying to establish connections? But wanted to note it in case it seems not fine or if we want to document that this is an intentional cancellation choice.
| "err" => #%err, | ||
| let mut found = None; | ||
| while let Some(res) = set.join_next().await { | ||
| if let Ok((ip, Ok(stream))) = res { |
There was a problem hiding this comment.
This discards errors - can we at least log them?
| if let Ok((ip, Ok(stream))) = res { | ||
| // Safety: We guarantee the format of the platform id at manufacturing time. | ||
| let baseboard_id = | ||
| platform_id_to_baseboard_id(stream.peer_platform_id().as_str()); |
There was a problem hiding this comment.
Should platform_id_to_baseboard_id() be fallible? It's got some unwrap()s inside and used to only be reachable from within the trust-quorum crate, but now we can call it with an arbitrary string. (I don't know whether at this point we can assume steam.peer_platform_id() is guaranteed to be well-formed?)
|
|
||
| let baseboard = learn_baseboard(&log).map_err(|err| { | ||
| format!("Failed to learn baseboard via device tree: {err}") | ||
| })?; |
There was a problem hiding this comment.
We .unwrap() the result of HardwareManager::new() in long_running_tasks. This is adding another way that could fail and panic - maybe worth filing an issue that we shouldn't unwrap there or should change this function to retry and not fail?
| DatasetInitialization { errors: Vec<String> }, | ||
|
|
||
| #[error("Error resetting sled: {0}")] | ||
| SledReset(String), |
There was a problem hiding this comment.
Can remove this variant too now I think
This was actually smaller than expected, and didn't require pulling out the bootstore code yet. The old type used in public APIs was just moved into sled-agent-types-versions as the "initial" version. Then Inventory was updated to use `BaseboardId` instead of `Baseboard`.
Clean up the bootstrap agent in preparation for multirack support. This
ended up requiring a few other significant changes in other crates.
As a first step, we should recognize that the bootstrap agent dropshot
API is never going to support TLS. This is why we have sprockets
running over the bootstrap network in the first place. All cross sled
interactions should use sprockets for security. However, we need to
bootstrap the system and so we allow wicketd to talk to the local
bootstrap agent on it's own sled. This occurs over the lockstep API.
With some further changes, we were able to remove the bootstrap agent
API altogether. Additionally, some sprockets related names were changed
to enhance clarity.
For historical reasons, we used
Baseboardthroughout wicket, but withthe changes made to remove the bootstrap agent dropshot API, we only
have access to
BaseboardIds, as that's what the majority of the systemworks with. Therefore, we migrated many uses of
Baseboardto useBaseboardIdinstead. This is a much more straightforward type to use.As part of the change to using
BaseboardIds in wicketd, we wanted toremove optionality for it. The switch zone should always know its own
Baseboard through the injected file, and so now we fail immediately if
the file does not exist.
In order to get a non-optional
BaseboardIdwe wanted to makeconversion from
BaseboardtoBaseboardIdinfallible. I had starteddoing this by removing the
Unknownvariant fromBaseboardandreworking all users. When I had finished I had discovered that this
changes a few APIs in a non-backwards compatible manner. As this PR is
already very large, I didn't want to make that change. So I went back
and re-added the
Unknownvariant, panicking if it ever gets used.I will clean this up in a follow up that also removes the bootstore
which is a primary user. The bootstore itself cannot ever have an
Unknownvariant in production, so this is fine for now.To fix a chicken and egg problem where the now gone bootstrap agent
api was needed for wicketd to learn its bootstrap-agent-lockstep-api
address, we now plumb that address through SMF directly, since the
bootstrap agent already knows it when creating the switch zone.
Since rack reset doesn't actually work, I also went ahead and removed
all the related code. This simplifies things immensely. We can add back
working reset functionality in the future, that will work a lot more
like the clean slate script.
There was also an old, unused
UpdateManager, that was only called froman unused bootstrap agent API. That has also been removed.