[fm] Add guardrails for fact payload representations by smklein · Pull Request #10508 · oxidecomputer/omicron

smklein · 2026-05-29T00:43:03Z

Persisted diagnosis-engine fact payloads (the fm_fact.payload JSONB column) had no signal when their serialized shape changed. Because deployed databases already hold rows in the old shape, such a change is a data migration, and CockroachDB's schema-diff tests do not catch it (the column is unchanged JSONB; only the data changes).

Add payload-stability tests in nexus/fm/src/diagnosis:

schemars JSON-schema snapshot per DE payload (structure: variants, fields, types, and embedded-enum domains, including cross-crate types like ZpoolHealth)
a central every_de_kind_is_registered_or_exempt check over DiagnosisEngineKind so a future DE cannot silently skip the guardrail

The schemars/strum derives are cfg(test)-gated, so they stay dev-only with no production footprint. Module docs in diagnosis/mod.rs explain additive (serde-compatible) vs breaking changes and how to write a DE-scoped JSONB data migration that reuses the schema/crdb mechanism.

smklein · 2026-05-29T17:09:16Z

+//! directory, bump `KNOWN_VERSIONS` and `dbinit.sql` (see
+//! `schema/crdb/README.adoc`) — even though, from CockroachDB's point of view,
+//! the column is unchanged JSONB and only the data changes — add a case to
+//! `validate_data_migration()` in `nexus/tests/integration_tests/schema.rs`, and


I do still feel like this is perhaps the most critical piece here - validating data migrations, with:

"Shove old format into DB"
"Run migration"
"Migrated old data can deserialize as new data"

But hopefully the rest of scaffolding here helps catch "when we should do that".

I wonder if there's a nice way we could leverage some of the machinery here to make it easier to write those data migration tests. At present they involve a lot of boilerplate, but if there was a way to, say, use the generated sample types for the old and new versions which we already have, and just slap down a macro or run a CLI tool or something that says "make this into a data migration test", that would be really cool.

Persisted diagnosis-engine fact payloads (the fm_fact.payload JSONB column) had no signal when their serialized shape changed. Because deployed databases already hold rows in the old shape, such a change is a data migration, not just a code change, and CockroachDB's schema-diff tests do not catch it (the column is unchanged JSONB; only the data changes). Add payload-stability tests in nexus/fm/src/diagnosis: - schemars JSON-schema snapshot per DE payload (structure: variants, fields, types, and embedded-enum domains, including cross-crate types like ZpoolHealth) - strum-enforced serde byte sample per variant (exact on-disk bytes), plus a round-trip test proving the pinned format still deserializes - a central every_de_payload_is_pinned check over DiagnosisEngineKind so a future DE cannot silently skip the guardrail The schemars/strum derives are cfg(test)-gated, so they stay dev-only with no production footprint. Module docs in diagnosis/mod.rs explain additive (serde-compatible) vs breaking changes and how to write a DE-scoped JSONB data migration that reuses the schema/crdb mechanism.

hawkw · 2026-06-01T19:07:03Z

+//! ## Is my change a data migration?
+//!
+//! A change is **backward-compatible** if and only if every value already on
+//! disk still deserializes under the new type. Otherwise it is **breaking** and
+//! existing rows must be rewritten.
+//!
+//! Examples:
+//! - **Backward-compatible**: a new `#[serde(tag = "kind")]` variant; a new
+//!   `Option<T>` or `#[serde(default)]` field; *widening* a field's domain.
+//!   Importantly, old data can still be parsed.
+//! - **Breaking**: renaming or removing a field; changing a field's type or
+//!   value encoding; *narrowing* a domain; renaming or removing an enum variant
+//!   that may exist in old rows; changing the `kind` tag.
+//!
+//! ## Writing the migration
+//!
+//! Rewrite existing rows with a data-only `UPDATE` in a new
+//! `schema/crdb/<name>/up1.sql`. As a worked example, renaming the
+//! `last_seen_health` key to `health` in the disk DE's `zpool_unhealthy`
+//! payload:


It seems like this is suggesting that if one changes a fact payload in a way that breaks one of these tests, the only solution is to do a data migration with a complex jsonb_set expression. I feel like this could also be potentially solved by adding a new fact type that a DE can produce, and having it replace the old fact with the new one when it executes. Is it worth discussing the tradeoffs here?

Oh, I guess that's discussed at the very end of this comment. I feel like maybe it would flow better to begin with "first, try to avoid data migrations if you can get away with it. Here's how: [...]. But, if you can't avoid having to write a data migration, you have to do this: [...]"

hawkw · 2026-06-01T19:09:44Z

+    /// One obviously-fake sample per variant. The guardrail labels each sample
+    /// by the serde `kind` tag it serializes to, so there is no separate name
+    /// to keep in sync. Implementations should `match` exhaustively over their
+    /// variants, so adding a variant forces a sample to be added here
+    /// (otherwise it fails to compile).
+    fn samples() -> Vec<Self>;


Why are these necessary? I don't totally get why we need the samples in addition to using the JsonSchema derive --- what does this let us check that can't be checked using the schema? It would be nice to have an explanation of that someplace.

Also, here's a thought, which I realize may not be practical but could be worth considering: JSON Schema has a notion of examples for fields, and expectorate has an attribute to help with generating them: https://docs.rs/schemars/1.2.1/schemars/derive.JsonSchema.html#example

I wonder if we could use that instead of the hand-written samples array, somehow? If we could, it might make the user experience of adding samples slightly nicer since you could just slap the example attribute on the fields of your new payload and have it generate the example value for you. Of course, it might not be that easy, but it was just a thought.

hawkw · 2026-06-01T19:15:28Z

+/// Guardrail tests snapshotting the serialized representation of every persisted
+/// DE fact payload. See the module-level "Evolving persisted fact payloads"
+/// docs above before regenerating any golden file: a diff here means a
+/// persisted payload changed shape, which is a data migration.
+#[cfg(test)]
+mod payload_stability {


As far as I can tell, this module contains both the mechanism of the payload stability tests and the places which one is supposed to add new fact payload types and so on. It's a little bit hard to tell which is which at an immediate glance. It would be nice to have some documentation on "how do I add new DEs, and how do I add new fact payloads to a DE?".

And, maybe we should make the separation between the test machinery (which someone who is adding a new DE/fact payload type doesn't need to touch) and the test data/inputs (which you have to add your new thing to) a bit clearer? We could put the registered() function and the NO_PAYLOAD const (which IIUC are where you're supposed to add your new thing) at the very top of the module, or even put them in a separate module, to make that distinction more obvious.

hawkw · 2026-06-01T19:19:41Z

+/// sample of every variant, and the guardrail derives the schema and the
+/// serialized bytes from the type itself. The impl lives in the DE's own
+/// module, so changing a payload stays self-contained there.
+#[cfg(test)]


I kinda wonder if this actually shouldn't be cfg(test)-only? Instead, maybe we should make the CaseBuilder::add_fact function bound its argument with T: FactPayload, to ensure that you can only add new facts that implement the trait, rather than any arbitrary Serializeable thing? Otherwise, it seems pretty easy to accidentally escape the checking there.

We could make it so that just the samples() function is the part that's #[cfg(test)] gated, if we want to avoid compiling in a bunch of extra code fo rgenerating the samples.

hawkw · 2026-06-01T19:29:27Z

+#[cfg_attr(test, derive(schemars::JsonSchema, strum::EnumDiscriminants))]
+#[cfg_attr(
+    test,
+    strum_discriminants(name(DiskFactKind), derive(strum::EnumIter))
+)]


I kinda wonder if, at some point, we might wanna just go and have our own derive(FactPayload) macro or something that adds all of the requisite boilerplate? If we could figure out a way to also make that macro add the new payload type to the array of all payloads in the test, that would be really cool (but might be pretty difficult)...but, at least, it could ensure that the tagged enum repr with tag = "kind" is used, and that you derive JsonSchema and all the required strum stuff...

Not something we have to do in this PR, but as we add more DEs and fact types, I would like to make the developer experience of adding a new payload as lightweight as possible.

hawkw · 2026-06-01T19:31:05Z

+//! directory, bump `KNOWN_VERSIONS` and `dbinit.sql` (see
+//! `schema/crdb/README.adoc`) — even though, from CockroachDB's point of view,
+//! the column is unchanged JSONB and only the data changes — add a case to
+//! `validate_data_migration()` in `nexus/tests/integration_tests/schema.rs`, and


I wonder if there's a nice way we could leverage some of the machinery here to make it easier to write those data migration tests. At present they involve a lot of boilerplate, but if there was a way to, say, use the generated sample types for the old and new versions which we already have, and just slap down a macro or run a CLI tool or something that says "make this into a data migration test", that would be really cool.

smklein force-pushed the fm-fact-guardrails branch from adb0d8a to 012096a Compare May 29, 2026 00:47

smklein changed the title ~~[fm] Add guardrail pinning persisted fact payload representations~~ [fm] Add guardrails for fact payload representations May 29, 2026

smklein force-pushed the fm-fact-guardrails branch from 012096a to ddeac01 Compare May 29, 2026 15:40

smklein commented May 29, 2026

View reviewed changes

smklein force-pushed the fm-fact-guardrails branch from ddeac01 to d7de080 Compare May 29, 2026 17:41

hawkw reviewed Jun 1, 2026

View reviewed changes

smklein added 2 commits June 1, 2026 17:53

better docs, clearer guardrails, tests of test infra

26a2caf

Fix false positives, better error messages

f87f248

AlejandroME added this to the 21 milestone Jun 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fm] Add guardrails for fact payload representations#10508

[fm] Add guardrails for fact payload representations#10508
smklein wants to merge 3 commits into
fm-disk-diagnoserfrom
fm-fact-guardrails

smklein commented May 29, 2026 •

edited

Loading

Uh oh!

smklein May 29, 2026

Uh oh!

hawkw Jun 1, 2026

Uh oh!

hawkw Jun 1, 2026

Uh oh!

hawkw Jun 1, 2026

Uh oh!

hawkw Jun 1, 2026

Uh oh!

hawkw Jun 1, 2026

Uh oh!

hawkw Jun 1, 2026

Uh oh!

hawkw Jun 1, 2026

Uh oh!

hawkw Jun 1, 2026

Uh oh!

hawkw Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

smklein commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

smklein commented May 29, 2026 •

edited

Loading