Skip to content

[fm] Add guardrails for fact payload representations#10508

Draft
smklein wants to merge 3 commits into
fm-disk-diagnoserfrom
fm-fact-guardrails
Draft

[fm] Add guardrails for fact payload representations#10508
smklein wants to merge 3 commits into
fm-disk-diagnoserfrom
fm-fact-guardrails

Conversation

@smklein
Copy link
Copy Markdown
Collaborator

@smklein smklein commented May 29, 2026

Persisted diagnosis-engine fact payloads (the fm_fact.payload JSONB column) had no signal when their serialized shape changed. Because deployed databases already hold rows in the old shape, such a change is a data migration, and CockroachDB's schema-diff tests do not catch it (the column is unchanged JSONB; only the data changes).

Add payload-stability tests in nexus/fm/src/diagnosis:

  • schemars JSON-schema snapshot per DE payload (structure: variants, fields, types, and embedded-enum domains, including cross-crate types like ZpoolHealth)
  • a central every_de_kind_is_registered_or_exempt check over DiagnosisEngineKind so a future DE cannot silently skip the guardrail

The schemars/strum derives are cfg(test)-gated, so they stay dev-only with no production footprint. Module docs in diagnosis/mod.rs explain additive (serde-compatible) vs breaking changes and how to write a DE-scoped JSONB data migration that reuses the schema/crdb mechanism.

@smklein smklein force-pushed the fm-fact-guardrails branch from adb0d8a to 012096a Compare May 29, 2026 00:47
@smklein smklein changed the title [fm] Add guardrail pinning persisted fact payload representations [fm] Add guardrails for fact payload representations May 29, 2026
@smklein smklein force-pushed the fm-fact-guardrails branch from 012096a to ddeac01 Compare May 29, 2026 15:40
//! directory, bump `KNOWN_VERSIONS` and `dbinit.sql` (see
//! `schema/crdb/README.adoc`) — even though, from CockroachDB's point of view,
//! the column is unchanged JSONB and only the data changes — add a case to
//! `validate_data_migration()` in `nexus/tests/integration_tests/schema.rs`, and
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do still feel like this is perhaps the most critical piece here - validating data migrations, with:

"Shove old format into DB"
"Run migration"
"Migrated old data can deserialize as new data"

But hopefully the rest of scaffolding here helps catch "when we should do that".

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a nice way we could leverage some of the machinery here to make it easier to write those data migration tests. At present they involve a lot of boilerplate, but if there was a way to, say, use the generated sample types for the old and new versions which we already have, and just slap down a macro or run a CLI tool or something that says "make this into a data migration test", that would be really cool.

Persisted diagnosis-engine fact payloads (the fm_fact.payload JSONB column)
had no signal when their serialized shape changed. Because deployed databases
already hold rows in the old shape, such a change is a data migration, not
just a code change, and CockroachDB's schema-diff tests do not catch it (the
column is unchanged JSONB; only the data changes).

Add payload-stability tests in nexus/fm/src/diagnosis:
- schemars JSON-schema snapshot per DE payload (structure: variants, fields,
  types, and embedded-enum domains, including cross-crate types like
  ZpoolHealth)
- strum-enforced serde byte sample per variant (exact on-disk bytes), plus a
  round-trip test proving the pinned format still deserializes
- a central every_de_payload_is_pinned check over DiagnosisEngineKind so a
  future DE cannot silently skip the guardrail

The schemars/strum derives are cfg(test)-gated, so they stay dev-only with no
production footprint. Module docs in diagnosis/mod.rs explain additive
(serde-compatible) vs breaking changes and how to write a DE-scoped JSONB
data migration that reuses the schema/crdb mechanism.
@smklein smklein force-pushed the fm-fact-guardrails branch from ddeac01 to d7de080 Compare May 29, 2026 17:41
Comment on lines +20 to +39
//! ## Is my change a data migration?
//!
//! A change is **backward-compatible** if and only if every value already on
//! disk still deserializes under the new type. Otherwise it is **breaking** and
//! existing rows must be rewritten.
//!
//! Examples:
//! - **Backward-compatible**: a new `#[serde(tag = "kind")]` variant; a new
//! `Option<T>` or `#[serde(default)]` field; *widening* a field's domain.
//! Importantly, old data can still be parsed.
//! - **Breaking**: renaming or removing a field; changing a field's type or
//! value encoding; *narrowing* a domain; renaming or removing an enum variant
//! that may exist in old rows; changing the `kind` tag.
//!
//! ## Writing the migration
//!
//! Rewrite existing rows with a data-only `UPDATE` in a new
//! `schema/crdb/<name>/up1.sql`. As a worked example, renaming the
//! `last_seen_health` key to `health` in the disk DE's `zpool_unhealthy`
//! payload:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this is suggesting that if one changes a fact payload in a way that breaks one of these tests, the only solution is to do a data migration with a complex jsonb_set expression. I feel like this could also be potentially solved by adding a new fact type that a DE can produce, and having it replace the old fact with the new one when it executes. Is it worth discussing the tradeoffs here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I guess that's discussed at the very end of this comment. I feel like maybe it would flow better to begin with "first, try to avoid data migrations if you can get away with it. Here's how: [...]. But, if you can't avoid having to write a data migration, you have to do this: [...]"

Comment thread nexus/fm/src/diagnosis/mod.rs Outdated
Comment on lines +137 to +142
/// One obviously-fake sample per variant. The guardrail labels each sample
/// by the serde `kind` tag it serializes to, so there is no separate name
/// to keep in sync. Implementations should `match` exhaustively over their
/// variants, so adding a variant forces a sample to be added here
/// (otherwise it fails to compile).
fn samples() -> Vec<Self>;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these necessary? I don't totally get why we need the samples in addition to using the JsonSchema derive --- what does this let us check that can't be checked using the schema? It would be nice to have an explanation of that someplace.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, here's a thought, which I realize may not be practical but could be worth considering: JSON Schema has a notion of examples for fields, and expectorate has an attribute to help with generating them: https://docs.rs/schemars/1.2.1/schemars/derive.JsonSchema.html#example

I wonder if we could use that instead of the hand-written samples array, somehow? If we could, it might make the user experience of adding samples slightly nicer since you could just slap the example attribute on the fields of your new payload and have it generate the example value for you. Of course, it might not be that easy, but it was just a thought.

Comment on lines +145 to +150
/// Guardrail tests snapshotting the serialized representation of every persisted
/// DE fact payload. See the module-level "Evolving persisted fact payloads"
/// docs above before regenerating any golden file: a diff here means a
/// persisted payload changed shape, which is a data migration.
#[cfg(test)]
mod payload_stability {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, this module contains both the mechanism of the payload stability tests and the places which one is supposed to add new fact payload types and so on. It's a little bit hard to tell which is which at an immediate glance. It would be nice to have some documentation on "how do I add new DEs, and how do I add new fact payloads to a DE?".

And, maybe we should make the separation between the test machinery (which someone who is adding a new DE/fact payload type doesn't need to touch) and the test data/inputs (which you have to add your new thing to) a bit clearer? We could put the registered() function and the NO_PAYLOAD const (which IIUC are where you're supposed to add your new thing) at the very top of the module, or even put them in a separate module, to make that distinction more obvious.

/// sample of every variant, and the guardrail derives the schema and the
/// serialized bytes from the type itself. The impl lives in the DE's own
/// module, so changing a payload stays self-contained there.
#[cfg(test)]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda wonder if this actually shouldn't be cfg(test)-only? Instead, maybe we should make the CaseBuilder::add_fact function bound its argument with T: FactPayload, to ensure that you can only add new facts that implement the trait, rather than any arbitrary Serializeable thing? Otherwise, it seems pretty easy to accidentally escape the checking there.

We could make it so that just the samples() function is the part that's #[cfg(test)] gated, if we want to avoid compiling in a bunch of extra code fo rgenerating the samples.

Comment on lines +28 to +32
#[cfg_attr(test, derive(schemars::JsonSchema, strum::EnumDiscriminants))]
#[cfg_attr(
test,
strum_discriminants(name(DiskFactKind), derive(strum::EnumIter))
)]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda wonder if, at some point, we might wanna just go and have our own derive(FactPayload) macro or something that adds all of the requisite boilerplate? If we could figure out a way to also make that macro add the new payload type to the array of all payloads in the test, that would be really cool (but might be pretty difficult)...but, at least, it could ensure that the tagged enum repr with tag = "kind" is used, and that you derive JsonSchema and all the required strum stuff...

Not something we have to do in this PR, but as we add more DEs and fact types, I would like to make the developer experience of adding a new payload as lightweight as possible.

//! directory, bump `KNOWN_VERSIONS` and `dbinit.sql` (see
//! `schema/crdb/README.adoc`) — even though, from CockroachDB's point of view,
//! the column is unchanged JSONB and only the data changes — add a case to
//! `validate_data_migration()` in `nexus/tests/integration_tests/schema.rs`, and
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a nice way we could leverage some of the machinery here to make it easier to write those data migration tests. At present they involve a lot of boilerplate, but if there was a way to, say, use the generated sample types for the old and new versions which we already have, and just slap down a macro or run a CLI tool or something that says "make this into a data migration test", that would be really cool.

@AlejandroME AlejandroME added this to the 21 milestone Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants