[fm] Add guardrails for fact payload representations#10508
Conversation
adb0d8a to
012096a
Compare
012096a to
ddeac01
Compare
| //! directory, bump `KNOWN_VERSIONS` and `dbinit.sql` (see | ||
| //! `schema/crdb/README.adoc`) — even though, from CockroachDB's point of view, | ||
| //! the column is unchanged JSONB and only the data changes — add a case to | ||
| //! `validate_data_migration()` in `nexus/tests/integration_tests/schema.rs`, and |
There was a problem hiding this comment.
I do still feel like this is perhaps the most critical piece here - validating data migrations, with:
"Shove old format into DB"
"Run migration"
"Migrated old data can deserialize as new data"
But hopefully the rest of scaffolding here helps catch "when we should do that".
There was a problem hiding this comment.
I wonder if there's a nice way we could leverage some of the machinery here to make it easier to write those data migration tests. At present they involve a lot of boilerplate, but if there was a way to, say, use the generated sample types for the old and new versions which we already have, and just slap down a macro or run a CLI tool or something that says "make this into a data migration test", that would be really cool.
Persisted diagnosis-engine fact payloads (the fm_fact.payload JSONB column) had no signal when their serialized shape changed. Because deployed databases already hold rows in the old shape, such a change is a data migration, not just a code change, and CockroachDB's schema-diff tests do not catch it (the column is unchanged JSONB; only the data changes). Add payload-stability tests in nexus/fm/src/diagnosis: - schemars JSON-schema snapshot per DE payload (structure: variants, fields, types, and embedded-enum domains, including cross-crate types like ZpoolHealth) - strum-enforced serde byte sample per variant (exact on-disk bytes), plus a round-trip test proving the pinned format still deserializes - a central every_de_payload_is_pinned check over DiagnosisEngineKind so a future DE cannot silently skip the guardrail The schemars/strum derives are cfg(test)-gated, so they stay dev-only with no production footprint. Module docs in diagnosis/mod.rs explain additive (serde-compatible) vs breaking changes and how to write a DE-scoped JSONB data migration that reuses the schema/crdb mechanism.
ddeac01 to
d7de080
Compare
| //! ## Is my change a data migration? | ||
| //! | ||
| //! A change is **backward-compatible** if and only if every value already on | ||
| //! disk still deserializes under the new type. Otherwise it is **breaking** and | ||
| //! existing rows must be rewritten. | ||
| //! | ||
| //! Examples: | ||
| //! - **Backward-compatible**: a new `#[serde(tag = "kind")]` variant; a new | ||
| //! `Option<T>` or `#[serde(default)]` field; *widening* a field's domain. | ||
| //! Importantly, old data can still be parsed. | ||
| //! - **Breaking**: renaming or removing a field; changing a field's type or | ||
| //! value encoding; *narrowing* a domain; renaming or removing an enum variant | ||
| //! that may exist in old rows; changing the `kind` tag. | ||
| //! | ||
| //! ## Writing the migration | ||
| //! | ||
| //! Rewrite existing rows with a data-only `UPDATE` in a new | ||
| //! `schema/crdb/<name>/up1.sql`. As a worked example, renaming the | ||
| //! `last_seen_health` key to `health` in the disk DE's `zpool_unhealthy` | ||
| //! payload: |
There was a problem hiding this comment.
It seems like this is suggesting that if one changes a fact payload in a way that breaks one of these tests, the only solution is to do a data migration with a complex jsonb_set expression. I feel like this could also be potentially solved by adding a new fact type that a DE can produce, and having it replace the old fact with the new one when it executes. Is it worth discussing the tradeoffs here?
There was a problem hiding this comment.
Oh, I guess that's discussed at the very end of this comment. I feel like maybe it would flow better to begin with "first, try to avoid data migrations if you can get away with it. Here's how: [...]. But, if you can't avoid having to write a data migration, you have to do this: [...]"
| /// One obviously-fake sample per variant. The guardrail labels each sample | ||
| /// by the serde `kind` tag it serializes to, so there is no separate name | ||
| /// to keep in sync. Implementations should `match` exhaustively over their | ||
| /// variants, so adding a variant forces a sample to be added here | ||
| /// (otherwise it fails to compile). | ||
| fn samples() -> Vec<Self>; |
There was a problem hiding this comment.
Why are these necessary? I don't totally get why we need the samples in addition to using the JsonSchema derive --- what does this let us check that can't be checked using the schema? It would be nice to have an explanation of that someplace.
There was a problem hiding this comment.
Also, here's a thought, which I realize may not be practical but could be worth considering: JSON Schema has a notion of examples for fields, and expectorate has an attribute to help with generating them: https://docs.rs/schemars/1.2.1/schemars/derive.JsonSchema.html#example
I wonder if we could use that instead of the hand-written samples array, somehow? If we could, it might make the user experience of adding samples slightly nicer since you could just slap the example attribute on the fields of your new payload and have it generate the example value for you. Of course, it might not be that easy, but it was just a thought.
| /// Guardrail tests snapshotting the serialized representation of every persisted | ||
| /// DE fact payload. See the module-level "Evolving persisted fact payloads" | ||
| /// docs above before regenerating any golden file: a diff here means a | ||
| /// persisted payload changed shape, which is a data migration. | ||
| #[cfg(test)] | ||
| mod payload_stability { |
There was a problem hiding this comment.
As far as I can tell, this module contains both the mechanism of the payload stability tests and the places which one is supposed to add new fact payload types and so on. It's a little bit hard to tell which is which at an immediate glance. It would be nice to have some documentation on "how do I add new DEs, and how do I add new fact payloads to a DE?".
And, maybe we should make the separation between the test machinery (which someone who is adding a new DE/fact payload type doesn't need to touch) and the test data/inputs (which you have to add your new thing to) a bit clearer? We could put the registered() function and the NO_PAYLOAD const (which IIUC are where you're supposed to add your new thing) at the very top of the module, or even put them in a separate module, to make that distinction more obvious.
| /// sample of every variant, and the guardrail derives the schema and the | ||
| /// serialized bytes from the type itself. The impl lives in the DE's own | ||
| /// module, so changing a payload stays self-contained there. | ||
| #[cfg(test)] |
There was a problem hiding this comment.
I kinda wonder if this actually shouldn't be cfg(test)-only? Instead, maybe we should make the CaseBuilder::add_fact function bound its argument with T: FactPayload, to ensure that you can only add new facts that implement the trait, rather than any arbitrary Serializeable thing? Otherwise, it seems pretty easy to accidentally escape the checking there.
We could make it so that just the samples() function is the part that's #[cfg(test)] gated, if we want to avoid compiling in a bunch of extra code fo rgenerating the samples.
| #[cfg_attr(test, derive(schemars::JsonSchema, strum::EnumDiscriminants))] | ||
| #[cfg_attr( | ||
| test, | ||
| strum_discriminants(name(DiskFactKind), derive(strum::EnumIter)) | ||
| )] |
There was a problem hiding this comment.
I kinda wonder if, at some point, we might wanna just go and have our own derive(FactPayload) macro or something that adds all of the requisite boilerplate? If we could figure out a way to also make that macro add the new payload type to the array of all payloads in the test, that would be really cool (but might be pretty difficult)...but, at least, it could ensure that the tagged enum repr with tag = "kind" is used, and that you derive JsonSchema and all the required strum stuff...
Not something we have to do in this PR, but as we add more DEs and fact types, I would like to make the developer experience of adding a new payload as lightweight as possible.
| //! directory, bump `KNOWN_VERSIONS` and `dbinit.sql` (see | ||
| //! `schema/crdb/README.adoc`) — even though, from CockroachDB's point of view, | ||
| //! the column is unchanged JSONB and only the data changes — add a case to | ||
| //! `validate_data_migration()` in `nexus/tests/integration_tests/schema.rs`, and |
There was a problem hiding this comment.
I wonder if there's a nice way we could leverage some of the machinery here to make it easier to write those data migration tests. At present they involve a lot of boilerplate, but if there was a way to, say, use the generated sample types for the old and new versions which we already have, and just slap down a macro or run a CLI tool or something that says "make this into a data migration test", that would be really cool.
Persisted diagnosis-engine fact payloads (the fm_fact.payload JSONB column) had no signal when their serialized shape changed. Because deployed databases already hold rows in the old shape, such a change is a data migration, and CockroachDB's schema-diff tests do not catch it (the column is unchanged JSONB; only the data changes).
Add payload-stability tests in
nexus/fm/src/diagnosis:The schemars/strum derives are cfg(test)-gated, so they stay dev-only with no production footprint. Module docs in diagnosis/mod.rs explain additive (serde-compatible) vs breaking changes and how to write a DE-scoped JSONB data migration that reuses the schema/crdb mechanism.