fix ProviderData.Update() semantics wrt controller-runtime client cache (#2008)
multi-io wants to merge 1 commit into kubermatic:main
Conversation
Only modify the in-memory representation on successful updates.

Previously, if the modifier did not change anything, the function would still, as a side effect, fill the passed-in machine struct with the latest version from the controller-runtime cache, which might be arbitrarily outdated. So e.g. if you called this function twice in a row, the first call might have performed an update and filled the struct with the latest state of the machine from the cluster, but the next call might have performed no update and filled the struct with an outdated state, potentially behind even what it was before the two calls. This could lead to hard-to-find bugs, because callers generally don't expect these semantics and proceed to work with the returned outdated struct.

This changes the function so that the machine struct the caller passes in is only modified if the update succeeded, in which case the struct is filled with the latest state from the cluster, post-update. If the modifier didn't change anything or the update failed, the struct is left untouched.

Signed-off-by: Olaf Klischat <olaf.klischat@gmail.com>
Adding the `do-not-merge/release-note-label-needed` label because no release-note block was detected; please follow our release note process to remove it.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in the relevant OWNERS files.
Hi @multi-io. Thanks for your PR. I'm waiting for a kubermatic member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
Pull request overview
Adjusts ProviderData.Update() / GetMachineUpdater() behavior so the caller-provided *Machine is only mutated when an actual update succeeds, avoiding surprising regressions caused by controller-runtime’s (potentially stale) cache on no-op updates.
Changes:
- Fetches into a separate `cachedMachine` instead of overwriting the caller's `machine` early.
- Applies modifiers to `cachedMachine`, performs `Update()` only when there's a real diff, and copies back into `machine` only after a successful update.
```diff
 // Check if we actually change something and only update if that is the case.
-unmodifiedMachine := machine.DeepCopy()
+unmodifiedMachine := cachedMachine.DeepCopy()
 for _, modify := range modifiers {
-	modify(machine)
+	modify(cachedMachine)
 }
-if equality.Semantic.DeepEqual(unmodifiedMachine, machine) {
+if equality.Semantic.DeepEqual(unmodifiedMachine, cachedMachine) {
```
This change also alters modifier execution semantics: modifiers now run against `cachedMachine`, while the caller-provided `machine` remains stale until after a successful `Update()`. Any modifier that closes over and reads `machine` (instead of only using the `updatedMachine` argument) will now make decisions on stale data and may introduce unintended diffs (e.g. pkg/cloudprovider/provider/azure/provider.go:769 checks `HasFinalizer(machine, ...)` inside the modifier). Consider preserving the old "modifiers see the freshly-`Get()`'d object" behavior by copying `cachedMachine` into `machine` before invoking modifiers and restoring the original `machine` value on no-op or error, or otherwise updating the affected call sites and documenting that modifiers must only use their argument.
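The pitfall described above can be illustrated with a minimal, self-contained sketch. The `Machine` type, the `hasFinalizer` helper, and the `"clean-up"` finalizer name here are hypothetical stand-ins, not the real machine-controller types:

```go
package main

import "fmt"

// Machine is a hypothetical, minimal stand-in for the real
// clusterv1alpha1.Machine type; names are illustrative only.
type Machine struct {
	Finalizers []string
}

func hasFinalizer(m *Machine, f string) bool {
	for _, x := range m.Finalizers {
		if x == f {
			return true
		}
	}
	return false
}

// makeBadModifier returns a modifier that wrongly consults the caller-held
// (possibly stale) object instead of the object it is handed.
func makeBadModifier(stale *Machine) func(*Machine) {
	return func(updatedMachine *Machine) {
		if !hasFinalizer(stale, "clean-up") {
			updatedMachine.Finalizers = append(updatedMachine.Finalizers, "clean-up")
		}
	}
}

// goodModifier only inspects its argument, so it stays correct
// under the new semantics.
func goodModifier(updatedMachine *Machine) {
	if !hasFinalizer(updatedMachine, "clean-up") {
		updatedMachine.Finalizers = append(updatedMachine.Finalizers, "clean-up")
	}
}

func main() {
	machine := &Machine{Finalizers: []string{"clean-up"}} // caller's stale copy
	cachedMachine := &Machine{}                           // freshly fetched object

	makeBadModifier(machine)(cachedMachine)
	fmt.Println(hasFinalizer(cachedMachine, "clean-up")) // false: skipped the add based on stale data

	goodModifier(cachedMachine)
	fmt.Println(hasFinalizer(cachedMachine, "clean-up")) // true
}
```

The bad modifier silently skips adding the finalizer because the caller's stale copy already has it, even though the freshly fetched object does not.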
```diff
@@ -99,20 +99,27 @@ func GetMachineUpdater(ctx context.Context, client ctrlruntimeclient.Client) Mac
 	// Store name here, because the machine can be nil if an update failed.
```
The comment says the passed-in machine can be nil, but this function dereferences `machine` immediately (`Namespace`/`Name`) and later calls `DeepCopyInto(machine)`, which would panic if it were nil. Either adjust the comment to reflect the actual contract (non-nil required) or add an explicit nil check that returns a clear error.
```diff
-// Store name here, because the machine can be nil if an update failed.
+if machine == nil {
+	return fmt.Errorf("machine must not be nil when applying modifiers")
+}
+// Store name here for use in the retry loop; machine is required to be non-nil.
```
```go
if equality.Semantic.DeepEqual(unmodifiedMachine, cachedMachine) {
	return nil
}
```
This is a subtle behavior change (the caller's object is no longer refreshed on no-op, and only updated on successful writes). Please add targeted unit tests for `GetMachineUpdater` to lock in the new semantics (no mutation on no-op or on update error; mutation on successful update) and to prevent regressions in conflict-retry behavior.
What this PR does / why we need it:
This changes `ProviderData.Update()` semantics to only modify the in-memory representation on successful updates.

Previously, if the modifier did not change anything, the function would still, as a side effect, fill the passed-in machine struct with the latest version from the controller-runtime cache, which is only ever updated asynchronously in the background and might be arbitrarily outdated.
So e.g. if you called this function twice in a row, the first call might've performed an update and filled the struct with the latest state of the machine from the cluster, but then the next call might've performed no update and filled the struct with an outdated state, potentially behind even what it was before the two calls.
This could lead to hard-to-find bugs, because callers generally don't expect these semantics and proceed to work with the returned outdated struct.
This PR changes the function so the machine struct that the caller passes in is only modified if the update succeeded, in which case the structure is filled with the latest state from the cluster, post-update.
If the modifier didn't change anything or the update failed, the struct is left untouched.
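The fixed flow can be sketched as a minimal, self-contained model. The real code operates on `clusterv1alpha1.Machine` through a controller-runtime client and compares with `equality.Semantic.DeepEqual`; here a plain struct, `reflect.DeepEqual`, and an in-memory "cache" stand in for all of that, and the annotation names are made up:

```go
package main

import (
	"fmt"
	"reflect"
)

// Machine is a minimal stand-in for clusterv1alpha1.Machine.
type Machine struct {
	Annotations map[string]string
}

func (m *Machine) DeepCopy() *Machine {
	out := &Machine{Annotations: map[string]string{}}
	for k, v := range m.Annotations {
		out.Annotations[k] = v
	}
	return out
}

// updateMachine sketches the fixed semantics: Get() into a separate
// cachedMachine, run the modifiers against it, skip the write when there
// is no diff, and only copy back into the caller's machine afterwards.
func updateMachine(cache *Machine, machine *Machine, modifiers ...func(*Machine)) error {
	cachedMachine := cache.DeepCopy() // client.Get() may return stale data

	unmodifiedMachine := cachedMachine.DeepCopy()
	for _, modify := range modifiers {
		modify(cachedMachine)
	}
	if reflect.DeepEqual(unmodifiedMachine, cachedMachine) {
		return nil // no-op: the caller's machine is left untouched
	}

	// client.Update() would run here; on success, refresh the caller's copy
	// with the post-update state.
	*machine = *cachedMachine.DeepCopy()
	return nil
}

func main() {
	// The caller's copy is newer than the cache: it already has the port ID.
	cache := &Machine{Annotations: map[string]string{"finalizer": "yes"}}
	machine := &Machine{Annotations: map[string]string{"finalizer": "yes", "port-id": "uuid-1234"}}

	// The modifier changes nothing relative to the cached state -> no-op.
	_ = updateMachine(cache, machine, func(m *Machine) { m.Annotations["finalizer"] = "yes" })

	// Previously this no-op would have overwritten machine with the stale
	// cache, losing port-id; now the caller's state survives.
	fmt.Println(machine.Annotations["port-id"]) // uuid-1234
}
```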
**Example:**
I haven't checked your providers, but in our fork the Openstack provider reconciles a machine roughly as follows (each Ux represents an update of the machine resource performed using `ProviderData.Update()`):

1. U1: ensure the finalizers `clean-up-openstack-server`, `clean-up-openstack-port`
2. U2: ensure the finalizers `machine-delete-finalizer`, `machine-node-delete-finalizer`
3. U3: ensure the finalizer `clean-up-openstack-port`
4. create the port `metakube-machine-<UID>` in Openstack
5. U4: write the port ID into a machine annotation
6. U5: ensure the finalizer `clean-up-openstack-server`
7. create the Openstack server

(1., 3. and 6. ensure the same finalizers because of a technicality -- we introduced these finalizers a while ago, and older machines didn't have them yet. While redundant, this shouldn't be a problem, but it is -- see below.)
The reconcile proceeds through the port creation in step 4 and successfully updates the machine, writing the port ID annotation, in step 5/U4.
Then it gets to U5, where it tries to ensure the Openstack server cleanup finalizer. That finalizer already exists on the machine, usually from U1 in the same reconcile run. The cache happens to already contain U1, but not U4. So the update routine detects no diff (since the finalizer is already in the cached version), does not send an update request, and overwrites the in-memory `machine` struct with the cached version, i.e. one where the port ID annotation is still `""`.

Then the controller proceeds to the Openstack server creation (step 7), which requires the port ID, which it gets from the machine struct, where it's set to `""`. So it calls the Openstack server creation with port ID `""`, which triggers an error (`Bad network format: missing 'uuid'`) in the Openstack API.

We had been seeing this error pop up randomly in our logs and in machine events (so users also reported it) and never understood where it came from until we dug into this update routine. Please note that we were involuntarily relying on the Openstack API to perform this error handling here; in other cases or with other providers, this might actually change or corrupt data in unwanted ways.
There are probably alternative ways to handle this, like disabling the cache, or fixing all the call sites of the update routine, or maybe waiting for the cache to contain a particular state in certain situations. Maybe you've done some of those things, in which case this PR is not needed.
Which issue(s) this PR fixes:
Fixes #
What type of PR is this?
/kind bug
Special notes for your reviewer:
Does this PR introduce a user-facing change? Then add your Release Note here:
Documentation: