Fix VPA ASG chain race conditions and map key bugs #175
Fix VPA ASG chain race conditions and map key bugs #175geofffranks wants to merge 5 commits intodevelopfrom
Conversation
…re store.Add The VPA periodic poller races with silk-cni cmdAdd: once a container appears in the datastore, VPA tries to create its asg-* chain, but netout-* does not exist yet. Move store.Add and the VPA force-poll calls to after all chain initialization is complete. - Initialize netout/netin/IPv6 chains before writing to the datastore - Add cleanupChains/cleanupStore closures to unwind on error paths - Fix pollResp variable to avoid double-close of the same response body - Fix defer resp.Body.Close() missing in cmdDel Made-with: Cursor
…ors during container teardown During cmdDel, Garden destroys the container but VPA can race against the iptables cleanup window. Add a Deleting flag to the container datastore so the planner skips containers currently being torn down. - Add Deleting bool field to Container (omitempty for backward compat) - Add MarkForDelete to Datastore interface and Store implementation - Reorder cmdDel: MarkForDelete first, all iptables cleanup, then Delete - Planner skips containers with Deleting: true in GetASGRulesAndChains - Regenerate counterfeiter fakes for Datastore interface Made-with: Cursor
Fixes multiple bugs in vxlan-policy-agent that caused spurious errors when creating or cleaning up ASG iptables chains: - Handle Exists() errors in replaceChainRules: distinguish 'parent chain does not exist' from 'no jump rule' and return ParentChainNotReadyErr instead of silently proceeding to the first-time creation path - Skip containers with ParentChainNotReadyErr in SyncASGsForContainers instead of accumulating them as errors - Flush-and-reuse chain on 'chain already exists' in enforce() to self-heal after VPA restarts or partial failures - Fix CleanupOrphanedASGsChains to look up the correct map key (map is keyed by parent chain name, not ASG chain name) - Guard desiredChains against empty chain names from failed enforcement Made-with: Cursor
| @@ -270,15 +300,20 @@ func (e *Enforcer) enforce(logger lager.Logger, table string, parentChain string | |||
| } | |||
| } | |||
|
|
|||
| logger.Debug("insert-chain", lager.Data{"parent-chain": parentChain, "table": table, "index": 1, "rule": rules.IPTablesRule{"-j", chainName}}) | |||
| err = e.iptables.BulkInsert(table, parentChain, 1, rules.IPTablesRule{"-j", chainName}) | |||
There was a problem hiding this comment.
The current order is:
- Clear Chain
- Milliseconds / microseconds pass where there are no iptables rules for the app and the app defaults to the accept policy (I think).
- Bulk Append
There should not be a gap. If bulk append fails, the app would be left without any ASGs until the next cycle.
There was a problem hiding this comment.
I don't believe this is the case, the replaceChain is calling enforce with either a brand-new chain (on a container that hasn't started running yet), or it's calling enforce on a candidate chain that will be swapped in by adding its jump rule, removing the old chain, and renaming the candidate to the official name.
However I do think there's a bug in there where a container could get in an infinite failure loop. And also a bug where a failure during the old chain's cleanup could result in new rules being applied for a period of time, and then being briefly removed/reapplied.
TY for calliing this out!
| if parentReady, _ := e.iptables.ChainExists(c.Table, c.ParentChain); !parentReady { | ||
| logger.Info("parent-chain-not-ready", lager.Data{"parent": c.ParentChain, "error": err.Error()}) | ||
| return &ParentChainNotReadyErr{ParentChain: c.ParentChain} | ||
| } | ||
| originalChainJumpExists = false |
There was a problem hiding this comment.
I think you need to check this error? What if the chain DOES exist, but another error is returned ?(which would result in false, ERR here, but it is not really "false" it is "unknown").
…ensuring jump rules were in place and parent chains are ready
Fixes two critical bugs in replaceChainRules recovery logic: 1. Scenario 1 (Indefinite Loop): If an orphaned original chain exists without a jump rule, RenameChain fails. Now we flush and delete the orphaned chain before renaming. 2. Scenario 2 (Traffic Regression): If both jump rules exist, we used to delete the candidate chain (new rules). Now we delete the original chain (old rules) instead. Also adds tests for these recovery paths. Made-with: Cursor
Summary
Backward Compatibility
Breaking Change? no