OCPBUGS-85457: Add init container for Manila node daemonset#552
Conversation
The Manila CSI driver node plugin fatally exits on startup when the NFS CSI plugin socket (/var/lib/kubelet/plugins/csi-nfsplugin/csi.sock) is not yet available. After node disruptions, both DaemonSets restart concurrently and the Manila driver crashes because upstream ProbeForever only retries on DeadlineExceeded, not on Unavailable (connection refused). This causes excessive container restarts (>3), failing the CI invariant test. Add a wait-for-nfs-plugin init container to the Manila node DaemonSet that polls for the NFS plugin socket before allowing the csi-driver container to start. This ensures proper startup ordering between the two DaemonSets without requiring changes to the upstream driver.
|
@mandre: This pull request references Jira Issue OCPBUGS-85457, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
📝 WalkthroughWalkthroughThe change adds an 🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
assets/overlays/openstack-manila/patches/node_add_driver.yaml (1)
104-114: ⚡ Quick winHarden init container access to the hostPath mount.
On Line 114, the mount is writable by default even though this container only checks socket existence. Consider also disallowing privilege escalation in Line 104-107.
Suggested hardening patch
initContainers: - name: wait-for-nfs-plugin image: ${DRIVER_IMAGE} @@ securityContext: readOnlyRootFilesystem: true + allowPrivilegeEscalation: false privileged: false @@ volumeMounts: - name: fwd-plugin-dir mountPath: /var/lib/kubelet/plugins/csi-nfsplugin + readOnly: true🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@assets/overlays/openstack-manila/patches/node_add_driver.yaml` around lines 104 - 114, Init container allows writable hostPath and privilege escalation; tighten the init container securityContext: set the fwd-plugin-dir volumeMount for the init container to readOnly: true and add allowPrivilegeEscalation: false (and optionally runAsNonRoot: true) inside the same securityContext block alongside readOnlyRootFilesystem and privileged to prevent escalation while the init container only checks socket existence; update the securityContext and the volumeMount entries (symbols: securityContext, readOnlyRootFilesystem, privileged, allowPrivilegeEscalation, runAsNonRoot, volumeMounts -> name: fwd-plugin-dir, mountPath: /var/lib/kubelet/plugins/csi-nfsplugin) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@assets/overlays/openstack-manila/patches/node_add_driver.yaml`:
- Around line 104-114: Init container allows writable hostPath and privilege
escalation; tighten the init container securityContext: set the fwd-plugin-dir
volumeMount for the init container to readOnly: true and add
allowPrivilegeEscalation: false (and optionally runAsNonRoot: true) inside the
same securityContext block alongside readOnlyRootFilesystem and privileged to
prevent escalation while the init container only checks socket existence; update
the securityContext and the volumeMount entries (symbols: securityContext,
readOnlyRootFilesystem, privileged, allowPrivilegeEscalation, runAsNonRoot,
volumeMounts -> name: fwd-plugin-dir, mountPath:
/var/lib/kubelet/plugins/csi-nfsplugin) accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d754f127-80c7-452e-ab2d-0258a0541eb7
⛔ Files ignored due to path filters (2)
assets/overlays/openstack-manila/generated/hypershift/node.yamlis excluded by!**/generated/**assets/overlays/openstack-manila/generated/standalone/node.yamlis excluded by!**/generated/**
📒 Files selected for processing (1)
assets/overlays/openstack-manila/patches/node_add_driver.yaml
|
/jira refresh |
|
@mandre: This pull request references Jira Issue OCPBUGS-85457, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Infra issue... |
|
/test e2e-openstack-manila-csi |
|
/test hypershift-e2e-openstack-aws-csi-manila |
|
The hypershift jobs are currently broken |
|
@mandre: Overrode contexts on behalf of mandre: ci/prow/hypershift-e2e-openstack-aws-csi-manila DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mandre: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/verified later @mandre |
|
@mandre: This PR has been marked to be verified later by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| cpu: 10m | ||
| memory: 50Mi | ||
| terminationMessagePolicy: FallbackToLogsOnError | ||
| initContainers: |
There was a problem hiding this comment.
Any chance we could file this as an upstream bug and add a link to that here? We'll want to backport this so this approach makes sense for now, but I believe it would be a good idea to remove this in the future if possible.
There was a problem hiding this comment.
Discussed this offline and this isn't an either-or thing: we actually want both fixes. This change to add an explicit order for containers to start is correct (and not expensive) when we know one depends on the other. Let's capture that instead (via a comment or the commit message) for future us
There was a problem hiding this comment.
I've reported the issue for the manila driver at kubernetes/cloud-provider-openstack#3111.
|
@mandre: This pull request references Jira Issue OCPBUGS-85457, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mandre, stephenfin The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
5027bfc
into
openshift:main
|
@mandre: Jira Issue OCPBUGS-85457: All pull requests linked via external trackers have merged: This pull request has the DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira backport release-4.22,release-4.21,release-4.20 |
|
@mandre: Failed to create backported issues: WARNING: Unexpected sprint field type []interface {} on source issue. Please update sprint manually on clone. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira backport release-4.22,release-4.21,release-4.20 |
|
@mandre: The following backport issues have been created:
Queuing cherrypicks to the requested branches to be created after this PR merges: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-ci-robot: new pull request created: #553 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@openshift-ci-robot: new pull request created: #554 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@openshift-ci-robot: new pull request created: #555 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
The Manila CSI driver node plugin fatally exits on startup when the NFS CSI plugin socket (
/var/lib/kubelet/plugins/csi-nfsplugin/csi.sock) is not yet available. After node disruptions, both DaemonSets restart concurrently and the Manila driver crashes because upstream ProbeForever only retries on DeadlineExceeded, not on Unavailable (connection refused). This causes excessive container restarts (>3), failing the CI invariant test.In addition to an upcoming fix to make the driver more tolerant, add a
wait-for-nfs-plugininit container to the Manila node DaemonSet that polls for the NFS plugin socket before allowing the csi-driver container to start. This ensures proper startup ordering between the two DaemonSets without requiring changes to the upstream driver.