OCPBUGS-85457: Add init container for Manila node daemonset#552

Merged: openshift-merge-bot[bot] merged 1 commit into openshift:main from shiftstack:manila-csi-wait-for-nfs on May 13, 2026

mandre (Member) commented May 12, 2026

The Manila CSI driver node plugin fatally exits on startup when the NFS CSI plugin socket (/var/lib/kubelet/plugins/csi-nfsplugin/csi.sock) is not yet available. After node disruptions, both DaemonSets restart concurrently and the Manila driver crashes because upstream ProbeForever only retries on DeadlineExceeded, not on Unavailable (connection refused). This causes excessive container restarts (>3), failing the CI invariant test.

In addition to an upcoming fix to make the driver more tolerant, add a wait-for-nfs-plugin init container to the Manila node DaemonSet that polls for the NFS plugin socket before allowing the csi-driver container to start. This ensures proper startup ordering between the two DaemonSets without requiring changes to the upstream driver.
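
The startup ordering described above could be expressed roughly as the following init container. This is a minimal sketch rather than the merged manifest: the container name, the ${DRIVER_IMAGE} placeholder, the fwd-plugin-dir mount, and the securityContext fields come from this PR's discussion, while the shell loop, the one-second poll interval, the log message, and the resource requests are assumptions made for illustration.

```yaml
# Sketch only: fragment of the Manila node DaemonSet pod spec (spec.template.spec).
initContainers:
  - name: wait-for-nfs-plugin
    image: ${DRIVER_IMAGE}
    # Assumption: the driver image ships /bin/sh with `[` and `sleep`.
    command:
      - /bin/sh
      - -c
      - |
        # Block until the NFS CSI plugin socket file exists (assumed 1s poll interval).
        until [ -S /var/lib/kubelet/plugins/csi-nfsplugin/csi.sock ]; do
          echo "waiting for NFS CSI plugin socket"
          sleep 1
        done
    securityContext:
      readOnlyRootFilesystem: true
      privileged: false
    resources:
      requests:
        cpu: 10m      # assumed value
        memory: 50Mi  # assumed value
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
      - name: fwd-plugin-dir
        mountPath: /var/lib/kubelet/plugins/csi-nfsplugin
```

Note that a `-S` test only confirms that a socket file is present, not that the NFS plugin is accepting connections on it. A check of this kind narrows the startup race but may not close it entirely, which is one reason a more tolerant driver is still worth having.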

openshift-ci-robot added labels May 12, 2026: jira/valid-reference (indicates that this PR references a valid Jira ticket of any type), jira/invalid-bug (indicates that a referenced Jira bug is invalid for the branch this PR is targeting)
@openshift-ci-robot

@mandre: This pull request references Jira Issue OCPBUGS-85457, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

The change adds an initContainers section to a DaemonSet manifest in the OpenStack Manila overlay. A new init container named wait-for-nfs-plugin is introduced that waits for the NFS CSI plugin socket at /var/lib/kubelet/plugins/csi-nfsplugin/csi.sock to exist before the main CSI driver container starts. The init container uses the same driver image, mounts the fwd-plugin-dir volume, and is configured with restrictive security settings including a read-only root filesystem and no privilege escalation.

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'OCPBUGS-85457: Add init container for Manila node daemonset' is fully related to the main change: adding an initContainers step to the Manila node DaemonSet that waits for the NFS plugin socket. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR only modifies Kubernetes manifests (node_add_driver.yaml). No Ginkgo tests are present or modified. The check does not apply. |
| Test Structure And Quality | ✅ Passed | Test files use standard Go testing, not Ginkgo. Custom check targets Ginkgo patterns (BeforeEach/AfterEach, It blocks), so is not applicable to this PR. |
| MicroShift Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added. Changes are limited to Kubernetes manifest files and generated build artifacts. The custom check does not apply. |
| Single Node OpenShift (SNO) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests added. This PR modifies only a Kubernetes YAML manifest file. The SNO compatibility check applies only when new tests are added. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | DaemonSet initContainer addition has no topology-unfriendly constraints. Compatible with SNO, TNF, TNA, and HyperShift. |
| OTE Binary Stdout Contract | ✅ Passed | PR modifies only Kubernetes YAML manifest files. OTE Binary Stdout Contract applies to Go process-level code, not YAML configuration files. Check not applicable. |
| IPv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR does not add any Ginkgo e2e tests. It only modifies a Kubernetes manifest file to add an initContainers section. The custom check is not applicable. |
| Description check | ✅ Passed | The PR description clearly explains the problem (Manila CSI driver crashes on startup due to missing NFS plugin socket), the root cause (upstream ProbeForever doesn't retry on Unavailable), and the solution (add init container to wait for NFS plugin socket). Description is directly related to the changeset. |

openshift-ci Bot requested review from gnufied and mpatlasov on May 12, 2026 12:38
openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 12, 2026

coderabbitai Bot left a comment

🧹 Nitpick comments (1)
assets/overlays/openstack-manila/patches/node_add_driver.yaml (1)

104-114: ⚡ Quick win

Harden init container access to the hostPath mount.

On line 114, the mount is writable by default even though this container only checks that the socket exists. Consider also disallowing privilege escalation in lines 104-107.

Suggested hardening patch
       initContainers:
         - name: wait-for-nfs-plugin
           image: ${DRIVER_IMAGE}
@@
           securityContext:
             readOnlyRootFilesystem: true
+            allowPrivilegeEscalation: false
             privileged: false
@@
           volumeMounts:
             - name: fwd-plugin-dir
               mountPath: /var/lib/kubelet/plugins/csi-nfsplugin
+              readOnly: true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/overlays/openstack-manila/patches/node_add_driver.yaml` around lines
104 - 114, Init container allows writable hostPath and privilege escalation;
tighten the init container securityContext: set the fwd-plugin-dir volumeMount
for the init container to readOnly: true and add allowPrivilegeEscalation: false
(and optionally runAsNonRoot: true) inside the same securityContext block
alongside readOnlyRootFilesystem and privileged to prevent escalation while the
init container only checks socket existence; update the securityContext and the
volumeMount entries (symbols: securityContext, readOnlyRootFilesystem,
privileged, allowPrivilegeEscalation, runAsNonRoot, volumeMounts -> name:
fwd-plugin-dir, mountPath: /var/lib/kubelet/plugins/csi-nfsplugin) accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d754f127-80c7-452e-ab2d-0258a0541eb7

📥 Commits

Reviewing files that changed from the base of the PR and between af78998 and 5ad4fc0.

⛔ Files ignored due to path filters (2)
  • assets/overlays/openstack-manila/generated/hypershift/node.yaml is excluded by !**/generated/**
  • assets/overlays/openstack-manila/generated/standalone/node.yaml is excluded by !**/generated/**
📒 Files selected for processing (1)
  • assets/overlays/openstack-manila/patches/node_add_driver.yaml

mandre (Member, Author) commented May 12, 2026

/jira refresh

openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) May 12, 2026

@openshift-ci-robot

@mandre: This pull request references Jira Issue OCPBUGS-85457, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

openshift-ci-robot removed the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) May 12, 2026

mandre (Member, Author) commented May 12, 2026

Infra issue...
/test e2e-openstack-manila-csi

mandre (Member, Author) commented May 12, 2026

/test e2e-openstack-manila-csi

mandre (Member, Author) commented May 13, 2026

/test hypershift-e2e-openstack-aws-csi-manila

mandre (Member, Author) commented May 13, 2026

The hypershift jobs are currently broken
/override ci/prow/hypershift-e2e-openstack-aws-csi-manila

openshift-ci Bot (Contributor) commented May 13, 2026

@mandre: Overrode contexts on behalf of mandre: ci/prow/hypershift-e2e-openstack-aws-csi-manila

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci Bot (Contributor) commented May 13, 2026

@mandre: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack | 5ad4fc0 | link | false | /test e2e-openstack |

Full PR test history. Your PR dashboard.

mandre (Member, Author) commented May 13, 2026

/verified later @mandre

openshift-ci-robot added labels May 13, 2026: verified-later, verified (signifies that the PR passed pre-merge verification criteria)

@openshift-ci-robot

@mandre: This PR has been marked to be verified later by @mandre.

cpu: 10m
memory: 50Mi
terminationMessagePolicy: FallbackToLogsOnError
initContainers:

Contributor

Any chance we could file this as an upstream bug and add a link to that here? We'll want to backport this so this approach makes sense for now, but I believe it would be a good idea to remove this in the future if possible.

Contributor

Discussed this offline and this isn't an either-or thing: we actually want both fixes. This change to add an explicit order for containers to start is correct (and not expensive) when we know one depends on the other. Let's capture that instead (via a comment or the commit message) for future us

Member, Author

I've reported the issue for the manila driver at kubernetes/cloud-provider-openstack#3111.

@openshift-ci-robot

@mandre: This pull request references Jira Issue OCPBUGS-85457, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@stephenfin (Contributor)

/approve
/lgtm

openshift-ci Bot added the lgtm label (indicates that a PR is ready to be merged) May 13, 2026

openshift-ci Bot (Contributor) commented May 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mandre, stephenfin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-bot Bot merged commit 5027bfc into openshift:main on May 13, 2026
22 of 23 checks passed

@openshift-ci-robot

@mandre: Jira Issue OCPBUGS-85457: All pull requests linked via external trackers have merged:

This pull request has the verified-later tag and will need to be manually moved to VERIFIED after testing. Jira Issue OCPBUGS-85457 has been moved to the MODIFIED state.

stephenfin deleted the manila-csi-wait-for-nfs branch on May 13, 2026 13:03

mandre (Member, Author) commented May 13, 2026

/jira backport release-4.22,release-4.21,release-4.20

@openshift-ci-robot

@mandre: Failed to create backported issues:

WARNING: Unexpected sprint field type []interface {} on source issue. Please update sprint manually on clone.

mandre (Member, Author) commented May 13, 2026

/jira backport release-4.22,release-4.21,release-4.20

@openshift-ci-robot

@mandre: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.22
/cherrypick release-4.21
/cherrypick release-4.20

@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #553

@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #554

@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #555

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria.
  • verified-later

5 participants