Skip to content

OCPBUGS-77532: Fix premature node schedulability marking in taint checking#470

Open
davidesalerno wants to merge 2 commits into
openshift:masterfrom
davidesalerno:ocpbugs77532
Open

OCPBUGS-77532: Fix premature node schedulability marking in taint checking#470
davidesalerno wants to merge 2 commits into
openshift:masterfrom
davidesalerno:ocpbugs77532

Conversation

@davidesalerno
Copy link
Copy Markdown
Contributor

@davidesalerno davidesalerno commented Apr 3, 2026

Summary

  • Fix tolerationsTolerateTaints() to require all taints on a node to be tolerated (not just any one) before considering the node schedulable
  • Add comprehensive unit tests for the taint-toleration matching logic

Context

The tolerationsTolerateTaints function in daemonsetUpdateIsSafe was returning true as soon as it found any single taint tolerated by any toleration. This meant a node with multiple taints (e.g. node-role.kubernetes.io/master and node.kubernetes.io/disk-pressure) would be considered schedulable even if only the master taint was tolerated, violating standard Kubernetes scheduling semantics.

The fix ensures every taint on a node must be tolerated by at least one toleration before returning true.

See also: #459 (review)

Test plan

  • Unit tests added for tolerationsTolerateTaints covering:
    • No taints (returns true)
    • Single taint tolerated / not tolerated
    • Multiple taints: all tolerated, only some tolerated (bug scenario), none tolerated
    • Wildcard toleration matching all taints
  • make build passes
  • make test passes
  • make verify passes

🤖 Assisted by Claude Code via /jira:solve OCPBUGS-77532

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Corrected taint tolerance evaluation for DNS DaemonSet pod scheduling to ensure accurate assessment of node compatibility.
  • Tests

    • Added comprehensive test coverage for taint tolerance validation across multiple scenarios.

davidesalerno and others added 2 commits April 3, 2026 16:33
…lity

OCPBUGS-77532: The tolerationsTolerateTaints function incorrectly returned
true when any single taint was tolerated by any toleration. This caused nodes
with multiple taints to be marked as schedulable even when only one taint was
tolerated, which could lead to DNS pods being scheduled on nodes where they
cannot actually run.

The fix ensures every taint on a node must be tolerated by at least one
toleration before the node is considered schedulable, matching standard
Kubernetes scheduling semantics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Davide Salerno <dsalerno@redhat.com>
Cover the key scenarios for taint-toleration matching: no taints, single
taint matched/unmatched, multiple taints with all/some/none tolerated,
and wildcard toleration. These tests verify the fix for OCPBUGS-77532
and prevent regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Davide Salerno <dsalerno@redhat.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2026
@openshift-ci-robot openshift-ci-robot added the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Apr 3, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 3, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 3, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-77532, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Fix tolerationsTolerateTaints() to require all taints on a node to be tolerated (not just any one) before considering the node schedulable
  • Add comprehensive unit tests for the taint-toleration matching logic

Context

The tolerationsTolerateTaints function in daemonsetUpdateIsSafe was returning true as soon as it found any single taint tolerated by any toleration. This meant a node with multiple taints (e.g. node-role.kubernetes.io/master and node.kubernetes.io/disk-pressure) would be considered schedulable even if only the master taint was tolerated, violating standard Kubernetes scheduling semantics.

The fix ensures every taint on a node must be tolerated by at least one toleration before returning true.

See also: #459 (review)

Test plan

  • Unit tests added for tolerationsTolerateTaints covering:
  • No taints (returns true)
  • Single taint tolerated / not tolerated
  • Multiple taints: all tolerated, only some tolerated (bug scenario), none tolerated
  • Wildcard toleration matching all taints
  • make build passes
  • make test passes
  • make verify passes

🤖 Generated with Claude Code via /jira:solve OCPBUGS-77532

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 3, 2026

Walkthrough

Updated the tolerationsTolerateTaints function to correctly evaluate whether all taints are tolerated by checking if each taint has a matching toleration, inverting the previous logic. Added comprehensive unit tests covering multiple scenarios including no taints, single and multiple taints, empty tolerations, and wildcard tolerations.

Changes

Cohort / File(s) Summary
Taint Tolerance Logic
pkg/operator/controller/controller_dns_daemonset.go
Refactored tolerationsTolerateTaints to verify all taints are tolerated by iterating through each taint and immediately returning false if any lacks a matching toleration. Removed early-exit condition for empty taints and inverted final return logic.
Test Coverage
pkg/operator/controller/controller_dns_daemonset_test.go
Added TestTolerationsTolerateTaints with comprehensive subtests covering: no taints, single tolerated/untolerated taint, multiple taints (all/partial toleration), empty tolerations list, and wildcard toleration matching.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main fix: correcting premature node schedulability marking in taint checking, which aligns with the core change to tolerationsTolerateTaints logic.
Stable And Deterministic Test Names ✅ Passed Test file uses standard Go testing framework with static, non-dynamic test names in t.Run() subtests.
Test Structure And Quality ✅ Passed Custom check for Ginkgo test code is not applicable to this standard Go unit test using testing.T framework for a utility function.
Microshift Test Compatibility ✅ Passed PR adds standard Go unit test using *testing.T, not Ginkgo e2e tests. Custom check is specific to Ginkgo e2e tests only.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The added test TestTolerationsTolerateTaints is a standard Go unit test using the testing package, not a Ginkgo e2e test. No Ginkgo test constructs are present.
Topology-Aware Scheduling Compatibility ✅ Passed The PR fixes the tolerationsTolerateTaints utility function to properly validate that all taints on a node are tolerated. No topology-incompatible scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed This PR modifies operator controller code, not OTE test binary code. No stdout writes or test suite setup functions are present in the changes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The test added is a standard Go unit test using the testing package, not a Ginkgo e2e test. The custom check applies only to Ginkgo e2e tests (It(), Describe(), Context(), When(), etc.), and no such patterns were found in this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign miciah for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@davidesalerno davidesalerno marked this pull request as ready for review April 16, 2026 10:06
@davidesalerno davidesalerno changed the title WIP: OCPBUGS-77532: Fix premature node schedulability marking in taint checking OCPBUGS-77532: Fix premature node schedulability marking in taint checking Apr 16, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2026
@openshift-ci openshift-ci Bot requested review from Miciah and gcs278 April 16, 2026 10:06
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-77532, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Fix tolerationsTolerateTaints() to require all taints on a node to be tolerated (not just any one) before considering the node schedulable
  • Add comprehensive unit tests for the taint-toleration matching logic

Context

The tolerationsTolerateTaints function in daemonsetUpdateIsSafe was returning true as soon as it found any single taint tolerated by any toleration. This meant a node with multiple taints (e.g. node-role.kubernetes.io/master and node.kubernetes.io/disk-pressure) would be considered schedulable even if only the master taint was tolerated, violating standard Kubernetes scheduling semantics.

The fix ensures every taint on a node must be tolerated by at least one toleration before returning true.

See also: #459 (review)

Test plan

  • Unit tests added for tolerationsTolerateTaints covering:
  • No taints (returns true)
  • Single taint tolerated / not tolerated
  • Multiple taints: all tolerated, only some tolerated (bug scenario), none tolerated
  • Wildcard toleration matching all taints
  • make build passes
  • make test passes
  • make verify passes

🤖 Assisted by Claude Code via /jira:solve OCPBUGS-77532

Summary by CodeRabbit

Release Notes

  • Bug Fixes

  • Corrected taint tolerance evaluation for DNS DaemonSet pod scheduling to ensure accurate assessment of node compatibility.

  • Tests

  • Added comprehensive test coverage for taint tolerance validation across multiple scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/operator/controller/controller_dns_daemonset_test.go (1)

891-899: Optional: add an explicit PreferNoSchedule case to lock intended behavior.

Current cases focus on NoSchedule/NoExecute. Adding one PreferNoSchedule case would document whether this helper should treat it as strictly untolerated or still acceptable for “safe update” purposes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/controller/controller_dns_daemonset_test.go` around lines 891 -
899, Add a new test case in controller_dns_daemonset_test.go alongside the
existing cases (the table that includes the case named "wildcard toleration
tolerates all taints") that exercises a taint with Effect:
corev1.TaintEffectPreferNoSchedule and a toleration set that should indicate how
the helper treats PreferNoSchedule for safe updates; specify the tolerations and
taints explicitly and add the expected boolean (true if PreferNoSchedule should
be considered tolerated for safe updates, or false if it should be treated as
untolerated), so the intended behavior for PreferNoSchedule is documented and
enforced by the test harness that uses the same test helpers and assertions as
the surrounding cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/operator/controller/controller_dns_daemonset_test.go`:
- Around line 891-899: Add a new test case in controller_dns_daemonset_test.go
alongside the existing cases (the table that includes the case named "wildcard
toleration tolerates all taints") that exercises a taint with Effect:
corev1.TaintEffectPreferNoSchedule and a toleration set that should indicate how
the helper treats PreferNoSchedule for safe updates; specify the tolerations and
taints explicitly and add the expected boolean (true if PreferNoSchedule should
be considered tolerated for safe updates, or false if it should be treated as
untolerated), so the intended behavior for PreferNoSchedule is documented and
enforced by the test harness that uses the same test helpers and assertions as
the surrounding cases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 12ac8b68-358c-4189-a12f-696ea67649b5

📥 Commits

Reviewing files that changed from the base of the PR and between 3d21411 and 2999248.

📒 Files selected for processing (2)
  • pkg/operator/controller/controller_dns_daemonset.go
  • pkg/operator/controller/controller_dns_daemonset_test.go

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 16, 2026

@davidesalerno: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@grzpiotrowski
Copy link
Copy Markdown
Contributor

/assign

@bentito
Copy link
Copy Markdown
Contributor

bentito commented Apr 22, 2026

/assign @grzpiotrowski

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants