
OCPBUGS-55179: Set internalTrafficPolicy to Local on dns-default service#474

Open
pkhedeka wants to merge 3 commits into openshift:master from pkhedeka:OCPBUGS-55179

Conversation

@pkhedeka

@pkhedeka pkhedeka commented May 5, 2026

Summary

DNS queries from pods on primary user-defined networks (UDNs) scatter randomly across dns-default pods on all nodes instead of being handled by the local node's dns-default pod.

UDN pods reach the dns-default service (172.30.0.10) through a path that crosses network boundaries: UDN pod → UDN cluster router → management port → default network → OVN load balancer → dns-default pod. The OVN load balancer on the default network treats dns-default as a standard ClusterIP service and distributes traffic across all backend pods cluster-wide.

Setting internalTrafficPolicy: Local restricts the EndpointSlice to contain only the node-local dns-default backend. Since dns-default runs as a DaemonSet with a pod on every node, this is safe and guarantees that DNS queries are always handled by the local pod.

The existing trafficDistribution: PreferSameNode (added in PR #457) provides a soft hint for same-node preference but does not guarantee locality for cross-network UDN traffic. internalTrafficPolicy: Local provides the hard constraint needed.

Why DNS operator, not OVN?

The DNS query scatter from primary UDN pods is not caused by an OVN routing bug. In OCP 4.14+ (interconnect mode), each node has its own zone and its own UDN cluster router with a single route — no ECMP happens at the UDN router level. The scatter occurs downstream: once UDN traffic crosses into the default network via the management port, the OVN load balancer distributes dns-default traffic across all backend pods because the service has internalTrafficPolicy: Cluster (the default). The fix belongs in the DNS operator, which owns the dns-default service spec.

Verification

  • make test — passed
  • make verify — passed
  • On-cluster validation on OCP 4.20 (UPI, 5 nodes) and OCP 4.21 (IPI, 5 nodes):
    • Baseline (ITP=Cluster): DNS queries from UDN pods confirmed scattering across multiple nodes
    • Fix (ITP=Local): OVN LB on each node confirmed to have only the local dns-default pod as backend
    • DNS resolution from UDN pods works correctly with ITP=Local

Jira: https://issues.redhat.com/browse/OCPBUGS-55179

Summary by CodeRabbit

  • Chores
    • DNS service updated to prefer routing pod-to-pod traffic on the local node (internal traffic policy set to Local), reducing cross-node network hops.
  • Tests
    • Added unit coverage to verify the DNS service internal traffic policy is set to Local.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 5, 2026
@openshift-ci-robot
Contributor

@pkhedeka: This pull request references Jira Issue OCPBUGS-55179, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 5, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8d6473b0-e0b4-4099-ad9f-3ad1a08ab767

📥 Commits

Reviewing files that changed from the base of the PR and between a52cbbf and f66ff49.

📒 Files selected for processing (1)
  • pkg/operator/controller/controller_dns_service_test.go

Walkthrough

Adds spec.internalTrafficPolicy: Local to the DNS Service manifest and a unit test asserting that desiredDNSService(...) produces a Service with Spec.InternalTrafficPolicy set to Local.

Changes

DNS Service internalTrafficPolicy

  • Manifest (pkg/manifests/assets/dns/service.yaml): adds spec.internalTrafficPolicy: Local to the DNS Service manifest.
  • Tests (pkg/operator/controller/controller_dns_service_test.go): new test TestDesiredDNSServiceInternalTrafficPolicy calls desiredDNSService(...) and asserts svc.Spec.InternalTrafficPolicy is non-nil and equals corev1.ServiceInternalTrafficPolicyLocal.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the title directly and specifically summarizes the main change (adding internalTrafficPolicy: Local to the dns-default service) and matches the stated PR objectives.
  • Linked Issues check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: check skipped because no linked issues were found for this pull request.


@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 5, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alebedev87 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

Hi @pkhedeka. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pkhedeka pkhedeka marked this pull request as ready for review May 5, 2026 07:41
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci openshift-ci Bot requested review from alebedev87 and rfredette May 5, 2026 07:42
@pkhedeka
Author

pkhedeka commented May 5, 2026

/jira refresh

@openshift-ci-robot
Contributor

@pkhedeka: This pull request references Jira Issue OCPBUGS-55179, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@pkhedeka
Author

pkhedeka commented May 5, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 5, 2026
@openshift-ci-robot
Contributor

@pkhedeka: This pull request references Jira Issue OCPBUGS-55179, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter


@openshift-ci openshift-ci Bot requested a review from anuragthehatter May 5, 2026 08:07
DNS queries from pods on primary user-defined networks (UDNs) scatter
randomly across dns-default pods on all nodes instead of being handled
by the local node's dns-default pod.

UDN pods reach the dns-default service (172.30.0.10) through a path
that crosses network boundaries: UDN pod -> UDN cluster router ->
management port -> default network -> OVN load balancer -> dns-default
pod. The OVN load balancer on the default network treats dns-default
as a standard ClusterIP service and distributes traffic across all
backend pods cluster-wide.

Setting internalTrafficPolicy to Local restricts the EndpointSlice to
contain only the node-local dns-default backend. Since dns-default
runs as a DaemonSet with a pod on every node, this is safe and
guarantees that DNS queries are always handled by the local pod.

The existing trafficDistribution: PreferSameNode (added in PR openshift#457)
provides a soft hint for same-node preference but does not guarantee
locality for cross-network UDN traffic. internalTrafficPolicy: Local
provides the hard constraint needed.

Jira: https://issues.redhat.com/browse/OCPBUGS-55179
Signed-off-by: Parikshit Khedekar <pkhedeka@redhat.com>
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 5, 2026
@openshift-ci-robot
Contributor

@pkhedeka: This pull request references Jira Issue OCPBUGS-55179, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@Miciah
Contributor

Miciah commented May 5, 2026

Setting internalTrafficPolicy: Local restricts the EndpointSlice to contain only the node-local dns-default backend. Since dns-default runs as a DaemonSet with a pod on every node, this is safe and guarantees that DNS queries are always handled by the local pod.

This is a false assumption. There are situations in which the dns-default pods do not necessarily run on every node, for example if a node is tainted, if dnses.operator.openshift.io/cluster specifies spec.nodePlacement with a custom selector, or if a node is shutting down during a cluster upgrade. We really need the behavior of prefer-local with fallback. That's why we set trafficDistribution: PreferSameNode and why ovn-kubernetes has special prefer-local-with-fallback behavior for the dns-default pods: https://github.com/openshift/ovn-kubernetes/blob/952886fd8af2ca3ecf1717a2cb69311a32f25c06/go-controller/pkg/ovn/controller/services/lb_config.go#L258-L264

spec:
  # clusterIP will be automatically managed.
  # selector is set at runtime
  internalTrafficPolicy: Local
Contributor


You would also need to update daemonsetConfigChanged in pkg/operator/controller/controller_dns_daemonset.go to ensure an existing daemon set gets updated.

Author


I'm a bit confused here; I assume this change will help me keep internalTrafficPolicy set to Local.

Comment thread pkg/operator/controller/controller_dns_service_test.go Outdated
Comment on lines +206 to +210
if !assert.NotNil(t, svc.Spec.InternalTrafficPolicy, "expected InternalTrafficPolicy to be set") {
return
}
assert.Equal(t, corev1.ServiceInternalTrafficPolicyLocal, *svc.Spec.InternalTrafficPolicy,
"dns-default service must use InternalTrafficPolicy Local to ensure node-local DNS resolution")
Contributor


Is there a reason to define a new test rather than putting the assertions in TestDesiredDNSDaemonset?

Author

@pkhedeka pkhedeka May 5, 2026


I didn't see an existing service test and likely missed it, so I added a new one instead. I can try moving the assertions into TestDesiredDNSDaemonset, though given the suggestions you've shared above, I feel this change will need more hands than mine.

@pkhedeka
Author

pkhedeka commented May 5, 2026

Setting internalTrafficPolicy: Local restricts the EndpointSlice to contain only the node-local dns-default backend. Since dns-default runs as a DaemonSet with a pod on every node, this is safe and guarantees that DNS queries are always handled by the local pod.

This is a false assumption. There are situations in which the dns-default pods do not necessarily run on every node, for example if a node is tainted, if dnses.operator.openshift.io/cluster specifies spec.nodePlacement with a custom selector, or if a node is shutting down during a cluster upgrade. We really need the behavior of prefer-local with fallback. That's why we set trafficDistribution: PreferSameNode and why ovn-kubernetes has special prefer-local-with-fallback behavior for the dns-default pods: https://github.com/openshift/ovn-kubernetes/blob/952886fd8af2ca3ecf1717a2cb69311a32f25c06/go-controller/pkg/ovn/controller/services/lb_config.go#L258-L264

Thank you for the suggestions; I agree that without fallback, tainted nodes, upgrades, and node replacement would all turn into issues. Initially I considered an OVN-Kubernetes-based check/fix but couldn't reach a firm conclusion there. Given that trafficDistribution: PreferSameNode is what is really needed, I now feel I shouldn't be looking at the DNS operator, and would like to hear your suggestion. In any case, this centers on a UDN traffic issue, so maybe I can learn something more from your pointers.

  • I read PR #457 and understood that OVN-Kubernetes would add support for prefer-local behavior, with the current ask being an interim measure. So perhaps this isn't the moment to be making the change here. :/

pkhedeka and others added 2 commits May 5, 2026 23:04
Co-authored-by: Miciah Dashiel Butler Masters <mmasters@redhat.com>
Co-authored-by: Miciah Dashiel Butler Masters <mmasters@redhat.com>

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.
