
OCPBUGS-55179: Set internalTrafficPolicy to Local on dns-default service#474

Open
pkhedeka wants to merge 3 commits into openshift:master from pkhedeka:OCPBUGS-55179

Conversation

@pkhedeka

@pkhedeka pkhedeka commented May 5, 2026

Summary

DNS queries from pods on primary user-defined networks (UDNs) scatter randomly across dns-default pods on all nodes instead of being handled by the local node's dns-default pod.

UDN pods reach the dns-default service (172.30.0.10) through a path that crosses network boundaries: UDN pod → UDN cluster router → management port → default network → OVN load balancer → dns-default pod. The OVN load balancer on the default network treats dns-default as a standard ClusterIP service and distributes traffic across all backend pods cluster-wide.

Setting internalTrafficPolicy: Local restricts the EndpointSlice to contain only the node-local dns-default backend. Since dns-default runs as a DaemonSet with a pod on every node, this is safe and guarantees that DNS queries are always handled by the local pod.

The existing trafficDistribution: PreferSameNode (added in PR #457) provides a soft hint for same-node preference but does not guarantee locality for cross-network UDN traffic. internalTrafficPolicy: Local provides the hard constraint needed.

Why DNS operator, not OVN?

The DNS query scatter from primary UDN pods is not caused by an OVN routing bug. In OCP 4.14+ (interconnect mode), each node has its own zone and its own UDN cluster router with a single route — no ECMP happens at the UDN router level. The scatter occurs downstream: once UDN traffic crosses into the default network via the management port, the OVN load balancer distributes dns-default traffic across all backend pods because the service has internalTrafficPolicy: Cluster (the default). The fix belongs in the DNS operator, which owns the dns-default service spec.

Verification

  • make test — passed
  • make verify — passed
  • On-cluster validation on OCP 4.20 (UPI, 5 nodes) and OCP 4.21 (IPI, 5 nodes):
    • Baseline (ITP=Cluster): DNS queries from UDN pods confirmed scattering across multiple nodes
    • Fix (ITP=Local): OVN LB on each node confirmed to have only the local dns-default pod as backend
    • DNS resolution from UDN pods works correctly with ITP=Local

Jira: https://issues.redhat.com/browse/OCPBUGS-55179

Summary by CodeRabbit

  • Chores
    • DNS service updated to prefer routing pod-to-pod traffic on the local node (internal traffic policy set to Local), reducing cross-node network hops.
  • Tests
    • Added unit coverage to verify the DNS service internal traffic policy is set to Local.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 5, 2026
@openshift-ci-robot
Contributor

@pkhedeka: This pull request references Jira Issue OCPBUGS-55179, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 5, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8d6473b0-e0b4-4099-ad9f-3ad1a08ab767

📥 Commits

Reviewing files that changed from the base of the PR and between a52cbbf and f66ff49.

📒 Files selected for processing (1)
  • pkg/operator/controller/controller_dns_service_test.go

Walkthrough

Adds spec.internalTrafficPolicy: Local to the DNS Service manifest and a unit test asserting that desiredDNSService(...) produces a Service with Spec.InternalTrafficPolicy set to Local.

Changes

DNS Service internalTrafficPolicy

  • Manifest (pkg/manifests/assets/dns/service.yaml): adds spec.internalTrafficPolicy: Local to the DNS Service manifest.
  • Tests (pkg/operator/controller/controller_dns_service_test.go): new test TestDesiredDNSServiceInternalTrafficPolicy calls desiredDNSService(...) and asserts svc.Spec.InternalTrafficPolicy is non-nil and equals corev1.ServiceInternalTrafficPolicyLocal.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the title directly and specifically summarizes the main change (adding internalTrafficPolicy: Local to the dns-default service) and matches the stated PR objectives.
  • Linked Issues check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: check skipped because no linked issues were found for this pull request.


@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 5, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alebedev87 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

Hi @pkhedeka. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pkhedeka pkhedeka marked this pull request as ready for review May 5, 2026 07:41
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci openshift-ci Bot requested review from alebedev87 and rfredette May 5, 2026 07:42
@pkhedeka
Author

pkhedeka commented May 5, 2026

/jira refresh

@openshift-ci-robot
Contributor

@pkhedeka: This pull request references Jira Issue OCPBUGS-55179, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@pkhedeka
Author

pkhedeka commented May 5, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 5, 2026
@openshift-ci-robot
Contributor

@pkhedeka: This pull request references Jira Issue OCPBUGS-55179, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter


@openshift-ci openshift-ci Bot requested a review from anuragthehatter May 5, 2026 08:07
DNS queries from pods on primary user-defined networks (UDNs) scatter
randomly across dns-default pods on all nodes instead of being handled
by the local node's dns-default pod.

UDN pods reach the dns-default service (172.30.0.10) through a path
that crosses network boundaries: UDN pod -> UDN cluster router ->
management port -> default network -> OVN load balancer -> dns-default
pod. The OVN load balancer on the default network treats dns-default
as a standard ClusterIP service and distributes traffic across all
backend pods cluster-wide.

Setting internalTrafficPolicy to Local restricts the EndpointSlice to
contain only the node-local dns-default backend. Since dns-default
runs as a DaemonSet with a pod on every node, this is safe and
guarantees that DNS queries are always handled by the local pod.

The existing trafficDistribution: PreferSameNode (added in PR openshift#457)
provides a soft hint for same-node preference but does not guarantee
locality for cross-network UDN traffic. internalTrafficPolicy: Local
provides the hard constraint needed.

Jira: https://issues.redhat.com/browse/OCPBUGS-55179
Signed-off-by: Parikshit Khedekar <pkhedeka@redhat.com>
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 5, 2026
@openshift-ci-robot
Contributor

@pkhedeka: This pull request references Jira Issue OCPBUGS-55179, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@Miciah
Contributor

Miciah commented May 5, 2026

Setting internalTrafficPolicy: Local restricts the EndpointSlice to contain only the node-local dns-default backend. Since dns-default runs as a DaemonSet with a pod on every node, this is safe and guarantees that DNS queries are always handled by the local pod.

This is a false assumption. There are situations in which the dns-default pods do not necessarily run on every node, for example if a node is tainted, if dnses.operator.openshift.io/cluster specifies spec.nodePlacement with a custom selector, or if a node is shutting down during a cluster upgrade. We really need the behavior of prefer-local with fallback. That's why we set trafficDistribution: PreferSameNode and why ovn-kubernetes has special prefer-local-with-fallback behavior for the dns-default pods: https://github.com/openshift/ovn-kubernetes/blob/952886fd8af2ca3ecf1717a2cb69311a32f25c06/go-controller/pkg/ovn/controller/services/lb_config.go#L258-L264

spec:
  # clusterIP will be automatically managed.
  # selector is set at runtime
  internalTrafficPolicy: Local
Contributor


You would also need to update daemonsetConfigChanged in pkg/operator/controller/controller_dns_daemonset.go to ensure an existing daemon set gets updated.

Author


I'm a bit confused here; I assume this change will help me keep internalTrafficPolicy set to Local.

Comment thread pkg/operator/controller/controller_dns_service_test.go Outdated
Comment on lines +206 to +210
if !assert.NotNil(t, svc.Spec.InternalTrafficPolicy, "expected InternalTrafficPolicy to be set") {
return
}
assert.Equal(t, corev1.ServiceInternalTrafficPolicyLocal, *svc.Spec.InternalTrafficPolicy,
"dns-default service must use InternalTrafficPolicy Local to ensure node-local DNS resolution")
Contributor


Is there a reason to define a new test rather than putting the assertions in TestDesiredDNSDaemonset?

Author

@pkhedeka pkhedeka May 5, 2026


I didn't see an existing service test and likely missed it, so I added a new one instead. I can try moving the assertions into TestDesiredDNSDaemonset, though given the suggestions you've shared above, I feel this change will need more hands than mine.

@pkhedeka
Author

pkhedeka commented May 5, 2026

Setting internalTrafficPolicy: Local restricts the EndpointSlice to contain only the node-local dns-default backend. Since dns-default runs as a DaemonSet with a pod on every node, this is safe and guarantees that DNS queries are always handled by the local pod.

This is a false assumption. There are situations in which the dns-default pods do not necessarily run on every node, for example if a node is tainted, if dnses.operator.openshift.io/cluster specifies spec.nodePlacement with a custom selector, or if a node is shutting down during a cluster upgrade. We really need the behavior of prefer-local with fallback. That's why we set trafficDistribution: PreferSameNode and why ovn-kubernetes has special prefer-local-with-fallback behavior for the dns-default pods: https://github.com/openshift/ovn-kubernetes/blob/952886fd8af2ca3ecf1717a2cb69311a32f25c06/go-controller/pkg/ovn/controller/services/lb_config.go#L258-L264

Thank you for the suggestions; I agree that without fallback, tainted nodes, upgrades, and node replacement would all turn into issues. Initially I considered an OVN-Kubernetes-based check/fix but couldn't reach a firm conclusion there. Given that trafficDistribution: PreferSameNode is what is really needed, I now feel I shouldn't be looking at the DNS operator, and would like to hear your suggestion. In any case, this centers on a UDN traffic issue, so maybe I can learn something more from your pointers.

  • I read PR #457 and understood that OVN-Kubernetes would add support for prefer-local behavior, with the current ask being an interim measure. So perhaps this isn't the moment to be making the change here. :/

pkhedeka and others added 2 commits May 5, 2026 23:04
Co-authored-by: Miciah Dashiel Butler Masters <mmasters@redhat.com>
Co-authored-by: Miciah Dashiel Butler Masters <mmasters@redhat.com>

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.
