Skip to content

OCPBUGS-59176: fix several failing tests in custom-dns jobs#31129

Open
sadasu wants to merge 2 commits into
openshift:mainfrom
sadasu:bug-59176
Open

OCPBUGS-59176: fix several failing tests in custom-dns jobs#31129
sadasu wants to merge 2 commits into
openshift:mainfrom
sadasu:bug-59176

Conversation

@sadasu
Copy link
Copy Markdown
Contributor

@sadasu sadasu commented May 5, 2026

This is essentially @lihongan 's PR #30131 that has just been rebased.

Some e2e tests are failing with the job "gcp-custom-dns" for featuregate "GCPClusterHostedDNSInstall" which is promoted to GA in 4.20. In the "custom-dns" cluster OpenShift will start static CoreDNS pods to provide DNS resolution for API, Internal API and Ingress services that are essential for cluster creation. After cluster deployment is completed, the customer will update their external DNS solution with the same assigned LB IP addresses used for the configuration of the internal CoreDNS instance.

The failing tests like http2 and grpc tests use dedicated ingresscontrollers, and gateway also has separated LB and dnsrecord, so the default wildcard created by the new static CoreDNS won't work for those tests.

Make below changes to fix the failing tests:

  • Update makeHTTPClient() DialContext to target LoadBalancer IP address directly when DNS is unmanaged, fixing both http2 and gatewayapi httproute tests.
  • Update grpc Dial() to target LoadBalancer IP address directly when DNS is unmanaged, fixing the grpc test.
  • Extract getLoadBalancerAddress() as a reusable helper from assertGatewayLoadbalancerReady() for use across http2, grpc and gatewayapi tests.
  • Update gatewayapi assertDNSRecordStatus() to check the Published condition per-zone based on whether DNS is managed or unmanaged.
  • Add isDNSManaged() helper that checks the DNSManaged condition on the default ingresscontroller, defaulting to managed=true when the condition is absent.
  • Add getClusterBaseDomainName() and update http2/grpc/h2spec shard ingresscontroller to generate domain based on cluster baseDomain instead of ingress domain (e.g. "e2e-test-xxx.apps.baseDomain"), avoiding custom ingresscontroller DNS overlapping with default wildcard "*.apps.baseDomain".

Summary by CodeRabbit

  • Tests
    • Router connectivity tests now validate routes using hostnames or explicit load‑balancer addresses depending on DNS management.
    • DNS readiness checks poll per zone and require Published=true only for configured DNS zones.
    • HTTP/2 and gRPC tests support targeting explicit load‑balancer addresses when DNS is unmanaged.
    • Shard/route host resolution uses the cluster base domain.
    • HTTP/gRPC client behavior and load‑balancer address retrieval unified for consistent connectivity checks.

Rebased #30131

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels May 5, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@sadasu: This pull request references Jira Issue OCPBUGS-59176, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.20.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Some e2e tests are failing with the job "gcp-custom-dns" for featuregate "GCPClusterHostedDNSInstall" which is promoted to GA in 4.20. In the "custom-dns" cluster OpenShift will start static CoreDNS pods to provide DNS resolution for API, Internal API and Ingress services that are essential for cluster creation. After cluster deployment is completed, the customer will update their external DNS solution with the same assigned LB IP addresses used for the configuration of the internal CoreDNS instance.

The failing tests like http2 and grpc tests use dedicated ingresscontrollers, and gateway also has separated LB and dnsrecord, so the default wildcard created by the new static CoreDNS won't work for those tests.

Make below changes to fix the failing tests:

  • Update makeHTTPClient() DialContext to target LoadBalancer IP address directly when DNS is unmanaged, fixing both http2 and gatewayapi httproute tests.
  • Update grpc Dial() to target LoadBalancer IP address directly when DNS is unmanaged, fixing the grpc test.
  • Extract getLoadBalancerAddress() as a reusable helper from assertGatewayLoadbalancerReady() for use across http2, grpc and gatewayapi tests.
  • Update gatewayapi assertDNSRecordStatus() to check the Published condition per-zone based on whether DNS is managed or unmanaged.
  • Add isDNSManaged() helper that checks the DNSManaged condition on the default ingresscontroller, defaulting to managed=true when the condition is absent.
  • Add getClusterBaseDomainName() and update http2/grpc/h2spec shard ingresscontroller to generate domain based on cluster baseDomain instead of ingress domain (e.g. "e2e-test-xxx.apps.baseDomain"), avoiding custom ingresscontroller DNS overlapping with default wildcard "*.apps.baseDomain".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 5, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Walkthrough

Tests and helpers were changed to prefer the cluster base domain, detect whether cluster DNS is managed, resolve the gateway load‑balancer address only when DNS is unmanaged, and route HTTP/gRPC connections via DNS or by dialing the LB address accordingly.

Changes

DNS-aware routing, LB resolution, and client dialing

Layer / File(s) Summary
Domain & DNS Detection
test/extended/router/http2.go, test/extended/router/gatewayapicontroller.go
Adds getClusterBaseDomainName(...) and DNS-zone/managed detection helpers and derives DNS-managed state from configured DNS zones.
Load Balancer & Gateway readiness
test/extended/router/gatewayapicontroller.go, test/extended/router/gatewayapi_upgrade.go
Adds getLoadBalancerAddress(...), integrates LB address into assertGatewayLoadbalancerReady, and unconditionally invokes DNSRecord assertions that poll per DNS zone. GatewayAPI upgrade connectivity checks now gate only on loadBalancerSupported.
HTTP/gRPC dialing and connectivity checks
test/extended/router/gatewayapicontroller.go, test/extended/router/http2.go, test/extended/router/idle.go, test/extended/router/grpc-interop.go, test/extended/router/grpc-interop/clientconn.go
makeHTTPClient(..., lbAddress) and DialParams.Target added; HTTP connectivity now resolves hostnames via DNS when zones exist or uses LB address otherwise. gRPC Dial uses a context dialer to redirect TCP to Target:<original-port> when provided.
Domain updates in tests
test/extended/router/h2spec.go, test/extended/router/grpc-interop.go
Tests now build router shard FQDNs from the cluster base domain instead of the previous default ingress domain lookup.
DNSRecord verification
test/extended/router/gatewayapicontroller.go
assertDNSRecordStatus refactored to locate DNSRecords and poll across zones, requiring Published=True when DNS zones are configured and requiring Published!=True when they are not.
Minor cleanup
test/extended/router/gatewayapicontroller.go
Removed unused crypto/tls import and consolidated LB address extraction logic.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Router Test
    participant DNS as dnses.config/cluster
    participant Ingress as IngressController
    participant LB as Load Balancer Service
    participant Client as HTTP/gRPC Client
    participant Route as Router Endpoint

    Test->>DNS: getClusterBaseDomainName()
    DNS-->>Test: baseDomain

    Test->>Ingress: isDNSZoneConfigured() / isDNSManaged()
    Ingress-->>Test: DNS-managed status

    alt DNS unmanaged
        Test->>LB: getLoadBalancerAddress(serviceName)
        LB-->>Test: lbAddress (IP/hostname)
    else DNS managed
        Note over Test: lbAddress = ""
    end

    alt lbAddress provided
        Test->>Client: makeHTTPClient(..., lbAddress) / DialParams.Target=lbAddress
        Note over Client: context dialer connects to lbAddress:port
        Client->>LB: tcp connect to lbAddress:port
        LB-->>Client: connection
    else DNS resolution
        Client->>DNS: resolve route hostname
        DNS-->>Client: route IP
        Client->>Route: connect to route IP:port
        Route-->>Client: connection
    end

    Client->>Route: HTTP/gRPC request
    Route-->>Client: response
    Test->>Client: verify success
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Critical compilation error in grpc-interop.go: Line 199 uses undefined variable 'baseDomain' assigned at line 56 to 'defaultDomain'. Should call getClusterBaseDomainName() instead. Change line 56 from getDefaultIngressClusterDomainName() to getClusterBaseDomainName() and update variable name from defaultDomain to baseDomain.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly references the bug number (OCPBUGS-59176) and accurately describes the main objective: fixing failing tests in custom-dns jobs.
Docstring Coverage ✅ Passed Docstring coverage is 81.25% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test titles in modified router files use static, deterministic strings. No dynamic values found in It/Describe/Context/When declarations.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests are added. All changes modify existing tests in test/extended/router/. Check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo tests are being added in this PR. All changes are modifications to existing test files. The custom check only applies when "new Ginkgo e2e tests are added."
Topology-Aware Scheduling Compatibility ✅ Passed This PR modifies only test code in test/extended/router/. No deployment manifests, operators, or controllers are modified. No scheduling constraints or topology-dependent logic are introduced.
Ote Binary Stdout Contract ✅ Passed All changes are within Ginkgo test blocks with intercepted stdout. No process-level stdout writes detected. Logging uses e2e.Logf or safe string formatting only.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR modifies existing tests, not adding new ones. Modifications are IPv6-compatible with no hardcoded IPv4 addresses, proper net.JoinHostPort usage, and no external connectivity requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from gcs278 and rikatz May 5, 2026 14:55
@sadasu
Copy link
Copy Markdown
Contributor Author

sadasu commented May 5, 2026

This is basically the fix added via #30131 with rebase conflicts fixed.

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 5, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@sadasu: This pull request references Jira Issue OCPBUGS-59176, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

Details

In response to this:

Some e2e tests are failing with the job "gcp-custom-dns" for featuregate "GCPClusterHostedDNSInstall" which is promoted to GA in 4.20. In the "custom-dns" cluster OpenShift will start static CoreDNS pods to provide DNS resolution for API, Internal API and Ingress services that are essential for cluster creation. After cluster deployment is completed, the customer will update their external DNS solution with the same assigned LB IP addresses used for the configuration of the internal CoreDNS instance.

The failing tests like http2 and grpc tests use dedicated ingresscontrollers, and gateway also has separated LB and dnsrecord, so the default wildcard created by the new static CoreDNS won't work for those tests.

Make below changes to fix the failing tests:

  • Update makeHTTPClient() DialContext to target LoadBalancer IP address directly when DNS is unmanaged, fixing both http2 and gatewayapi httproute tests.
  • Update grpc Dial() to target LoadBalancer IP address directly when DNS is unmanaged, fixing the grpc test.
  • Extract getLoadBalancerAddress() as a reusable helper from assertGatewayLoadbalancerReady() for use across http2, grpc and gatewayapi tests.
  • Update gatewayapi assertDNSRecordStatus() to check the Published condition per-zone based on whether DNS is managed or unmanaged.
  • Add isDNSManaged() helper that checks the DNSManaged condition on the default ingresscontroller, defaulting to managed=true when the condition is absent.
  • Add getClusterBaseDomainName() and update http2/grpc/h2spec shard ingresscontroller to generate domain based on cluster baseDomain instead of ingress domain (e.g. "e2e-test-xxx.apps.baseDomain"), avoiding custom ingresscontroller DNS overlapping with default wildcard "*.apps.baseDomain".

Summary by CodeRabbit

  • Tests
  • Improved gateway and HTTP route validation testing to properly handle managed vs. unmanaged DNS scenarios.
  • Enhanced load balancer address resolution in router connectivity tests.
  • Refactored test infrastructure to use cluster base domain for more accurate domain resolution across different DNS configurations.
  • Better test coverage for gRPC and HTTP/2 routing scenarios with varied DNS management states.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sadasu
Copy link
Copy Markdown
Contributor Author

sadasu commented May 5, 2026

/jira refresh

@openshift-ci openshift-ci Bot requested a review from melvinjoseph86 May 5, 2026 14:56
@openshift-ci-robot
Copy link
Copy Markdown

@sadasu: This pull request references Jira Issue OCPBUGS-59176, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Around line 406-407: The test currently always runs assertHttpRouteConnection
which for unmanaged-DNS paths uses getLoadBalancerAddress and therefore requires
a LoadBalancer; guard these checks by first verifying the platform supports load
balancers (check loadBalancerSupported) and only call assertHttpRouteConnection
/ getLoadBalancerAddress when loadBalancerSupported is true (while preserving
the test coverage for unmanaged-DNS on platforms where an LB exists); apply the
same guard around the other occurrences referenced (the block around lines
1079-1097) so tests on platforms without LB support skip the connectivity check
instead of hanging.

In `@test/extended/router/http2.go`:
- Around line 644-671: The new isDNSManaged(oc *exutil.CLI, timeout
time.Duration) conflicts with the existing isDNSManaged(oc *exutil.CLI) in
package router; rename the new function (e.g., isDNSManagedWithTimeout or
isDNSManagedTimeout) and update all callers to the new name and signature:
change calls at http2.go:504, grpc-interop.go:333, and gatewayapicontroller.go
lines referenced (around 807 and 1080) to pass the timeout argument and call the
renamed function; ensure the function body and returned types remain the same
and only the identifier and call sites are adjusted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d894b75a-2375-4917-96ed-96cf3881c44f

📥 Commits

Reviewing files that changed from the base of the PR and between 954d227 and 867e786.

📒 Files selected for processing (6)
  • test/extended/router/gatewayapicontroller.go
  • test/extended/router/grpc-interop.go
  • test/extended/router/grpc-interop/clientconn.go
  • test/extended/router/h2spec.go
  • test/extended/router/http2.go
  • test/extended/router/idle.go

Comment thread test/extended/router/gatewayapicontroller.go Outdated
Comment thread test/extended/router/http2.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/router/gatewayapi_upgrade.go`:
- Around line 141-143: The test currently gates connectivity checks behind
t.managedDNS so the unmanaged-DNS code path in assertHttpRouteConnection never
runs; change the condition around the pre- and post-upgrade HTTP checks
(currently checking t.loadBalancerSupported && t.managedDNS) to always run when
t.loadBalancerSupported is true, and call assertHttpRouteConnection with the
appropriate gwServiceName for the unmanaged case (use
gatewayName+"-openshift-default" for managed and pass the gwServiceName argument
when t.managedDNS is false) so both managed and unmanaged DNS paths are
validated; update both occurrences near where assertHttpRouteConnection is
invoked (lines around the current calls that build
gatewayName+"-openshift-default" and the later call that should pass
gwServiceName) accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5038e62e-a26c-4934-bb58-fc38fb07f58e

📥 Commits

Reviewing files that changed from the base of the PR and between 867e786 and 5b8c8e3.

📒 Files selected for processing (7)
  • test/extended/router/gatewayapi_upgrade.go
  • test/extended/router/gatewayapicontroller.go
  • test/extended/router/grpc-interop.go
  • test/extended/router/grpc-interop/clientconn.go
  • test/extended/router/h2spec.go
  • test/extended/router/http2.go
  • test/extended/router/idle.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended/router/h2spec.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/extended/router/grpc-interop/clientconn.go
  • test/extended/router/gatewayapicontroller.go
  • test/extended/router/grpc-interop.go

Comment thread test/extended/router/gatewayapi_upgrade.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test/extended/router/gatewayapicontroller.go (1)

406-406: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip the unmanaged-DNS connectivity path when the platform has no LoadBalancer support.

This now falls back to getLoadBalancerAddress(...) for every unmanaged-DNS run. On VSphere/BareMetal/Equinix, loadBalancerSupported is false, so the test waits for a LoadBalancer ingress that will never appear instead of skipping cleanly.

🛠️ Minimal guard
-		assertHttpRouteConnection(oc, gw+"-openshift-default", defaultRoutename)
+		if managedDNS || loadBalancerSupported {
+			assertHttpRouteConnection(oc, gw+"-openshift-default", defaultRoutename)
+		} else {
+			g.Skip("Skipping HTTPRoute connectivity check: unmanaged DNS and no LoadBalancer support")
+		}

Also applies to: 1072-1090

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/router/gatewayapicontroller.go` at line 406, The test calls
assertHttpRouteConnection which triggers getLoadBalancerAddress for the
unmanaged-DNS path and blocks on a LoadBalancer ingress even on platforms
without LoadBalancer support; update the test to check the platform capability
(the loadBalancerSupported flag or equivalent) before invoking the unmanaged-DNS
connectivity path and skip calling assertHttpRouteConnection /
getLoadBalancerAddress when loadBalancerSupported is false (also apply the same
guard to the other occurrences around lines 1072-1090) so the test cleanly skips
the LoadBalancer-dependent checks on VSphere/BareMetal/Equinix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Line 406: The test calls assertHttpRouteConnection which triggers
getLoadBalancerAddress for the unmanaged-DNS path and blocks on a LoadBalancer
ingress even on platforms without LoadBalancer support; update the test to check
the platform capability (the loadBalancerSupported flag or equivalent) before
invoking the unmanaged-DNS connectivity path and skip calling
assertHttpRouteConnection / getLoadBalancerAddress when loadBalancerSupported is
false (also apply the same guard to the other occurrences around lines
1072-1090) so the test cleanly skips the LoadBalancer-dependent checks on
VSphere/BareMetal/Equinix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 74cc852c-6939-4f81-8656-465de3a70ce0

📥 Commits

Reviewing files that changed from the base of the PR and between 5b8c8e3 and a36edae.

📒 Files selected for processing (1)
  • test/extended/router/gatewayapicontroller.go

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Some e2e tests are failing with the job "gcp-custom-dns" for featuregate
"GCPClusterHostedDNSInstall" which is promoted to GA in 4.20. In the
"custom-dns" cluster OpenShift will start static CoreDNS pods to provide
DNS resolution for API, Internal API and Ingress services that are
essential for cluster creation. After cluster deployment is completed,
the customer will update their external DNS solution with the same
assigned LB IP addresses used for the configuration of the internal
CoreDNS instance.

The failing tests like http2 and grpc tests use dedicated
ingresscontrollers, and gateway also has separated LB and dnsrecord, so
the default wildcard created by the new static CoreDNS won't work for
those tests.

Make below changes to fix the failing tests:

- Update makeHTTPClient() DialContext to target LoadBalancer IP address
  directly when DNS is unmanaged, fixing both http2 and gatewayapi
  httproute tests.
- Update grpc Dial() to target LoadBalancer IP address directly when DNS
  is unmanaged, fixing the grpc test.
- Extract getLoadBalancerAddress() as a reusable helper from
  assertGatewayLoadbalancerReady() for use across http2, grpc and
  gatewayapi tests.
- Update gatewayapi assertDNSRecordStatus() to check the Published
  condition per-zone based on whether DNS is managed or unmanaged.
- Add isDNSManaged() helper that checks the DNSManaged condition on the
  default ingresscontroller, defaulting to managed=true when the
  condition is absent.
- Add getClusterBaseDomainName() and update http2/grpc/h2spec shard
  ingresscontroller to generate domain based on cluster baseDomain
  instead of ingress domain (e.g. "e2e-test-xxx.apps.baseDomain"),
  avoiding custom ingresscontroller DNS overlapping with default
  wildcard "*.apps.baseDomain".
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@sadasu
Copy link
Copy Markdown
Contributor Author

sadasu commented May 7, 2026

/test e2e-metal-ipi-ovn-ipv6

@bentito
Copy link
Copy Markdown
Contributor

bentito commented May 11, 2026

/assign @rhamini3 @gcs278

@rhamini3
Copy link
Copy Markdown
Contributor

/test e2e-gcp-custom-dns

@rhamini3
Copy link
Copy Markdown
Contributor

/test e2e-gcp-ovn-techpreview

Copy link
Copy Markdown
Contributor

@gcs278 gcs278 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! hope this helps. If possible, I'd like to trim down some of the changes that aren't truly necessary. But feel free to reach out if you have any questions.

func assertDNSRecordStatus(oc *exutil.CLI, gatewayName string) {
// find the DNS Record and confirm its zone status is True
err := wait.PollUntilContextTimeout(context.Background(), 2*time.Second, 10*time.Minute, false, func(context context.Context) (bool, error) {
dnsManaged, err := isDNSManaged(oc, time.Minute)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are mixing concepts for DNSManagement here.

Previously isDNSManaged checked whether the zones existed in the dnsConfig - it was probably a mistake to call it isDNSManaged - should have been isDNSSupported or isDNSZoneConfigured more generically.

With this PR, isDNSManaged is checking the default IngressController for the DNSManaged condition being set to False.

The problem is that Gateway API has its own distinct API, configuration, and reconciliation logic, separate from IngressControllers. So, though it may work by coincidence, it feels a bit fragile and might bite us on future test updates.

Consider a scenario where another test has modified the default IngressController while the cluster has DNS zones configured (PublicZone/PrivateZone are set). For example:

  • Changed the endpoint publishing strategy to HostNetwork or NodePort
  • Set spec.endpointPublishingStrategy.loadBalancer.dnsManagementPolicy: Unmanaged

In either case, the default IngressController's DNSManaged condition would be False. However, Gateway API would still be creating and publishing DNSRecords successfully, since zones exist and the domain matches. The test would incorrectly skip DNS validation or bypass DNS resolution by dialing the LB IP directly.

My suggestion: For the Gateway API specific tests - I think we should keep the same logic from the old isDNSManaged (check dns.config.openshift.io/cluster PublicZone/PrivateZone), but rename it to isDNSZoneConfigured to avoid naming confusion over the DNSManaged IngressController condition.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that definitely makes sense.

Comment on lines 370 to 372
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

managedDNS was pushed down a level into assertDNSRecordStatus but we are still guarding all of the calls to it - see comments in assertDNSRecordStatus for more details on whether to leave or remove.

Comment on lines +825 to +828
if len(gatewayDNSRecord.Status.Zones) == 0 {
e2e.Logf("DNS record %q has no zones yet, retrying...", gatewayDNSRecord.Name)
return false, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are now enabling assertDNSRecordStatus to check DNSRecords for unmanaged clusters in this PR - however, this logic will never pass because (see comment above) Gateway API will be unmanaged only when there are no zones. So this will get stuck.

I think we have two options:

  1. Keep it simple - don't call assertDNSRecordStatus for unsupported DNS clusters (and remove all changes to assertDNSRecordStatus)
  2. Or, you need to check dnsManaged here, and return early - there are no zones to check

Gateway API DNS logic is simpler than IngressController: the DNSRecord is always created, and whether it actually publishes depends only on whether zones exist and the domain matches baseDomain. There's no per-Gateway opt-out mechanism like the IngressController's dnsManagementPolicy or endpointPublishingStrategy fields.

Transport: &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
},
func assertHttpRouteConnection(oc *exutil.CLI, gwServiceName, hostname string, loadBalancerSupported bool) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit we guard against calling this function by checking loadBalancerSupported, but then we also now pass it in - is there a reason to do both?

The branch for e2e.Failf("Platform does not support load balancers and DNS is unmanaged - cannot verify HTTP route connectivity") will never get reached. I understand it's a safe guard, but in practice our code is getting a bit too branchy - and simpler is better IMO.

I think you should do one or the other. If you pass it in - rather than fail just return early with a log saying "skipping this...because..." or just remove it entirely - we already guard against calling it. I'd say just remove it as a new function parameter and keep this logic simple.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Fixing that.

Comment thread test/extended/router/grpc-interop.go Outdated

defaultDomain, err := getDefaultIngressClusterDomainName(oc, time.Minute)
o.Expect(err).NotTo(o.HaveOccurred(), "failed to find default domain name")
baseDomain, err := getClusterBaseDomainName(oc, time.Minute)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite see the point of changing these domains - I get that using defaultDomain as the parent domain causes DNS to point to the default IngressController (due to the CoreDNS Internal Static DNS resolver).

However, we can't use the CoreDNS Internal resolves anyways - we bypass it when we detect isDNSManaged below.

So I think this change adds complexity without solving a real problem.

}
}

func isDNSManaged(oc *exutil.CLI, timeout time.Duration) (bool, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add a function comment on how this works since it's not obvious (how do we determine what "isDNSManaged").

}
if !dnsManaged && condition.Status == "True" {
e2e.Logf("DNS record %q zone %v is unexpectedly published for unmanaged DNS, retrying...", gatewayDNSRecord.Name, zone.DNSZone)
return false, nil
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not knowing the Gateway API code, would you be OK ignoring this condition and recording as a debug/info message instead of reporting this as an error?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I'm following: this code can never be reached in any Gateway API test scenario today. If DNS isn't configured, there are no zones to check against. Gateway API DNS is strictly on or off, there's no partial state.

I think I understand some of the confusion - when #30131 was written before I added isDNSManaged in #30946. The rebase doesn't really make sense like it did before.

For DNS concerns, the previously merged #30946 already handles the unmanaged DNS case for Gateway API tests. The changes to assertDNSRecordStatus in this PR aren't needed. Consider reverting all DNS-related changes to the Gateway API tests, except for:

  • The isDNSManaged rename (good idea)
  • The load balancer address changes that you need in assertHttpRouteConnection

Let me know if I'm missing something, but I'd rather not change anything unnecessarily.

In Gateway API tests, we determine if the DNS is managed if zones
are configured in the cluster's DNS config resource.
In other tests, it is determined by looking at the
DNSManagedIngressConditionType of the default IngressController.
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sadasu
Once this PR has been reviewed and has the lgtm label, please ask for approval from gcs278. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
test/extended/router/gatewayapicontroller.go (1)

827-839: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle zero-zone DNSRecord as success when DNS zones are not configured.

In unmanaged/custom-dns clusters, DNSRecord.Status.Zones can remain empty. The current retry-on-empty logic causes avoidable timeout failures.

💡 Suggested fix
-		gatewayDNSRecord := &operatoringressv1.DNSRecord{}
+		gatewayDNSRecord := &operatoringressv1.DNSRecord{}
+		found := false
 		gatewayDNSRecords, err := oc.AdminIngressClient().IngressV1().DNSRecords(ingressNamespace).List(context, metav1.ListOptions{})
 		if err != nil {
 			e2e.Logf("Failed to list DNS records for gateway %q: %v, retrying...", gatewayName, err)
 			return false, nil
 		}
@@
 		for _, record := range gatewayDNSRecords.Items {
 			if record.Labels["gateway.networking.k8s.io/gateway-name"] == gatewayName {
 				gatewayDNSRecord = &record
+				found = true
 				e2e.Logf("Found the desired dnsrecord and spec is: %v", gatewayDNSRecord.Spec)
 				break
 			}
 		}
+		if !found {
+			e2e.Logf("DNS record for gateway %q not found yet, retrying...", gatewayName)
+			return false, nil
+		}
 
 		if len(gatewayDNSRecord.Status.Zones) == 0 {
+			if !dnsManaged {
+				e2e.Logf("DNS zones are not configured; DNSRecord %q has no zones as expected", gatewayDNSRecord.Name)
+				return true, nil
+			}
 			e2e.Logf("DNS record %q has no zones yet, retrying...", gatewayDNSRecord.Name)
 			return false, nil
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/router/gatewayapicontroller.go` around lines 827 - 839, The
test currently treats a DNSRecord with zero Status.Zones as a failure and
retries; update the logic in the block that inspects gatewayDNSRecord (the loop
over gatewayDNSRecords.Items and the subsequent check of
gatewayDNSRecord.Status.Zones) to first ensure gatewayDNSRecord is non-nil, and
then treat an empty Zones slice as acceptable for unmanaged/custom-dns clusters
by logging that no zones are configured and returning success (true, nil)
instead of retrying; keep existing behavior for nil gatewayDNSRecord (continue
searching) and only return success when the record exists but has zero zones.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/router/grpc-interop.go`:
- Line 199: The code uses an undefined variable baseDomain when building
shardFQDN; replace baseDomain with the existing defaultDomain variable (the one
assigned earlier) so shardFQDN := oc.Namespace() + "." + defaultDomain (i.e.,
use defaultDomain instead of baseDomain) to fix the compile error involving
shardFQDN, oc.Namespace(), baseDomain and defaultDomain.

---

Duplicate comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Around line 827-839: The test currently treats a DNSRecord with zero
Status.Zones as a failure and retries; update the logic in the block that
inspects gatewayDNSRecord (the loop over gatewayDNSRecords.Items and the
subsequent check of gatewayDNSRecord.Status.Zones) to first ensure
gatewayDNSRecord is non-nil, and then treat an empty Zones slice as acceptable
for unmanaged/custom-dns clusters by logging that no zones are configured and
returning success (true, nil) instead of retrying; keep existing behavior for
nil gatewayDNSRecord (continue searching) and only return success when the
record exists but has zero zones.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1b81a94c-c3d4-4302-96f9-36e4ff8dcaff

📥 Commits

Reviewing files that changed from the base of the PR and between a36edae and edfb166.

📒 Files selected for processing (7)
  • test/extended/router/gatewayapi_upgrade.go
  • test/extended/router/gatewayapicontroller.go
  • test/extended/router/grpc-interop.go
  • test/extended/router/grpc-interop/clientconn.go
  • test/extended/router/h2spec.go
  • test/extended/router/http2.go
  • test/extended/router/idle.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/extended/router/h2spec.go
  • test/extended/router/idle.go
  • test/extended/router/grpc-interop/clientconn.go

o.Expect(err).NotTo(o.HaveOccurred())

shardFQDN := oc.Namespace() + "." + defaultDomain
shardFQDN := oc.Namespace() + "." + baseDomain
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify symbol usage in this file/package.
rg -n --type=go '\bdefaultDomain\b' test/extended/router/grpc-interop.go
rg -n --type=go '\bbaseDomain\b' test/extended/router/grpc-interop.go
rg -n --type=go '\bgetClusterBaseDomainName\s*\(' test/extended/router

Repository: openshift/origin

Length of output: 511


Fix undefined baseDomain variable to resolve compile-time failure.

Line 199 uses baseDomain, but line 56 assigns to defaultDomain instead. This leaves baseDomain undefined and causes a compilation error.

Suggested fix
-			defaultDomain, err := getDefaultIngressClusterDomainName(oc, time.Minute)
-			o.Expect(err).NotTo(o.HaveOccurred(), "failed to find default domain name")
+			baseDomain, err := getClusterBaseDomainName(oc, time.Minute)
+			o.Expect(err).NotTo(o.HaveOccurred(), "failed to find cluster base domain name")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/router/grpc-interop.go` at line 199, The code uses an undefined
variable baseDomain when building shardFQDN; replace baseDomain with the
existing defaultDomain variable (the one assigned earlier) so shardFQDN :=
oc.Namespace() + "." + defaultDomain (i.e., use defaultDomain instead of
baseDomain) to fix the compile error involving shardFQDN, oc.Namespace(),
baseDomain and defaultDomain.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 14, 2026

@sadasu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/verify edfb166 link true /test verify
ci/prow/unit edfb166 link true /test unit
ci/prow/okd-scos-images edfb166 link true /test okd-scos-images
ci/prow/images edfb166 link true /test images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants