
feat(acm): add ACM certificate management feature#4554

Open
the-technat wants to merge 30 commits into kubernetes-sigs:main from the-technat:issue/2509

Conversation

@the-technat
Contributor

@the-technat the-technat commented Jan 27, 2026

Issue

Closes #2509

Description

This PR adds support for automatically provisioning ACM certificates based on ingress annotations.

An initial design idea is described in the issue: #2509 (comment).

How it works: see the prepared user-docs for a description of the feature.

I focused on ingress objects in this feature to keep the change "small". But I think most of the Synthesizer / Manager could be reused to implement the same thing for Gateway API, maybe as an alternative approach for #4494 (skipping cert-manager and directly managing the cert).

Internals

Some internals not mentioned in the user-facing docs:

  • because there is a delay between requesting a certificate and its issuance, the controller waits a (currently non-user-configurable) time for the certificate to become issued. This was implemented similarly to the wait behavior of the ELBv2 service
    • if the timeout hits before the certificate is issued, the reconciliation fails with an error and is retried according to the controller's standard retry mechanism
    • a subsequent reconciliation attempt will discover the already requested certificate and wait another round for it to become issued. If the already requested certificate has exceeded a certain age (currently 5 minutes), it is recreated instead
  • certificate state tracking is based on the Tagging Manager interface that is also used for other resources, so orphaned certificates are cleaned up as part of the PostSynthesize phase. Because there is a delay between switching certificates on a listener and the listener actually releasing the previously used certificate, the deletion of an orphaned certificate is always retried a couple of times if it's still in use.
  • certificate validation (only required for Amazon-issued certificates) is only supported using the DNS method with Route53. The validation records are created automatically and pruned when a certificate is deleted. Because multiple certificates can use the same CNAME record, we ignore "not found" and "other value" errors: another certificate not managed by the controller that has the same domains / CA will use the exact same CNAME record for validation, just with another value
  • ingress objects that are modified and whose set of hosts changes will trigger a new certificate to be requested with more Subject Alternative Names
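The wait / recreate behavior described in the first bullet can be sketched as a small decision function. This is only an illustration of the rules stated above, not the code in this PR: the names certState, action, decide and maxPendingAge are hypothetical, and the only value taken from the description is the 5-minute age limit.

```go
package main

import (
	"fmt"
	"time"
)

// maxPendingAge mirrors the "currently 5 minutes" age limit from the description.
const maxPendingAge = 5 * time.Minute

// action is a hypothetical name for what a reconciliation does next.
type action string

const (
	actionRequest  action = "request a new certificate"
	actionUse      action = "use the issued certificate"
	actionWait     action = "wait another round for issuance"
	actionRecreate action = "recreate the certificate"
)

// certState is a hypothetical view of a certificate found by discovery.
type certState struct {
	Issued bool
	Age    time.Duration // time since the certificate was requested
}

// decide applies the rules from the list above:
// no certificate -> request one; issued -> use it;
// pending and older than maxPendingAge -> recreate; otherwise wait.
func decide(existing *certState) action {
	switch {
	case existing == nil:
		return actionRequest
	case existing.Issued:
		return actionUse
	case existing.Age > maxPendingAge:
		return actionRecreate
	default:
		return actionWait
	}
}

func main() {
	fmt.Println(decide(nil))
	fmt.Println(decide(&certState{Age: 2 * time.Minute}))
	fmt.Println(decide(&certState{Age: 7 * time.Minute}))
	fmt.Println(decide(&certState{Issued: true, Age: time.Hour}))
}
```

Keeping the decision separate from the AWS calls makes the age threshold easy to unit test without touching ACM.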

Tests

I added unit tests where possible and feasible.

All cases have been manually tested multiple times in a real environment using EKS, PCA, ACM and an intermediate build of the AWS Load Balancer Controller.

API rate limiting

To visualize the impact of this feature on API requests I tried collecting the number of API requests that occur per reconciliation attempt of one ingress object.

I took the rates from here: API rate quotas and identified the following ones as used by my feature: RequestCertificate, ListTagsForCertificate, ListCertificates, DeleteCertificate and DescribeCertificate. All of those operations have a rate-limit of either 5 or 10 queries per second.

Here's how often they are called:

  • ListTagsForCertificate & ListCertificates: one request every minute to rebuild the in-memory cache in the Certificate Discovery part & ACM Tagging Manager
  • RequestCertificate: at most once per reconciliation. If an existing matching certificate is found, we use it and wait for its issuance rather than request another certificate.
  • DescribeCertificate: is used in 3 different places:
    • DNS validation records: after requesting a certificate we wait for it to provide the DNS record values. This uses a retry mechanism with a request every 5s up to a 30s timeout -> at most 6 requests with a 5s delay in-between
    • certificate waiter: waits for the certificate to be issued. It has a timeout of 5m and a minDelay between requests of 60s, increasing exponentially up to a maxDelay of 120s
    • Deleting Certificates: to obtain the validation record values we need to clean up, one request is made per certificate deletion attempt
  • DeleteCertificate: has a retry mechanism that retries the deletion if a request fails, with a retry wait interval of 5s and a total timeout of 30s -> per certificate there are at most 6 requests with a 5s delay in-between
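To make the "at most 6 requests" figure concrete, here is a minimal sketch of a fixed-interval retry loop with a total timeout. It is not the PR's implementation: retryFixed, noSleep and errStillInUse are hypothetical names, the sleep function is injected so the example runs instantly, and the loop counts only the wait intervals towards the timeout (it assumes the calls themselves take negligible time).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Values taken from the description: 5s between attempts, 30s in total.
const (
	retryInterval = 5 * time.Second
	retryTimeout  = 30 * time.Second
)

// errStillInUse is a hypothetical stand-in for a failed DeleteCertificate call.
var errStillInUse = errors.New("certificate is still in use")

// noSleep replaces time.Sleep so the example finishes immediately.
func noSleep(time.Duration) {}

// retryFixed calls op until it succeeds or the accumulated wait time
// reaches timeout. It returns the number of attempts that were made.
func retryFixed(interval, timeout time.Duration, sleep func(time.Duration), op func() error) (int, error) {
	if interval <= 0 {
		return 0, errors.New("interval must be positive")
	}
	var waited time.Duration
	attempts := 0
	for {
		attempts++
		err := op()
		if err == nil {
			return attempts, nil
		}
		waited += interval
		if waited >= timeout {
			return attempts, fmt.Errorf("gave up after %d attempts: %w", attempts, err)
		}
		sleep(interval)
	}
}

func main() {
	// Worst case: every attempt fails, so the loop runs until the timeout.
	attempts, err := retryFixed(retryInterval, retryTimeout, noSleep, func() error {
		return errStillInUse
	})
	fmt.Println("attempts:", attempts)
	fmt.Println("error:", err)
}
```

With the values from the list (5s interval, 30s timeout) the worst case is 6 attempts, matching the bound given for the DNS validation lookup and for DeleteCertificate.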

In addition to the ACM API rate quotas the ones for Route53 might also be relevant.

These rates are taken from here: Route53 API request limits.

Here's how often they are called:

  • ListHostedZones: one request every 5 minutes to rebuild the in-memory cache in the Route53 service
  • ChangeResourceRecordSets: one request per certificate issue request and one request per certificate deletion attempt
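The "one request every 5 minutes" figure for ListHostedZones follows from caching the result in memory. A minimal sketch of such a time-based cache is shown below. It assumes a lazy refresh on read, which may differ from how the PR rebuilds its cache; zoneCache, newZoneCache, fakeClock and the zone ID are hypothetical, and the load function merely stands in for the real API call.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// zoneCacheTTL mirrors the 5-minute rebuild interval from the description.
const zoneCacheTTL = 5 * time.Minute

// fakeClock is a manually advanced clock used to demonstrate expiry.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// zoneCache is a hypothetical in-memory cache of hosted zone IDs.
type zoneCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	now      func() time.Time
	load     func() ([]string, error) // stands in for a ListHostedZones call
	zones    []string
	loadedAt time.Time
	loaded   bool
}

func newZoneCache(ttl time.Duration, now func() time.Time, load func() ([]string, error)) *zoneCache {
	return &zoneCache{ttl: ttl, now: now, load: load}
}

// Zones returns the cached zones and reloads them only when the cache
// is empty or older than the TTL.
func (c *zoneCache) Zones() ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.loaded && c.now().Sub(c.loadedAt) < c.ttl {
		return c.zones, nil
	}
	zones, err := c.load()
	if err != nil {
		return nil, err
	}
	c.zones, c.loadedAt, c.loaded = zones, c.now(), true
	return zones, nil
}

func main() {
	clk := &fakeClock{}
	calls := 0
	cache := newZoneCache(zoneCacheTTL, clk.Now, func() ([]string, error) {
		calls++
		return []string{"Z123EXAMPLE"}, nil
	})
	_, _ = cache.Zones() // first read loads the zones
	_, _ = cache.Zones() // served from memory
	clk.Advance(zoneCacheTTL)
	_, _ = cache.Zones() // TTL elapsed, loads again
	fmt.Println("loader calls:", calls)
}
```

However many ingress objects are reconciled within one TTL window, the loader runs only once, which is what keeps the ListHostedZones rate independent of the number of reconciliations.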

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2026
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 27, 2026
@k8s-ci-robot
Contributor

Hi @the-technat. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 27, 2026
@the-technat the-technat force-pushed the issue/2509 branch 23 times, most recently from 1fbc015 to 8fd5564 Compare January 29, 2026 14:49

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Automatically provision ACM certificates and attach to ALB based on spec

4 participants