
ci: build container on Azure for main builds #2521

Draft
kajalj22 wants to merge 1 commit into main from kajalj/seed-azure-cache


@kajalj22
Contributor

@kajalj22 kajalj22 commented May 18, 2026

Summary

  • Adds a build-container-azure job to cicd-main.yml that builds and pushes the container image to nemoci.azurecr.io on every main branch build (including nightly)
  • Runs in parallel with the existing GCP build. A separate build is needed because GCP runners are ARM64 and Azure runners are x86_64, so an image built for one architecture can't be copied across and reused on the other
  • Keeps the Azure :main tag fresh so external contributor PR builds have a warm inline cache and can run CI:Lfast/CI:docs (which reuse the :main image)
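
As a rough illustration of the kind of job the summary describes, the sketch below builds and pushes the :main tag to nemoci.azurecr.io with an inline cache. It is not the workflow from this PR: the runner label, the secret names, the Dockerfile location (repository root), and the choice of the docker/login-action and docker/build-push-action actions are all assumptions made for the example.

```yaml
# Hypothetical sketch only; not the job added by this PR.
jobs:
  build-container-azure:
    # Build only for main (pushes and scheduled nightly runs), never for PR branches.
    if: github.ref == 'refs/heads/main'
    runs-on: azure-x86-64-runner  # hypothetical runner label
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: nemoci.azurecr.io
          username: ${{ secrets.AZURE_REGISTRY_USER }}      # hypothetical secret name
          password: ${{ secrets.AZURE_REGISTRY_PASSWORD }}  # hypothetical secret name
      - uses: docker/build-push-action@v6
        with:
          context: .  # assumes the Dockerfile sits at the repository root
          push: true
          tags: nemoci.azurecr.io/rl:main
          # Reuse layers from the previous :main image when possible.
          cache-from: type=registry,ref=nemoci.azurecr.io/rl:main
          # Embed cache metadata in the pushed image so PR builds can reuse it.
          cache-to: type=inline
```

Because the sketch declares no `needs:` dependency on the GCP build job, GitHub Actions would schedule the two jobs in parallel, which matches the behavior described above.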

Context

External contributor builds run on Azure x86_64 runners and use nemoci.azurecr.io/rl:main as an inline Docker cache source. Since main branch builds only ran on GCP ARM64 runners, the Azure :main tag was never updated, which caused two issues:

  1. Cache misses on Dockerfile changes: When the base image changed to cuda-dl-base:26.03-cuda13.2 in #2332 ("chore: Enable cuda-13 build"), BuildKit couldn't find cached layers and tried pulling from nvcr.io directly, hitting rate-limit 401 errors (e.g. in #2469, "feat: add AIME-2026 benchmark.")
  2. CI:Lfast / CI:docs broken for external contributors: These test levels skip the container build and reuse the :main image, which didn't exist on the Azure registry
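
One way to catch the second issue early is a guard step that fails fast when the :main image is missing or has the wrong architecture. The step below is a minimal sketch under assumptions, not part of this PR: it assumes the runner is already logged in to nemoci.azurecr.io and that pulling the full image is acceptable for the check.

```yaml
# Hypothetical guard step; assumes a prior login to nemoci.azurecr.io.
- name: Check that the Azure :main image exists and is x86_64
  env:
    IMAGE: nemoci.azurecr.io/rl:main
  run: |
    set -euo pipefail
    # Exits non-zero if the tag does not exist in the registry.
    docker manifest inspect "$IMAGE" > /dev/null
    # Pull the image and read the architecture it was built for.
    docker pull "$IMAGE"
    arch="$(docker image inspect --format '{{.Architecture}}' "$IMAGE")"
    if [ "$arch" != "amd64" ]; then
      echo "Expected amd64 image, found: $arch" >&2
      exit 1
    fi
```

Docker reports x86_64 images as `amd64`, so that is the value the step compares against.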

Test plan

  • Verified the Azure build job runs and completes successfully
  • After merge, verify build-container-azure runs on next nightly/main push
  • Re-run CI on PR #2469 ("feat: add AIME-2026 benchmark.") to confirm external contributor builds work
  • Verify CI:Lfast works for an external contributor PR

🤖 Generated with Claude Code

@kajalj22 kajalj22 requested a review from a team as a code owner May 18, 2026 16:06
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the CI label May 18, 2026
@kajalj22 kajalj22 marked this pull request as draft May 18, 2026 16:08
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kajalj22
Contributor Author

/ok to test 9e8e669

@kajalj22
Contributor Author

/ok to test a74912a

@kajalj22 kajalj22 marked this pull request as ready for review May 18, 2026 16:51
Comment thread .github/workflows/cicd-main.yml Outdated

      - name: Copy main image to Azure registry
        env:
          SRC: ${{ needs.org-member-pre-flight.outputs.registry }}/${{ vars.CI_CONTAINER_NAME }}:main
Contributor


What do you expect this to resolve to for an external contributor?

Add a parallel build-container-azure job that builds and pushes the
container image to the Azure registry (nemoci.azurecr.io) on every
main branch build. This runs alongside the existing GCP build since
the architectures differ (GCP=ARM64, Azure=x86_64) and images cannot
be copied across.

External contributor PR builds use Azure runners with inline cache
from nemoci.azurecr.io/rl:main. Without this job, the Azure :main
tag goes stale whenever the Dockerfile changes, causing cache misses
that can fail with nvcr.io rate-limit 401 errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kajal Jain <kajalj@nvidia.com>
@kajalj22 kajalj22 force-pushed the kajalj/seed-azure-cache branch from cc1099e to 2e6810a May 18, 2026 23:29
@kajalj22 kajalj22 changed the title from "ci: seed Azure container cache when Dockerfile changes" to "ci: build container on Azure for main builds" May 18, 2026
@kajalj22 kajalj22 marked this pull request as ready for review May 18, 2026 23:30
@kajalj22 kajalj22 marked this pull request as draft May 19, 2026 16:28

Labels

CI Relating to CI

2 participants