ModelBuilder.deploy_local() does not authenticate to ECR before pulling DLC images (v3 regression) #5925

@manuwaik

Summary

In v3, ModelBuilder(mode=Mode.LOCAL_CONTAINER).deploy_local() calls docker.client.images.pull(image) directly, with no auth handshake. When the target image lives in an AWS Deep Learning Containers (DLC) ECR account (763104351884.dkr.ecr.<region>.amazonaws.com/...), which covers every SageMaker-provided container, the Docker daemon has no credentials, the pull fails, and the subsequent inspect_image returns 404, surfacing as:

ValueError: Could not find image '763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:24.09-py3' in repository

This is a regression from v2 LocalSession/local.image, which logged Docker in to ECR (via its _ecr_login_if_needed helper) before pulling.

Reproduction

from sagemaker.serve import ModelBuilder, Mode, ModelServer

builder = ModelBuilder(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:24.09-py3",
    s3_model_data_url="s3://.../model.tar.gz",
    role_arn=role,  # role: an IAM execution role ARN, defined elsewhere
    model_server=ModelServer.TRITON,
    mode=Mode.LOCAL_CONTAINER,
)
builder.build(model_name="x")
builder.deploy_local(endpoint_name="x", wait=True)

On a host where Docker has not previously authenticated to the DLC ECR account (e.g. a fresh SageMaker Notebook instance with Docker installed but unused), the pull fails.

Observed

HTTPError: 404 Client Error: Not Found for url:
http+docker://localhost/v1.44/images/763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:24.09-py3/json
...
ImageNotFound: 404 Client Error ... ("No such image: ...")
...
ValueError: Could not find image '...' in repository

The images.pull() call upstream failed silently: the pull-time auth error is never raised, and the failure only shows up as a 404 on the follow-up inspect. As a result, the user-facing error blames "image not found" when the real cause is that the Docker daemon was never told how to authenticate to this ECR registry.
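
To illustrate where the real error goes: as far as I can tell, the daemon reports pull failures as JSON events inside the progress stream, and the high-level images.pull() drains that stream without looking at it. Below is a minimal sketch of surfacing the error instead. It assumes events are dicts shaped like the ones docker-py's low-level APIClient.pull(..., stream=True, decode=True) yields. The helper name first_pull_error and the sample events are hypothetical, not taken from the SDK.

```python
from typing import Iterable, Optional


def first_pull_error(events: Iterable[dict]) -> Optional[str]:
    """Return the first error message found in a decoded pull stream, if any."""
    for event in events:
        # Failed pulls carry an "error" key; progress events carry "status".
        if "error" in event:
            return str(event["error"])
    return None


# Hypothetical events, shaped like a pull that fails on missing credentials.
sample_events = [
    {"status": "Pulling from sagemaker-tritonserver"},
    {"error": "no basic auth credentials"},
]
print(first_pull_error(sample_events))
```

Against a real daemon, the events would come from docker.APIClient().pull(image, stream=True, decode=True), and a non-None result could be raised as the actual cause.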

Expected

Either:

  1. deploy_local() performs an aws ecr get-login-password | docker login for any ECR-flavored image_uri before calling images.pull(), mirroring v2 behavior; or
  2. deploy_local() raises a clear, actionable error explaining the user must pre-authenticate Docker to ECR, with the exact command shown.
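
For option 2, a rough sketch of what the actionable error could look like. It detects ECR registry hosts with a regex and builds the login command from the image URI. The pattern and the names ECR_HOST, ecr_login_hint and pull_failure_message are my own assumptions, not existing SDK code, and the pattern ignores less common endpoint forms such as FIPS hosts.

```python
import re
from typing import Optional

# Assumed shape of a private ECR registry host:
# <12-digit account>.dkr.ecr.<region>.amazonaws.com (optionally .cn)
ECR_HOST = re.compile(r"^(\d{12})\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com(\.cn)?$")


def ecr_login_hint(image_uri: str) -> Optional[str]:
    """Return a docker login command for an ECR image URI, or None otherwise."""
    registry = image_uri.split("/", 1)[0]
    match = ECR_HOST.match(registry)
    if match is None:
        return None
    region = match.group(2)
    return (
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}"
    )


def pull_failure_message(image_uri: str) -> str:
    """Build the error text for a failed pull, with a login hint for ECR images."""
    message = f"Failed to pull image '{image_uri}'."
    hint = ecr_login_hint(image_uri)
    if hint:
        message += f" Docker may not be authenticated to ECR. Try:\n  {hint}"
    return message
```

The same ecr_login_hint check could also gate an automatic login for option 1, since it already answers whether the URI points at an ECR registry.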

Code reference

sagemaker/serve/mode/local_container_mode.py:270-278:

# Pull the image
try:
    logger.info("Pulling image %s from repository...", image)
    self.client.images.pull(image)
    logger.info("Successfully pulled image %s", image)
except docker.errors.NotFound as e:
    raise ValueError(f"Could not find image '{image}' in repository") from e
except docker.errors.APIError as e:
    raise RuntimeError(f"Failed to pull image '{image}': {e}") from e

No auth_config= passed to images.pull(), no ECR token retrieval, no detection of *.dkr.ecr.*.amazonaws.com hostnames.

Workaround

Pre-authenticate Docker and pre-pull the image before calling deploy_local():

aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin 763104351884.dkr.ecr.<region>.amazonaws.com

docker pull 763104351884.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:24.09-py3

Once the image is in the local Docker cache, the broken images.pull() call becomes effectively a no-op and deploy_local() proceeds.

Severity

Medium. Functionally blocks v3 local mode for any AWS-published container image out of the box, which is the most common image source for users. Workaround is mechanical but undocumented in the v3 inference docs.

Suggestion

Port the _ecr_login_if_needed helper from v2's sagemaker.local.image (which detects ECR hostnames and runs the login automatically) and call it from local_container_mode.py:_pull_image() before images.pull().
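
If porting the v2 helper is not desirable, an alternative that avoids shelling out might look like the sketch below: fetch an ECR token with boto3 and hand it to images.pull() through auth_config. This is not the v2 implementation. The function names decode_ecr_token and pull_with_ecr_auth are hypothetical, and I am assuming the token returned for the caller's credentials is accepted by the DLC registry, which I have not verified for every account and region.

```python
import base64


def decode_ecr_token(authorization_token: str) -> dict:
    """Turn ECR's base64 'user:password' token into a docker-py auth_config dict."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    username, password = decoded.split(":", 1)
    return {"username": username, "password": password}


def pull_with_ecr_auth(docker_client, image: str, region: str):
    """Fetch an ECR token with boto3 and pass it to images.pull() as auth_config."""
    import boto3  # imported lazily so the decoder above has no AWS dependency

    ecr = boto3.client("ecr", region_name=region)
    auth_data = ecr.get_authorization_token()["authorizationData"][0]
    auth_config = decode_ecr_token(auth_data["authorizationToken"])
    return docker_client.images.pull(image, auth_config=auth_config)
```

Passing auth_config per request keeps the credentials out of the host's Docker config file, at the cost of fetching a token on each pull.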

Environment

  • OS: Linux 6.1.170-210.320.amzn2023.x86_64 (Amazon Linux 2023)
  • Host: SageMaker Notebook instance (BaseNotebookInstanceEc2InstanceRole)
  • Python: 3.10 (/home/ec2-user/anaconda3/envs/python3/bin/python)
  • Kernel: conda_python3
  • sagemaker: 3.12.0
  • sagemaker-core: 2.12.0
  • sagemaker-serve: 1.12.0
  • sagemaker-train: 1.12.0
  • sagemaker-mlops: 1.12.0
  • docker (client): 25.0.14
  • Region: us-east-1
