## fix: resolve MLflow app discovery issues by lucasjia-aws · Pull Request #5924 · aws/sagemaker-python-sdk

lucasjia-aws · 2026-06-03T20:22:18Z

Summary

This PR fixes multiple test failures in sagemaker-train-integ-tests that emerged after the MTRL Launch PR (#5919) was merged. The root causes are: (1) an incorrect response key in MLflow app discovery, (2) deleted/hard-coded MLflow app ARNs in tests, and (3) a missing None check in feature processor lineage.

Changes

| Test / Component | Problem | Fix |
| --- | --- | --- |
| All evaluator integ tests (`benchmark`, `llm_as_judge`, `custom_scorer`) | `_resolve_mlflow_resource_arn` in `finetune_utils.py` reads `page.get("MlflowApps", [])` but the `list_mlflow_apps` API returns apps under `"Summaries"`. The function always sees an empty list, tries to create a new app, and fails on quota. | Change the key from `"MlflowApps"` to `"Summaries"` in `_resolve_mlflow_resource_arn`. |
| `test_llm_as_judge_base_model_fix` | Hard-coded `mlflow_tracking_server_arn` (`app-W7FOBBXZANVX`) no longer exists in the test account. | Add a `mlflow_resource_arn` pytest fixture in `conftest.py` that auto-discovers an existing ready MLflow app or creates a temporary one (with cleanup). Both tests in this file use the fixture. |
| `test_mtrl_trainer_integration`, `test_mtrl_evaluator`, `test_mtrl_evaluator_3p_agent`, `test_multi_turn_rl_trainer_integration` | Hard-coded MLflow app ARN (`app-ZG6FYITNGMMU`) was deleted from the test account. | Replace with existing `app-O4ZGQYBYHMRH` (`mtrl-integ-test`), which is in Created state in the same account. |
| `_feature_processor_lineage.py` | `AttributeError` when `pipeline_version_context` is None during lineage update. | Add None check before accessing `pipeline_version_context` attributes. |
| Unit tests for MLflow fix | Tests were mocking the old `"MlflowApps"` response key. | Fix `"MlflowApps"` → `"Summaries"` in MLflow unit test mocks. |
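
The response-key mismatch described in the first row can be illustrated with a minimal sketch. Only the `"MlflowApps"` and `"Summaries"` keys come from this PR; the helper name `first_app_arn`, the `Arn` field, and the stubbed pages are hypothetical stand-ins, not the SDK's actual code.

```python
from typing import Iterable, Optional


def first_app_arn(pages: Iterable[dict], key: str) -> Optional[str]:
    """Return the ARN of the first app found under `key`, or None."""
    for page in pages:
        for app in page.get(key, []):
            return app.get("Arn")  # "Arn" is an assumed field name
    return None


# Stubbed pages shaped like the response described above (apps under "Summaries").
pages = [{"Summaries": [{"Arn": "arn:example:app-1"}]}]

wrong = first_app_arn(pages, "MlflowApps")  # old key: silently sees no apps
right = first_app_arn(pages, "Summaries")   # fixed key: finds the existing app
```

Because `dict.get` falls back to an empty list, the wrong key raises no error; the lookup just looks like "no apps exist", which is why the failure surfaced later as a quota error on app creation.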

Testing

- sagemaker-train unit tests: all passing locally
- sagemaker-train-integ-tests: MLflow-related failures resolved; the remaining failures are flaky infrastructure errors (AlgorithmError) or quota-limited MTRL eval jobs (pre-existing and unrelated)

…resolution

The SageMakerClient singleton caches the first region it is initialized with and ignores subsequent region parameters. This causes Nova integ tests (which run in us-east-1) to fail when the singleton was already created with us-west-2 by an earlier test in the same process.

Errors observed:

- `ModelPackageGroup arn:aws:sagemaker:us-west-2:784379639078:model-package-group/sdk-test-finetuned-models does not exist`
- `DescribeModelPackage: ARN should be scoped to correct region: us-west-2`

Fix: use `session.boto_session.client("sagemaker")` directly instead of `ModelPackageGroup.get()` / `ModelPackage.get()` in the three call sites that resolve model package resources. This respects the session's actual region without depending on the singleton's cached state.
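
The failure mode in this commit message can be reproduced with a small, self-contained sketch. `RegionCachingClient` is a hypothetical class written for illustration; it is not the SDK's `SageMakerClient`, and only the "first region wins" behavior is taken from the description above.

```python
from typing import Optional


class RegionCachingClient:
    """Hypothetical singleton that keeps the first region it was given."""

    _instance: Optional["RegionCachingClient"] = None

    def __new__(cls, region: str) -> "RegionCachingClient":
        if cls._instance is None:
            instance = super().__new__(cls)
            instance.region = region
            cls._instance = instance
        # Later calls return the cached instance and ignore `region`.
        return cls._instance


first = RegionCachingClient("us-west-2")
second = RegionCachingClient("us-east-1")  # region argument is ignored
```

Here `second` is the same object as `first` and still reports `us-west-2`, which matches the cross-region errors listed above when tests for different regions share one process.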

_update_pipeline_lineage assumed the version context always exists. When it's been deleted or never created (e.g. prior run failure), DescribeContext throws ResourceNotFound. Now catches the error and recreates the version context with proper associations.
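
A minimal sketch of the catch-and-recreate pattern this commit describes, under stated assumptions: the `ResourceNotFound` class, the `describe`/`create` callables, and the in-memory `store` are hypothetical stand-ins for the real lineage calls, which this page does not show.

```python
class ResourceNotFound(Exception):
    """Stand-in for the service's not-found error."""


def load_or_recreate(describe, create, name):
    """Return the context named `name`, recreating it if it is missing."""
    try:
        return describe(name)
    except ResourceNotFound:
        # Missing context (deleted, or never created after a failed run).
        return create(name)


store = {}  # in-memory substitute for the lineage backend


def describe(name):
    if name not in store:
        raise ResourceNotFound(name)
    return store[name]


def create(name):
    store[name] = {"name": name, "associations": []}
    return store[name]


ctx = load_or_recreate(describe, create, "pipeline-v1")    # recreated
again = load_or_recreate(describe, create, "pipeline-v1")  # found this time
```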

…ates app

Replace hard-coded MLflow app ARN with a conftest fixture that finds an existing ready app or creates a temporary one (cleaned up after tests). Prevents failures when the hard-coded app is deleted or quota is full.

X-AI-Prompt: add self-healing mlflow fixture for llm_as_judge integ tests
X-AI-Tool: kiro-cli
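
The find-or-create-with-cleanup behavior described in this commit could look roughly like the sketch below. It uses a context manager and an in-memory `FakeMlflowClient` so it runs without AWS access; the client interface (`list_ready_apps`, `create_app`, `delete_app`) and the app name are hypothetical, and the PR's actual fixture is not shown on this page.

```python
from contextlib import contextmanager


class FakeMlflowClient:
    """In-memory stand-in for an MLflow app API (hypothetical interface)."""

    def __init__(self, ready_arns=()):
        self.ready = list(ready_arns)
        self.deleted = []

    def list_ready_apps(self):
        return list(self.ready)

    def create_app(self, name):
        arn = f"arn:example:{name}"
        self.ready.append(arn)
        return arn

    def delete_app(self, arn):
        self.ready.remove(arn)
        self.deleted.append(arn)


@contextmanager
def mlflow_resource_arn(client):
    """Yield an existing ready app ARN, or a temporary one removed on exit."""
    existing = client.list_ready_apps()
    if existing:
        yield existing[0]
        return
    arn = client.create_app("temp-integ-app")
    try:
        yield arn
    finally:
        client.delete_app(arn)  # clean up only what this helper created


reused = FakeMlflowClient(["arn:example:shared"])
with mlflow_resource_arn(reused) as arn_reused:
    pass  # tests would run here against the shared app

fresh = FakeMlflowClient()
with mlflow_resource_arn(fresh) as arn_fresh:
    pass  # tests would run here against the temporary app
```

In a `conftest.py`, the same yield-then-cleanup body would typically sit under `@pytest.fixture`, with the code after `yield` acting as teardown.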

lucasjia-aws · 2026-06-04T16:29:57Z

The serve integ test succeeded on a previous commit (https://github.com/aws/sagemaker-python-sdk/actions/runs/26923043569/job/79444479419), and the commits that followed do not touch the serve module.

test: update unit tests

37d081a

lucasjia-aws force-pushed the nova_tests branch from 019b18a to e34cc82 Compare June 4, 2026 00:33

lucasjia-aws force-pushed the nova_tests branch from e34cc82 to fbfc0c7 Compare June 4, 2026 00:35

fix(test): use correct response key "Summaries" for list_mlflow_apps API

c3644a7

mark two slow tests as not serial

bd2f406

fix: use correct response key "Summaries" in _resolve_mlflow_resource…

3f676da

…_arn

lucasjia-aws force-pushed the nova_tests branch from 407a916 to 3f676da Compare June 4, 2026 07:29

replace not-existing mlflow app

c39b7dc

lucasjia-aws changed the title ~~fix: bypass SageMakerClient singleton for cross-region model package resolution~~ fix: resolve cross-region singleton bug and MLflow app discovery issues Jun 4, 2026

mujtaba1747 approved these changes Jun 4, 2026

mujtaba1747 previously approved these changes Jun 4, 2026

refactor: use session.sagemaker_client instead of boto_session.client

cdd0500

Per SDK coding standards, avoid calling boto3 directly. Use the session's sagemaker_client attribute which already has the correct region bound at session creation time.

lucasjia-aws dismissed mujtaba1747’s stale review via cdd0500 June 4, 2026 17:19

revert: remove SageMakerClient singleton bypass from feature code

1edeecb

rsareddy0329 mentioned this pull request Jun 4, 2026

Master mtrl eval issue fix #5923

Merged

Merge branch 'master' into nova_tests

d0764cb

Merge branch 'master' into nova_tests

f47deb5

lucasjia-aws deployed to auto-approve June 4, 2026 20:05 — with GitHub Actions Active

lucasjia-aws changed the title ~~fix: resolve cross-region singleton bug and MLflow app discovery issues~~ fix: resolve MLflow app discovery issues Jun 4, 2026

lucasjia-aws wants to merge 12 commits into aws:master from lucasjia-aws:nova_tests

2 participants
