## fix: resolve MLflow app discovery issues #5924
Open
lucasjia-aws wants to merge 12 commits into
…resolution
The SageMakerClient singleton caches the first region it is initialized with and ignores subsequent region parameters. This causes Nova integ tests (which run in us-east-1) to fail when the singleton was already created with us-west-2 by an earlier test in the same process.
Errors observed:
- ModelPackageGroup arn:aws:sagemaker:us-west-2:784379639078:model-package-group/sdk-test-finetuned-models does not exist
- DescribeModelPackage: ARN should be scoped to correct region: us-west-2
Fix: use session.boto_session.client("sagemaker") directly instead of ModelPackageGroup.get() / ModelPackage.get() in the three call sites that resolve model package resources. This respects the session's actual region without depending on the singleton's cached state.
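The caching behaviour described above can be reproduced with a toy model. This is a minimal sketch, not the SDK's code: `CachedRegionClient` and `ToySession` are hypothetical stand-ins invented for illustration, and the only behaviour assumed is what the commit message states, namely that the singleton keeps the first region it sees while a session-bound client follows the session's own region.

```python
class CachedRegionClient:
    """Toy process-wide singleton (hypothetical; not the SDK class)."""

    _instance = None

    def __init__(self, region):
        self.region = region

    @classmethod
    def get(cls, region):
        # Only the first call constructs the object. Later calls reuse it,
        # so their `region` argument is silently ignored.
        if cls._instance is None:
            cls._instance = cls(region)
        return cls._instance


class ToySession:
    """Toy session (hypothetical) that builds clients bound to its own region."""

    def __init__(self, region):
        self.region = region

    def client(self):
        # A per-session client always reflects the session's region.
        return {"region": self.region}


first = CachedRegionClient.get("us-west-2")   # earlier test in the same process
second = CachedRegionClient.get("us-east-1")  # later test asks for us-east-1
session_client = ToySession("us-east-1").client()

print(second.region)             # region cached by the singleton
print(session_client["region"])  # region taken from the session
```

The second `get` call returns the same object as the first, which is why a later test can end up issuing requests against the wrong region.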
_update_pipeline_lineage assumed the version context always exists. When it's been deleted or never created (e.g. prior run failure), DescribeContext throws ResourceNotFound. Now catches the error and recreates the version context with proper associations.
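The describe-then-recreate pattern in this commit might look roughly like the sketch below. Everything here is an assumption made for illustration: `FakeContextStore`, the `ResourceNotFound` class, and `get_or_recreate_version_context` are hypothetical names, and the direction of the recreated association (pipeline context to version context) is a guess rather than something the commit message confirms.

```python
class ResourceNotFound(Exception):
    """Hypothetical stand-in for the service's ResourceNotFound error."""


class FakeContextStore:
    """In-memory stand-in for the lineage API (not the real client)."""

    def __init__(self):
        self.contexts = {}
        self.associations = []

    def describe_context(self, name):
        if name not in self.contexts:
            raise ResourceNotFound(name)
        return self.contexts[name]

    def create_context(self, name, properties):
        self.contexts[name] = {"ContextName": name, "Properties": properties}
        return self.contexts[name]

    def add_association(self, source, destination):
        self.associations.append((source, destination))


def get_or_recreate_version_context(store, pipeline_name, version_name, properties):
    """Return the version context, recreating it when it is missing."""
    try:
        return store.describe_context(version_name)
    except ResourceNotFound:
        # The context was deleted or never created (e.g. a prior run failed):
        # rebuild it and link it back to the pipeline context.
        context = store.create_context(version_name, properties)
        store.add_association(pipeline_name, version_name)
        return context


store = FakeContextStore()
created = get_or_recreate_version_context(store, "pipe", "pipe-v1", {"k": "v"})
reused = get_or_recreate_version_context(store, "pipe", "pipe-v1", {"k": "v"})
```

Catching only the not-found error keeps other failures (throttling, permission errors) visible instead of masking them behind a recreate.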
…ates app
Replace hard-coded MLflow app ARN with a conftest fixture that finds an existing ready app or creates a temporary one (cleaned up after tests). Prevents failures when the hard-coded app is deleted or quota is full.
X-AI-Prompt: add self-healing mlflow fixture for llm_as_judge integ tests
X-AI-Tool: kiro-cli
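The find-or-create logic behind such a fixture could be sketched as follows. This is an illustrative outline under stated assumptions, not the fixture from the PR: `FakeMlflowApi`, its methods, the `"Arn"`/`"Status"` field names, and the `integ-temp` app name are all hypothetical, and waiting for a newly created app to become ready is omitted. The same generator body could be wrapped in `pytest.fixture` instead of `contextlib.contextmanager`.

```python
from contextlib import contextmanager


class FakeMlflowApi:
    """In-memory stand-in for the MLflow app API (hypothetical)."""

    def __init__(self, apps):
        self.apps = dict(apps)  # maps ARN -> status
        self.deleted = []

    def list_apps(self):
        return [{"Arn": arn, "Status": status} for arn, status in self.apps.items()]

    def create_app(self, name):
        arn = f"arn:fake:mlflow-app/{name}"
        self.apps[arn] = "Created"
        return arn

    def delete_app(self, arn):
        self.apps.pop(arn, None)
        self.deleted.append(arn)


@contextmanager
def mlflow_app_arn(api, ready_status="Created", temp_name="integ-temp"):
    """Yield the ARN of a ready app, creating a temporary one if none exists."""
    for app in api.list_apps():
        if app["Status"] == ready_status:
            # Reuse an existing app; it is not ours, so never delete it.
            yield app["Arn"]
            return
    arn = api.create_app(temp_name)
    try:
        yield arn
    finally:
        # Only the temporary app created here is cleaned up.
        api.delete_app(arn)


existing = FakeMlflowApi({"arn:fake:mlflow-app/shared": "Created"})
with mlflow_app_arn(existing) as reused_arn:
    pass

empty = FakeMlflowApi({})
with mlflow_app_arn(empty) as temp_arn:
    pass
```

Deleting only what the helper itself created keeps shared apps in the test account intact while still freeing quota after the run.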
The serve integ test succeeded in a previous commit: https://github.com/aws/sagemaker-python-sdk/actions/runs/26923043569/job/79444479419, and the commits that followed do not touch the serve module.
mujtaba1747
approved these changes
Jun 4, 2026
mujtaba1747
previously approved these changes
Jun 4, 2026
Per SDK coding standards, avoid calling boto3 directly. Use the session's sagemaker_client attribute which already has the correct region bound at session creation time.
### Summary
This PR fixes multiple test failures in `sagemaker-train-integ-tests` that emerged after the MTRL Launch PR (#5919) was merged. The root causes are: (1) an incorrect response key in MLflow app discovery, (2) deleted/hard-coded MLflow app ARNs in tests, and (3) a missing None check in feature processor lineage.

### Changes

| Affected tests / file | Root cause | Fix |
| --- | --- | --- |
| … `benchmark`, `llm_as_judge`, `custom_scorer`) | `_resolve_mlflow_resource_arn` in `finetune_utils.py` reads `page.get("MlflowApps", [])` but the `list_mlflow_apps` API returns apps under `"Summaries"`. The function always sees an empty list, tries to create a new app, and fails on quota. | … `"MlflowApps"` to `"Summaries"` in `_resolve_mlflow_resource_arn`. |
| `test_llm_as_judge_base_model_fix` | … `mlflow_tracking_server_arn` (`app-W7FOBBXZANVX`) no longer exists in the test account. | … `mlflow_resource_arn` pytest fixture in `conftest.py` that auto-discovers an existing ready MLflow app or creates a temporary one (with cleanup). Both tests in this file use the fixture. |
| `test_mtrl_trainer_integration`, `test_mtrl_evaluator`, `test_mtrl_evaluator_3p_agent`, `test_multi_turn_rl_trainer_integration` | … `app-ZG6FYITNGMMU`) was deleted from the test account. | … `app-O4ZGQYBYHMRH` (mtrl-integ-test) which is in Created state in the same account. |
| `_feature_processor_lineage.py` | `AttributeError` when `pipeline_version_context` is None during lineage update. | … `pipeline_version_context` attributes. |
| … | … `"MlflowApps"` response key. | … `"MlflowApps"` → `"Summaries"` in MLflow unit test mocks. |

### Testing
- `sagemaker-train` unit tests: all passing locally
- `sagemaker-train-integ-tests`: MLflow-related failures resolved; remaining failures are infra flaky (AlgorithmError) or quota-limited MTRL eval jobs (pre-existing, unrelated)
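The response-key fix for MLflow app discovery can be illustrated with a small sketch over fake pages. It assumes only what the description states, that `list_mlflow_apps` pages carry apps under `"Summaries"`; the helper name `find_ready_app_arn`, the `"Arn"` and `"Status"` field names, and the sample ARNs are hypothetical.

```python
def find_ready_app_arn(pages, key="Summaries", ready_status="Created"):
    """Return the ARN of the first ready app across paginated responses."""
    for page in pages:
        # Reading the wrong key yields an empty list on every page, which
        # looks exactly like "no apps exist" to the caller.
        for app in page.get(key, []):
            if app.get("Status") == ready_status:
                return app.get("Arn")
    return None


# Fake paginated responses shaped the way the PR description reports them.
pages = [
    {"Summaries": [{"Arn": "arn:fake:mlflow-app/a", "Status": "Deleting"}]},
    {"Summaries": [{"Arn": "arn:fake:mlflow-app/b", "Status": "Created"}]},
]

with_fix = find_ready_app_arn(pages)                    # reads "Summaries"
before_fix = find_ready_app_arn(pages, key="MlflowApps")  # old, wrong key
```

With the old key the lookup returns `None` even though a ready app exists, which matches the described failure mode of always trying to create a new app and hitting quota.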