## fix: resolve MLflow app discovery issues #5924

Open
lucasjia-aws wants to merge 12 commits into aws:master from lucasjia-aws:nova_tests


lucasjia-aws (Collaborator) commented Jun 3, 2026

Summary

This PR fixes multiple test failures in sagemaker-train-integ-tests that emerged after the MTRL Launch PR (#5919) was merged. The root causes are: (1) an incorrect response key in MLflow app discovery, (2) deleted/hard-coded MLflow app ARNs in tests, and (3) a missing None check in feature processor lineage.

Changes

| Test / Component | Problem | Fix |
| --- | --- | --- |
| All evaluator integ tests (benchmark, llm_as_judge, custom_scorer) | `_resolve_mlflow_resource_arn` in `finetune_utils.py` reads `page.get("MlflowApps", [])`, but the `list_mlflow_apps` API returns apps under `"Summaries"`. The function always sees an empty list, tries to create a new app, and fails on quota. | Change the key from `"MlflowApps"` to `"Summaries"` in `_resolve_mlflow_resource_arn`. |
| `test_llm_as_judge_base_model_fix` | Hard-coded `mlflow_tracking_server_arn` (app-W7FOBBXZANVX) no longer exists in the test account. | Add a `mlflow_resource_arn` pytest fixture in `conftest.py` that auto-discovers an existing ready MLflow app or creates a temporary one (with cleanup). Both tests in this file use the fixture. |
| `test_mtrl_trainer_integration`, `test_mtrl_evaluator`, `test_mtrl_evaluator_3p_agent`, `test_multi_turn_rl_trainer_integration` | Hard-coded MLflow app ARN (app-ZG6FYITNGMMU) was deleted from the test account. | Replace with the existing app-O4ZGQYBYHMRH (mtrl-integ-test), which is in the Created state in the same account. |
| `_feature_processor_lineage.py` | `AttributeError` when `pipeline_version_context` is None during lineage update. | Add a None check before accessing `pipeline_version_context` attributes. |
| Unit tests for MLflow fix | Tests were mocking the old `"MlflowApps"` response key. | Change `"MlflowApps"` → `"Summaries"` in the MLflow unit test mocks. |
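
The key mismatch in the first row can be illustrated with a minimal sketch. It assumes each response page is a plain dict, as a boto3 paginator would yield. The helper name `find_ready_app_arn` and the per-app field names `Arn` and `Status` are hypothetical and are not taken from the SDK; only the `"Summaries"` and `"MlflowApps"` keys and the Created state come from the description above.

```python
from typing import Iterable, Optional


def find_ready_app_arn(
    pages: Iterable[dict],
    list_key: str = "Summaries",
    ready_status: str = "Created",
) -> Optional[str]:
    """Return the ARN of the first ready app in the paginated response, or None."""
    for page in pages:
        # Reading the wrong key does not raise: .get() silently yields [].
        for app in page.get(list_key, []):
            if app.get("Status") == ready_status:  # "Status" is an assumed field
                return app.get("Arn")  # "Arn" is an assumed field
    return None


# Two fake pages standing in for the paginated API response.
pages = [
    {"Summaries": [{"Arn": "arn:example:app-1", "Status": "Creating"}]},
    {"Summaries": [{"Arn": "arn:example:app-2", "Status": "Created"}]},
]

print(find_ready_app_arn(pages))                         # arn:example:app-2
print(find_ready_app_arn(pages, list_key="MlflowApps"))  # None
```

With the old key the lookup returns None even though a ready app exists, which matches the described behavior of falling through to app creation and hitting the quota.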

Testing

  • sagemaker-train unit tests: all passing locally
  • sagemaker-train-integ-tests: MLflow-related failures resolved; the remaining failures come from flaky infrastructure (AlgorithmError) or quota-limited MTRL eval jobs, both pre-existing and unrelated to this PR

…resolution

The SageMakerClient singleton caches the first region it is initialized with and ignores subsequent region parameters. This causes Nova integ tests (which run in us-east-1) to fail when the singleton was already created with us-west-2 by an earlier test in the same process.

Errors observed:
- ModelPackageGroup arn:aws:sagemaker:us-west-2:784379639078:model-package-group/sdk-test-finetuned-models does not exist
- DescribeModelPackage: ARN should be scoped to correct region: us-west-2

Fix: use session.boto_session.client("sagemaker") directly instead of ModelPackageGroup.get() / ModelPackage.get() in the three call sites that resolve model package resources. This respects the session's actual region without depending on the singleton's cached state.
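
The caching behavior described in this commit message can be reproduced with a toy singleton. This is only an illustrative sketch: `CachedRegionClient` and `FakeSession` are hypothetical stand-ins, not the SDK's `SageMakerClient` or session classes.

```python
class CachedRegionClient:
    """Toy singleton that keeps the first region it was created with."""

    _instance = None

    def __new__(cls, region: str):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.region = region  # set only on first construction
        return cls._instance  # later `region` arguments are ignored


class FakeSession:
    """Stand-in for a session that carries its own region."""

    def __init__(self, region: str):
        self.region = region


first = CachedRegionClient("us-west-2")   # created by an earlier test
second = CachedRegionClient("us-east-1")  # a later test asks for another region
print(second.region)  # us-west-2

session = FakeSession("us-east-1")
print(session.region)  # us-east-1
```

Resolving resources through the session's own region, rather than through the process-wide singleton, is what keeps the later test independent of test ordering.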
_update_pipeline_lineage assumed the version context always exists.
When it's been deleted or never created (e.g. prior run failure),
DescribeContext throws ResourceNotFound. Now catches the error and
recreates the version context with proper associations.
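
A minimal sketch of the catch-and-recreate pattern described above, using an in-memory store. `ResourceNotFound`, `InMemoryContextStore`, and `load_or_recreate_version_context` are hypothetical names for illustration and do not mirror the actual `_update_pipeline_lineage` code.

```python
class ResourceNotFound(Exception):
    """Stand-in for the service's not-found error."""


class InMemoryContextStore:
    """Hypothetical store that mimics describe/create calls."""

    def __init__(self):
        self._contexts = {}

    def describe(self, name: str) -> dict:
        if name not in self._contexts:
            raise ResourceNotFound(name)
        return self._contexts[name]

    def create(self, name: str) -> dict:
        self._contexts[name] = {"name": name, "associations": []}
        return self._contexts[name]


def load_or_recreate_version_context(store, name: str, parent: str) -> dict:
    """Describe the context; if it is missing, recreate it and re-link it."""
    try:
        return store.describe(name)
    except ResourceNotFound:
        context = store.create(name)
        context["associations"].append(parent)  # restore the link to the parent
        return context


store = InMemoryContextStore()
ctx = load_or_recreate_version_context(store, "pipeline-v1", "pipeline")
print(ctx["associations"])  # ['pipeline']
```

A second call with the same name takes the `describe` path and returns the existing context without adding a duplicate association.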
…ates app

Replace hard-coded MLflow app ARN with a conftest fixture that finds an
existing ready app or creates a temporary one (cleaned up after tests).
Prevents failures when the hard-coded app is deleted or quota is full.

X-AI-Prompt: add self-healing mlflow fixture for llm_as_judge integ tests
X-AI-Tool: kiro-cli
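
The find-or-create behavior with cleanup might look roughly like the sketch below. It is written as a context manager over a hypothetical in-memory `FakeMlflowApi` so it runs without AWS or pytest; the method names, the ARN format, and the app name `temp-integ-app` are assumptions, not the code in `conftest.py`.

```python
from contextlib import contextmanager


class FakeMlflowApi:
    """Hypothetical in-memory API used only for this sketch."""

    def __init__(self, apps=None):
        self.apps = dict(apps or {})  # maps ARN -> status

    def list_ready(self):
        return [arn for arn, status in self.apps.items() if status == "Created"]

    def create(self, name):
        arn = f"arn:example:{name}"
        self.apps[arn] = "Created"
        return arn

    def delete(self, arn):
        self.apps.pop(arn, None)


@contextmanager
def mlflow_resource_arn(api):
    """Yield a ready app ARN; delete it afterwards only if it was created here."""
    ready = api.list_ready()
    if ready:
        yield ready[0]  # reuse an existing app and leave it in place
        return
    arn = api.create("temp-integ-app")
    try:
        yield arn
    finally:
        api.delete(arn)  # clean up the temporary app even if the test fails


api = FakeMlflowApi()
with mlflow_resource_arn(api) as arn:
    print(arn in api.apps)  # True
print(api.apps)  # {}
```

In a real `conftest.py` the same yield-then-cleanup body would typically live in a function decorated with `pytest.fixture`, where code after the `yield` runs as teardown.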
lucasjia-aws changed the title from "fix: bypass SageMakerClient singleton for cross-region model package resolution" to "fix: resolve cross-region singleton bug and MLflow app discovery issues" Jun 4, 2026
lucasjia-aws (Collaborator, Author) commented Jun 4, 2026

The serve integ test succeeded on a previous commit: https://github.com/aws/sagemaker-python-sdk/actions/runs/26923043569/job/79444479419. The commits that follow it do not touch the serve module.

mujtaba1747 previously approved these changes Jun 4, 2026
Per SDK coding standards, avoid calling boto3 directly. Use the
session's sagemaker_client attribute which already has the correct
region bound at session creation time.
lucasjia-aws deployed to auto-approve June 4, 2026 20:05 with GitHub Actions (Active)
lucasjia-aws changed the title from "fix: resolve cross-region singleton bug and MLflow app discovery issues" to "fix: resolve MLflow app discovery issues" Jun 4, 2026