Improve hardware context selection with register validation#18
Merged
Conversation
Signed-off-by: snigupta <snigupta@amd.com>
Signed-off-by: snigupta <snigupta@amd.com>
hackwa
reviewed
May 20, 2026
|
|
||
| 1. If only one context exists, auto-select it. | ||
| 2. If multiple exist, validate all (Active and Idle) with register/program-memory read. | ||
| 3. If no context passes validation, prompt the user (which times out after ``HW_CONTEXT_INPUT_TIMEOUT_S`` seconds and calls ``cleanup_and_exit(args, 1)`` on failure / timeout). |
Member
There was a problem hiding this comment.
this line is too long. please limit each line to 120 characters
hackwa
reviewed
May 20, 2026
| """ | ||
| # Load AIE interface if not provided | ||
| if aie_iface is None: | ||
| aie_iface = loader.load_aie_arch(device) |
Member
There was a problem hiding this comment.
please use args.aie_iface instead of defining a new one.
hackwa
reviewed
May 20, 2026
| # CORE_STATUS register - safe read-only register | ||
| # Device-specific addresses: Telluride=0x38004, PHX/STX=0x32004 | ||
| if "CORE_STATUS" not in aie_iface.Core_registers: | ||
| raise RuntimeError(f"CORE_STATUS register not defined for device {device}") |
Member
There was a problem hiding this comment.
this is unnecessary. we always have CORE_STATUS
hackwa
reviewed
May 20, 2026
| print(f"[INFO] Context {ctx} validated successfully (CORE_STATUS=0x{reg_value:08x})") | ||
| valid_contexts.append((ctx, pid)) | ||
|
|
||
| except Exception as e: |
Member
There was a problem hiding this comment.
in future, it would be good to catch the specific exception
hackwa
reviewed
May 20, 2026
| reg_value = backend.read_register(test_col, test_row, test_reg) | ||
|
|
||
| # This context passed validation | ||
| print(f"[INFO] Context {ctx} validated successfully (CORE_STATUS=0x{reg_value:08x})") |
Member
There was a problem hiding this comment.
can we comment out these prints. these are good for debug but don't give end user any information
hackwa
reviewed
May 20, 2026
| ctx = int(list(current_contexts.keys())[0]) | ||
| pid = int(list(current_contexts.values())[0]["pid"]) | ||
| else: | ||
| print(f"[INFO] Auto-selected single context: {ctx}") |
Member
There was a problem hiding this comment.
can we comment out these prints. these are good for debug but don't give end user any information
Signed-off-by: snigupta <snigupta@amd.com>
hackwa
approved these changes
May 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Refactor
check_hw_context()so MLDebugger can pick the right XRT hardware context when multiple hw ctx are listed byxrt-smi._validate_contexts_with_read()to probe all contexts with a read-onlyCORE_STATUSregister access before attaching.check_hw_context(); validation only returnsNoneor a list of(context_id, pid)pairs that succeeded.Motivation
On Telluride (and similar setups),
xrt-smioften reports several contexts for the same PID. Previously selection could block indefinitely oninput()with no timeout. With this PR, MLDebugger can pick the correct/valid context through a register probe: if MLDebugger can readCORE_STATUSon a context, that handle is at least usable for debug attach.Behavior
xrt-smixrt-smitable; prompt user (60s timeout)xrt-smifailureOn timeout or invalid context ID, call
cleanup_and_exit(args, 1).