
add no-copy hermitian transpose support to matmul #1173

Open
simonbyrne wants to merge 2 commits into main from sbyrne/hermitian

Conversation

@simonbyrne (Collaborator)

At the moment, calling matmul(hermitianT(A), B) or matmul(conj(transpose_matrix(A)), B) instantiates a temporary intermediate matrix. This change makes it call the appropriate cuBLAS operation directly (similar to how matmul(transpose_matrix(A), B) is handled).
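
Below is a minimal sketch of the call pattern this PR optimizes, using the public MatX API; the shapes and use of the default stream are illustrative:

```cpp
#include <matx.h>

int main() {
  using complexf = cuda::std::complex<float>;
  auto A = matx::make_tensor<complexf>({8, 4});
  auto B = matx::make_tensor<complexf>({8, 2});
  auto C = matx::make_tensor<complexf>({4, 2});

  // Previously, the conjugate transpose of A was materialized into a
  // temporary before the GEMM. With this change, the intent is forwarded
  // to cuBLAS as a conjugate-transpose op, so no copy of A is made.
  (C = matx::matmul(matx::hermitianT(A), B)).run();

  cudaDeviceSynchronize();
  return 0;
}
```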

@copy-pr-bot

copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@simonbyrne
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR avoids materializing an intermediate matrix when hermitianT(A) or conj(bare_tensor) is passed as a matmul operand by introducing a WithMatmulOperand dispatch layer that extracts the inner tensor and forwards the conjugate-transpose intent to cuBLASLt via CUBLAS_OP_C. It also cleanly separates memory-order tracking (orderA/B/C) from the cuBLAS operation flags (opA/opB), which were previously conflated.
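
A hedged sketch of that layout/operation split (names are illustrative, not the PR's exact code): memory order is derived from the operand's strides, while the op flag independently records the requested transform.

```cpp
#include <cublasLt.h>

// Illustrative helper: memory layout comes from strides alone.
inline cublasLtOrder_t OrderFromStrides(long long inner_stride) {
  // Unit inner stride => contiguous rows (row-major layout).
  return inner_stride == 1 ? CUBLASLT_ORDER_ROW : CUBLASLT_ORDER_COL;
}

// e.g. for matmul(hermitianT(A), B):
//   orderA = OrderFromStrides(A.Stride(1));  // memory order, from strides
//   opA    = CUBLAS_OP_C;                    // operation, from hermitianT
```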

  • hermitianT(A) fast path: extracts the inner tensor view, passes it directly with CUBLAS_OP_C — genuinely zero-copy (see the sketch after this list).
  • conj(tensor_view) fast path: creates transpose_matrix(input) (a TransposeMatrixOp) and passes it to getCublasSupportedTensor, but GetSupportedTensor assumes any transform op has already completed PreRun and calls in.Data() directly — yielding a null GPU pointer since PreRun was never invoked here.
  • conj(transpose_matrix(A)) and conj(hermitianT(A)) still fall to the else branch (materializing a copy), which is correct and tested.
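
For concreteness, here is a hedged, self-contained sketch of the dispatch idea; the types and trait below are simplified stand-ins for the MatX originals, not the PR's exact code:

```cpp
#include <cublas_v2.h>
#include <type_traits>

// Simplified stand-ins for the MatX types (hypothetical).
template <typename T> struct TensorView { T *data; };
template <typename T> struct HermitianTransOp {
  TensorView<T> inner;
  auto Input() const { return inner; }  // accessor added by this PR
};

template <typename Op> struct is_hermitian_trans_op : std::false_type {};
template <typename T>
struct is_hermitian_trans_op<HermitianTransOp<T>> : std::true_type {};

// Sketch of the WithMatmulOperand idea: peel off hermitianT and record the
// intent as CUBLAS_OP_C instead of materializing a conjugate-transposed copy.
template <typename Op>
auto WithMatmulOperand(const Op &op, cublasOperation_t &cublas_op) {
  if constexpr (is_hermitian_trans_op<Op>::value) {
    cublas_op = CUBLAS_OP_C;  // cuBLAS applies the conjugate transpose
    return op.Input();        // zero-copy: the inner tensor view
  } else {
    cublas_op = CUBLAS_OP_N;  // operand used as-is
    return op;
  }
}
```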

Confidence Score: 3/5

The hermitianT(A) fast path is correct and zero-copy, but the conj(bare tensor_view) branch hands cuBLASLt a null GPU pointer and will crash at runtime.

The is_conj_tensor_view_unary_op_v branch constructs a TransposeMatrixOp inline without calling PreRun, then passes it to getCublasSupportedTensor. GetSupportedTensor assumes transform ops have already run and reads in.Data() directly, which is null here. No test exercises matmul(conj(a), b) with a raw tensor, so the defect goes undetected. The rest of the change — hermitianT dispatch, orderA/B/C separation, cache-key updates — is clean.
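
For reference, a minimal call that would exercise the missing case looks like this (a sketch against the public MatX API; shapes are illustrative):

```cpp
// conj() applied to a bare tensor view, the case the review flags as untested.
auto a = matx::make_tensor<cuda::std::complex<float>>({4, 8});
auto b = matx::make_tensor<cuda::std::complex<float>>({8, 2});
auto c = matx::make_tensor<cuda::std::complex<float>>({4, 2});

// Takes the is_conj_tensor_view_unary_op_v fast path; per the review, this
// currently reaches cuBLASLt with a null data pointer.
(c = matx::matmul(matx::conj(a), b)).run();
```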

include/matx/transforms/matmul/matmul_cuda.h — the is_conj_tensor_view_unary_op_v branch of WithMatmulOperand (lines 1316-1322).

Important Files Changed

  • include/matx/transforms/matmul/matmul_cuda.h: Core change. Separates memory-order tracking (orderA/B/C) from cuBLAS operation flags (opA/opB), and adds the WithMatmulOperand dispatch that handles hermitianT(A) correctly but has a null-pointer bug in the conj(tensor_view) branch due to using an unrun TransposeMatrixOp.
  • include/matx/operators/hermitian.h: Adds an Input() accessor to HermitianTransOp so the inner operand can be extracted without a copy in the matmul fast path.
  • include/matx/operators/unary_operators.h: Adds an Input() accessor to matxUnaryOp to expose the inner operand for type-trait inspection in matmul dispatch.
  • test/00_transform/MatMul.cu: New hermitian-transpose tests cover hermitianT(A/B), conj(transpose_matrix(A/B)), batched hermitianT, and real-type hermitianT; the conj(bare_tensor) path introduced in WithMatmulOperand has no corresponding test.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["matmul_impl(C, A_, B_)"] --> B{"c is col-major?"}
    B -- yes --> C["matmul_impl with transposed args"]
    B -- no --> D["WithMatmulOperand(A_)"]
    D --> E{"can_use_metadata_op?"}
    E -- no --> F["getCublasSupportedTensor(A_) + opA=N"]
    E -- yes --> G{"is_hermitian_trans_op_v?"}
    G -- yes --> H["extract inner tensor\nopA=CUBLAS_OP_C\nzero-copy"]
    G -- no --> I{"is_conj_tensor_view_unary_op_v?"}
    I -- yes --> J["transpose_matrix(input)\nPreRun not called - null ptr\nopA=CUBLAS_OP_C"]
    I -- no --> F
    H --> K["WithMatmulOperand(B_)"]
    F --> K
    J --> K
    K --> L["MatMulCUDAExecPrepared"]
    L --> M["GetGemmParams\norderA/B from strides\nopA/opB from requested"]
    M --> N["cuBLASLt / CUTLASS"]
```

Reviews (2): Last reviewed commit: "fixes for reviews"

Comment thread include/matx/transforms/matmul/matmul_cuda.h
Comment thread include/matx/transforms/matmul/matmul_cuda.h Outdated
@simonbyrne
Collaborator Author

/build

Comment thread include/matx/transforms/matmul/matmul_cuda.h
@coveralls

coveralls commented May 8, 2026

Coverage Status

Coverage is 94.316% when pulling sbyrne/hermitian into main. No base build found for main.

