add no-copy hermitian transpose support to matmul#1173
simonbyrne wants to merge 2 commits into main
/build
Greptile Summary
This PR avoids materializing an intermediate matrix when
Confidence Score: 3/5
The hermitianT(A) fast path is correct and zero-copy, but the conj(bare tensor_view) branch hands cuBLASLt a null GPU pointer and will crash at runtime.
File needing attention: include/matx/transforms/matmul/matmul_cuda.h, specifically the is_conj_tensor_view_unary_op_v branch of WithMatmulOperand (lines 1316-1322).
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["matmul_impl(C, A_, B_)"] --> B{"c is col-major?"}
B -- yes --> C["matmul_impl with transposed args"]
B -- no --> D["WithMatmulOperand(A_)"]
D --> E{"can_use_metadata_op?"}
E -- no --> F["getCublasSupportedTensor(A_) + opA=N"]
E -- yes --> G{"is_hermitian_trans_op_v?"}
G -- yes --> H["extract inner tensor\nopA=CUBLAS_OP_C\nzero-copy"]
G -- no --> I{"is_conj_tensor_view_unary_op_v?"}
I -- yes --> J["transpose_matrix(input)\nPreRun not called - null ptr\nopA=CUBLAS_OP_C"]
I -- no --> F
H --> K["WithMatmulOperand(B_)"]
F --> K
J --> K
K --> L["MatMulCUDAExecPrepared"]
L --> M["GetGemmParams\norderA/B from strides\nopA/opB from requested"]
M --> N["cuBLASLt / CUTLASS"]
Reviews (2): Last reviewed commit: "fixes for reviews"
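The dispatch in the flowchart above can be sketched in plain C++ as a compile-time selection of a pointer and an operation flag. This is only an illustration under stated assumptions: `Tensor`, `HermitianT`, `Scaled`, `Op`, `Operand`, and `select_operand` are hypothetical stand-ins invented for this sketch, not the MatX types or the code in matmul_cuda.h, and the "materialize" fallback is simulated with a caller-provided scratch pointer.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for illustration only; these are not MatX types.
struct Tensor { const float* data; };                       // view over real storage
template <typename T> struct HermitianT { T inner; };       // lazy A^H wrapper
template <typename T> struct Scaled { T inner; float s; };  // some other lazy operator

enum class Op { N, T, C };  // mirrors CUBLAS_OP_N / _T / _C in spirit

struct Operand {
  const float* ptr;  // pointer that would be handed to the GEMM backend
  Op op;             // operation the backend should apply to that operand
  bool copied;       // true if a temporary had to be materialized
};

// True only for a Hermitian transpose wrapped directly around a tensor.
template <typename E> struct is_hermitian_of_tensor : std::false_type {};
template <> struct is_hermitian_of_tensor<HermitianT<Tensor>> : std::true_type {};

template <typename E>
Operand select_operand(const E& e, const float* scratch) {
  if constexpr (is_hermitian_of_tensor<E>::value) {
    // Fast path: reuse the inner tensor's storage and let the backend
    // apply the conjugate transpose. Only valid if that storage exists.
    assert(e.inner.data != nullptr);
    return {e.inner.data, Op::C, false};
  } else if constexpr (std::is_same_v<E, Tensor>) {
    // Plain tensor: pass through unchanged.
    return {e.data, Op::N, false};
  } else {
    // Anything else: fall back to a materialized temporary (simulated here).
    (void)e;
    return {scratch, Op::N, true};
  }
}
```

Calling `select_operand` on a `HermitianT<Tensor>` returns the original data pointer with `Op::C` and no copy, while an unrelated lazy operator such as `Scaled<Tensor>` takes the fallback. The assertion in the fast path reflects the concern raised in the review: a metadata-only path is safe only when the pointer it forwards refers to storage that actually exists.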
d4adbb6 to c283249
/build
At the moment, calling matmul(hermitianT(A), B) or matmul(conj(transpose_matrix(A)), B) will instantiate a temporary intermediate matrix. This PR changes it to call cuBLAS directly with the appropriate operation (similar to how matmul(transpose_matrix(A), B) is handled).
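To illustrate why an operation flag can replace the temporary, here is a minimal CPU reference sketch, assuming row-major storage. The names `GemmOp`, `load`, `gemm`, and `hermitian_copy` are hypothetical and exist only for this example; this is not the MatX or cuBLAS implementation. With `GemmOp::C`, the conjugate transpose is applied while reading A, so no copy of A is created.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cf = std::complex<float>;

enum class GemmOp { N, C };  // hypothetical: "as stored" vs "conjugate transpose"

// Element (i, j) of op(A), where A is stored row-major with `cols` columns.
inline cf load(const std::vector<cf>& a, std::size_t cols, GemmOp op,
               std::size_t i, std::size_t j) {
  return op == GemmOp::N ? a[i * cols + j] : std::conj(a[j * cols + i]);
}

// C (m x n) = op(A) (m x k) * B (k x n), all row-major.
// For GemmOp::C the stored A is k x m and is read in place.
std::vector<cf> gemm(GemmOp op_a, const std::vector<cf>& a,
                     const std::vector<cf>& b,
                     std::size_t m, std::size_t k, std::size_t n) {
  assert(a.size() == m * k && b.size() == k * n);
  const std::size_t a_cols = (op_a == GemmOp::N) ? k : m;
  std::vector<cf> c(m * n, cf{0.0f, 0.0f});
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t p = 0; p < k; ++p)
        c[i * n + j] += load(a, a_cols, op_a, i, p) * b[p * n + j];
  return c;
}

// Baseline for comparison: materialize A^H as a separate matrix.
// Input is rows x cols; output is cols x rows.
std::vector<cf> hermitian_copy(const std::vector<cf>& a,
                               std::size_t rows, std::size_t cols) {
  std::vector<cf> out(rows * cols);
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      out[c * rows + r] = std::conj(a[r * cols + c]);
  return out;
}
```

For a 2 x 3 matrix A and a 2 x 2 matrix B, `gemm(GemmOp::C, A, B, 3, 2, 2)` should give the same result as `gemm(GemmOp::N, hermitian_copy(A, 2, 3), B, 3, 2, 2)`, with the first call skipping the intermediate allocation.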