Hello!
I’m interested in NVFP4 W4A4 GEMM kernels, and while looking into related work I came across Qutlass, so I wanted to ask a question. First of all, thank you for releasing such an impressive piece of work — I think it will be very helpful for my research.
What I’m curious about is how global scaling is handled in NVFP4. As far as I understand, Qutlass is based on CUTLASS and performs NVFP4 @ NVFP4 GEMM operations.
Focusing on the weights only: given pre-computed quantized weights, local (block) scales, and a global scale, I was able to confirm that dequantizing the quantized weights with the local scales is performed as a block-scaled operation on the tensor cores. However, I haven’t been able to figure out where and how the global scale is applied in this process, so I’m reaching out to ask.
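To make my mental model concrete, here is a small reference sketch of the dequantization I have in mind (the names and the 16-element block size are my assumptions, not taken from QuTLASS):

```python
import numpy as np

BLOCK = 16  # assumed NVFP4 micro-block size

def dequantize_nvfp4(q, local_scales, global_scale):
    """Hypothetical reference dequantization: W ~= global_scale * local_scale[block] * q.

    q: FP4 values already decoded to float, shape (N,)
    local_scales: one scale per BLOCK elements, shape (N // BLOCK,)
    global_scale: a single per-tensor scalar
    """
    q = np.asarray(q, dtype=np.float32)
    scales = np.repeat(np.asarray(local_scales, dtype=np.float32), BLOCK)
    # Block-scaled part: this is what I can see happening on the tensor cores.
    w_blocked = q * scales
    # Global-scale part: this is the step I cannot locate in the kernel.
    return global_scale * w_blocked

# Toy example: 32 unit values split into two blocks.
w = dequantize_nvfp4(np.ones(32), [0.5, 2.0], global_scale=4.0)
```

My question is essentially whether the `global_scale` multiply above happens in the epilogue, is folded into another scale, or is handled somewhere else entirely.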
Could you possibly provide some clarification on this?
Thank you!