Improve performance on edges of GEMM for RISC-V #5674
ChipKerchner wants to merge 13 commits into OpenMathLib:develop
Small values of M and/or N are common in situations like AI inferencing.
This brings utilization closer to peak numbers.
@martin-frbg What's going on here?
This has nothing to do with this patch, but could we limit the number of reported failures (or differences) to a reasonable number (like 200 or so)? The current output makes for some huge log files and makes looking for failures difficult.
BTW, these are probably reasonable differences for SHGEMM, since they are only off in the last bit.
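The suggestion to cap the reported differences could look roughly like the sketch below. It is not taken from the OpenBLAS test code: `report_mismatches`, its parameters, and the message format are hypothetical, and the absolute-tolerance comparison is only an assumption about how the check might be written.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of the OpenBLAS tests): counts every element
 * whose absolute difference exceeds tol, but prints at most max_report of
 * them. The number of lines actually printed is stored in *printed. */
static size_t report_mismatches(const float *got, const float *ref, size_t n,
                                float tol, size_t max_report, size_t *printed)
{
    size_t total = 0, shown = 0;

    for (size_t i = 0; i < n; i++) {
        float d = got[i] - ref[i];
        if (d < 0.0f) d = -d;
        if (d <= tol) continue;          /* a NaN difference fails this test and is counted */
        total++;
        if (shown < max_report) {
            fprintf(stderr, "mismatch at %zu: got %g, expected %g\n",
                    i, (double)got[i], (double)ref[i]);
            shown++;
        }
    }
    if (total > shown)
        fprintf(stderr, "... %zu further mismatches not shown\n", total - shown);
    if (printed) *printed = shown;
    return total;
}
```

A call such as `report_mismatches(c, c_ref, m * n, tol, 200, NULL)` would still return the full failure count for the pass/fail decision while keeping the log short.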
If it's C910V, that's a (qemu) thread lockup in a forked process that is not reproducible on actual hardware so far.
Ok, let me report the QEMU issue to my team members.
That's probably why we had them printed at some point, to see if they're all reasonable (and whether the test criteria need to be adjusted)?
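If the test criteria were adjusted to accept results that are "only off in the last bit", one option is a comparison in units in the last place (ULPs). The sketch below assumes the compared values are IEEE-754 binary32 floats; `ordered_bits`, `ulp_distance`, and `within_ulps` are hypothetical names, not existing OpenBLAS functions.

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to a signed integer that increases monotonically
 * with the float's value (+0.0 and -0.0 both map to 0). */
static int64_t ordered_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return (u & 0x80000000u) ? -(int64_t)(u & 0x7FFFFFFFu) : (int64_t)u;
}

/* Number of representable floats between a and b (0 if they are equal). */
static int64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

/* Returns 1 if a and b differ by at most max_ulps; NaN never matches. */
static int within_ulps(float a, float b, int64_t max_ulps)
{
    if (a != a || b != b) return 0;
    return ulp_distance(a, b) <= max_ulps;
}
```

With `max_ulps` set to 1, a result that differs from the reference only in the last mantissa bit would pass, while larger deviations would still be flagged.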
FYI - in the graphs above, M and N are modulo the block size - in this case 16x8. But the biggest performance gains are for small M values.
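To make "modulo the block size" concrete, the sketch below splits M and N into a part covered by full 16x8 blocks and a leftover edge. The block dimensions come from the comment above; the `edge_split` struct and `split_edges` function are hypothetical and only illustrate the arithmetic, not how the kernel organizes its loops.

```c
/* Block dimensions as stated in the discussion (16 rows by 8 columns). */
enum { BLOCK_M = 16, BLOCK_N = 8 };

/* Hypothetical helper type: rows/columns handled by full blocks vs. edges. */
typedef struct {
    long m_full, m_edge;
    long n_full, n_edge;
} edge_split;

static edge_split split_edges(long M, long N)
{
    edge_split s;
    /* Both block sizes are powers of two, so a mask yields the remainder. */
    s.m_edge = M & (BLOCK_M - 1);
    s.n_edge = N & (BLOCK_N - 1);
    s.m_full = M - s.m_edge;
    s.n_full = N - s.n_edge;
    return s;
}
```

For example, M = 35 and N = 20 leave an edge of 3 rows and 4 columns after two full row blocks and two full column blocks. For small M and N, the edge accounts for most or all of the work.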
Unfortunately it's a bit complicated - stock qemu doesn't support C910V (to my knowledge), so CI is using some Xuantie fork of qemu9 that may or may not be actively maintained.
Since C910V only supports RVV 0.7.1, maybe this build shouldn't get kicked off on every check-in. RVV 1.0 was ratified about 3 years ago. Unfortunately, I guess until there is more silicon, maybe this is needed?
I wonder how many (other) early adopters bought a MilkV Pioneer or similar... by the time everybody has dumped them, we're probably looking at RVV1.5 or the like and the cycle begins anew. I guess I could look into disabling the utest in CI so that we get at least the build and BLAS tests.
The way I hear it, after RVA23 there won't be a major upgrade until RVA30 (?). There may be a minor one, though. But there are plenty of optional extensions to use, and more are being ratified. I think RVV 1.0 is the way to go. RVV 0.7.1 was mainly a hack.
Reminds me of Felix LeClerc's FOSDEM talk... we'll see how RISCV64 standards evolve, but the earlier generation(s?) of miscreants probably won't go away soon.
    const BLASLONG m_edge = M & 15;
    const bool S = (M == (ldc & 0xF));
    if (K <= 0) return 0;
There was actually a bug in the original for K = 0. It would add the first A times B onto C when it should not.
Good catch, very unfortunate that the tests apparently did not catch it.
Maybe the check length is related to K (?) - which doesn't make sense since it should check the whole M x N input/output. Either that or something isn't calling GEMM for K = 0.
Ah right, there is an early exit for K=0 (or alpha=0) in the level3 driver (driver/level3/level3.c or driver/level3/level3_thread.c depending on whether multithreading or not) that must have taken care of it.
(Can't hurt to have it correct in the kernel too, in case the driver code changes for whatever reason)
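The K = 0 problem described in this thread can be illustrated with a scalar stand-in for the kernel. This is a hedged sketch, not the RVV kernel from the patch: `gemm_kernel_sketch`, its packed A/B layout, and its argument order are assumptions. It only shows how a loop whose first product is peeled off to initialize the accumulator would add A times B onto C for K = 0 without the guard.

```c
/* Hypothetical scalar micro-kernel: C[M x N] += alpha * A * B.
 * Assumed layout: A holds K consecutive panels of M values,
 * B holds K consecutive panels of N values, C is column-major with ldc. */
static int gemm_kernel_sketch(long M, long N, long K, float alpha,
                              const float *A, const float *B,
                              float *C, long ldc)
{
    /* Guard from the patch: with no K iterations there is nothing to
     * accumulate, so C must be left untouched and A/B must not be read. */
    if (K <= 0) return 0;

    for (long j = 0; j < N; j++) {
        for (long i = 0; i < M; i++) {
            /* The first product initializes the accumulator (peeled loop).
             * Without the guard above, this line would run even for K = 0. */
            float acc = A[i] * B[j];
            for (long k = 1; k < K; k++)
                acc += A[i + k * M] * B[j + k * N];
            C[i + j * ldc] += alpha * acc;
        }
    }
    return 0;
}
```

Keeping the check inside the kernel means the result stays correct even if a caller does not filter out K = 0 beforehand.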
It is possible that the failures with BGEMM are due to bfloat16 rounding of C. I mentioned this previously because I think it is incorrect. @Mousius
    float res0;
For BGEMM:
    #define TO_F32(x) (bfloat16tof32(x))
This causes double (2X) rounding.
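The general effect of double rounding with bfloat16 can be reproduced in a few lines. The sketch below does not claim to show where the BGEMM kernel or the test rounds. The helpers `f32_to_bf16`, `bf16_to_f32`, `update_round_once`, and `update_round_twice` are hypothetical, use round-to-nearest-even, and ignore NaN handling. They only demonstrate that rounding an intermediate value to bfloat16 before the final rounding can change the result.

```c
#include <stdint.h>
#include <string.h>

/* float -> bfloat16 with round-to-nearest-even (NaN handling omitted). */
static uint16_t f32_to_bf16(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);
    return (uint16_t)(u >> 16);
}

/* bfloat16 -> float is exact: the 16 bits become the upper half. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Single rounding: add the float product sum p to C, round once. */
static uint16_t update_round_once(uint16_t c, float p)
{
    return f32_to_bf16(bf16_to_f32(c) + p);
}

/* Double rounding: p is first rounded to bfloat16, then the sum is
 * rounded again when it is stored back to C. */
static uint16_t update_round_twice(uint16_t c, float p)
{
    uint16_t p16 = f32_to_bf16(p);
    return f32_to_bf16(bf16_to_f32(c) + bf16_to_f32(p16));
}
```

With C = 1.0 and p = 2^-8 + 2^-17, the single-rounding path yields 1.0078125 (0x3F81), while the double-rounding path yields 1.0 (0x3F80). The first rounding drops the 2^-17 term, which turns the second rounding into an exact tie that resolves to even.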
Improve performance on edges of GEMM for RISC-V - up to 9X faster.