Improve performance on edges of GEMM for RISC-V by ChipKerchner · Pull Request #5674 · OpenMathLib/OpenBLAS

ChipKerchner · 2026-03-12T12:46:25Z

Improve performance on edges of GEMM for RISC-V - up to 9X faster.

ChipKerchner · 2026-03-12T12:47:41Z

Small values of M and/or N are useful for situations like AI inferencing.

ChipKerchner · 2026-03-12T12:50:47Z

This increases the utilization to nearer peak numbers.

ChipKerchner · 2026-03-12T13:02:17Z

@martin-frbg What's going on here?

TEST 126/128 fork:safety Attempt 10 timed out (retrying...) All 10 attempts failed, giving up.

ChipKerchner · 2026-03-12T13:33:24Z

This has nothing to do with this patch but could we limit the number of failures (or differences) to a reasonable number (like 200 or so)? It makes for some huge log files and makes looking for failures difficult.

2026-03-12T13:22:24.3299509Z CC 261.500000 DD 261.181854
2026-03-12T13:22:24.3328792Z CC 268.875000 DD 269.182281
2026-03-12T13:22:24.3329295Z CC 259.125000 DD 258.630676
2026-03-12T13:22:24.3330928Z CC 256.062500 DD 256.341339
2026-03-12T13:22:24.3332162Z CC 261.937500 DD 262.319489
2026-03-12T13:22:24.3373825Z SHGEMM FAILURES: 527559!!!
2026-03-12T13:23:13.9093248Z BGEMM FAILURES: 224

BTW, these are probably reasonable differences for SHGEMM - since they are only off in the last bit.

martin-frbg · 2026-03-12T14:54:26Z

> @martin-frbg What's going on here?
>
> TEST 126/128 fork:safety Attempt 10 timed out (retrying...) All 10 attempts failed, giving up.

If it's C910V, that's a (QEMU) thread lockup in a forked process that so far is not reproducible on actual hardware.

ChipKerchner · 2026-03-12T14:56:02Z

Ok, let me report the QEMU issue to my team members.

martin-frbg · 2026-03-12T14:57:55Z

> This has nothing to do with this patch but could we limit the number of failures (or differences) to a reasonable number (like 200 or so)? It makes for some huge log files and makes looking for failures difficult.
>
> 2026-03-12T13:22:24.3299509Z CC 261.500000 DD 261.181854
> 2026-03-12T13:22:24.3328792Z CC 268.875000 DD 269.182281
> 2026-03-12T13:22:24.3329295Z CC 259.125000 DD 258.630676
> 2026-03-12T13:22:24.3330928Z CC 256.062500 DD 256.341339
> 2026-03-12T13:22:24.3332162Z CC 261.937500 DD 262.319489
> 2026-03-12T13:22:24.3373825Z SHGEMM FAILURES: 527559!!!
> 2026-03-12T13:23:13.9093248Z BGEMM FAILURES: 224
>
> BTW, these are probably reasonable differences for SHGEMM - since they are only off in the last bit.

That's probably why we had them printed at some point: to see whether they're all reasonable (and whether the test criteria need to be adjusted)?

ChipKerchner · 2026-03-12T17:28:44Z

FYI - in the graphs above M and N are modulo the block size - in this case 16x8. But the biggest performance gains are for small M values.

martin-frbg · 2026-03-12T20:07:08Z

> Ok, let me report the QEMU issue to my team members.

Unfortunately it's a bit complicated - stock qemu doesn't support C910V (to my knowledge) so CI is using some Xuantie fork of qemu9 that may or may not be actively maintained.

ChipKerchner · 2026-03-12T20:24:13Z

Since C910V only supports RVV 0.7.1, maybe this build shouldn't get kicked off on every check-in. RVV 1.0 was ratified about 3 years ago. Unfortunately, I guess this may be needed until there is more silicon?

martin-frbg · 2026-03-12T21:18:55Z

> Since C910V only supports RVV 0.7.1, maybe this build shouldn't get kicked off on every check-in. RVV 1.0 was ratified about 3 years ago. Unfortunately, I guess this may be needed until there is more silicon?

I wonder how many (other) early adopters bought a MilkV Pioneer or similar... by the time everybody has dumped them, we're probably looking at RVV1.5 or the like and the cycle begins anew. I guess I could look into disabling the utest in CI so that we get at least the build and BLAS tests

ChipKerchner · 2026-03-12T22:04:55Z

The way I hear it, after RVA23 there won't be a major upgrade until RVA30 (?). There may be a minor one, though. But there are plenty of optional extensions available to use, and more are being ratified.

I think RVV 1.0 is the way to go. RVV 0.7.1 was mainly a hack.

martin-frbg · 2026-03-12T22:24:21Z

reminds me of Felix LeClerc's FOSDEM talk... we'll see how RISCV64 standards evolve but the earlier generation(s?) of miscreants probably won't go away soon

ChipKerchner · 2026-03-16T14:31:35Z

kernel/riscv64/sgemm_kernel_16x8_zvl256b.c

-
+    const BLASLONG m_edge = M & 15;
+    const bool S = (M == (ldc & 0xF));
+    if (K <= 0) return 0;


There was actually a bug in the original for K = 0. It would add the first A times B onto C when it should not.

Good catch, very unfortunate that the tests apparently did not catch it

Maybe the check length is related to K (?) - which doesn't make sense since it should check the whole M x N input/output. Either that or something isn't calling GEMM for K = 0

Ah right, there is an early exit for K=0 (or alpha=0) in the level3 driver (driver/level3/level3.c or driver/level3/level3_thread.c depending on whether multithreading or not) that must have taken care of it.
(Can't hurt to have it correct in the kernel too, in case the driver code changes for whatever reason)

ChipKerchner · 2026-03-19T13:17:40Z

It is possible that the BGEMM failures are due to bfloat16 rounding of C. I mentioned this previously, since I think this is incorrect. @Mousius

float res0;
FLOAT *C0;
C0[0] = TO_OUTPUT(TO_F32(C0[0])+res0);

For BGEMM

#define TO_F32(x) (bfloat16tof32(x))
#define TO_OUTPUT(x) (f32tobfloat16(x))

This causes double (2X) rounding.

ChipKerchner added 2 commits March 11, 2026 21:07

Merge remote-tracking branch 'origin/develop' into HEAD

548a9f3

Fast performing edges for FP32 GEMM of RVV.

376d3a1

ChipKerchner marked this pull request as draft March 12, 2026 12:46

Add bool types for C.

6d6af1d

ChipKerchner added 5 commits March 13, 2026 15:59

Add K-unrolling to M = 8. Other small changes.

9c16449

Unroll K for N less than or equal to 4.

fda433f

Common unroll code.

eb9bbcc

Preserve K.

b0ee407

Better K.

010f24f


ChipKerchner added 5 commits March 16, 2026 21:32

Global optimizations.

f927b94

Use mf2 instead of m1.

79d9fe3

Simplier loops.

477dd40

More global optimzation and clean up.

d832ee5

Merge remote-tracking branch 'origin/develop' into fasterRVVEdges

1e48686
