Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972] by dor-forer · Pull Request #973 · RedisAI/VectorSimilarity

dor-forer · 2026-05-31T13:01:54Z

Describe the changes in the pull request

Add asymmetric SQ8↔FP16 SIMD distance kernels (IP, L2, Cosine) for ARM tiers: NEON_HP, SVE, SVE2. Stacked on PR #970 (MOD-14954), which delivers the x86 equivalents.

The SVE hot loop uses svld1uh_u32 to zero-extend each FP16 halfword into a 32-bit lane, allowing svcvt_f32_f16_x to read the correct bits directly. The NEON residual mirrors the SQ8_FP32 NEON sister: three independent 4-lane sub-steps (r>=4/8/12) leaving at most 3 elements for scalar, replacing the previous single 8-lane block + up-to-7 software conversions.

Additional improvements (co-authored by @lerman25, originally from PR #975):

NEON_HP widening optimization: replaced the u8→u16→u32→f32 path (two vmovl_u16 + vcvtq_f32_u32) with the shorter u8→u16→f16→f32 path (vcvtq_f16_u16 + vcvt_f32_f16), reducing instruction count in the main loop.
Dedicated SVE2 kernels (IP_SVE2_SQ8_FP16.h, L2_SVE2_SQ8_FP16.h): use svmlalb_f32/svmlalt_f32 (SVE2-only widening multiply-accumulate) for a tighter inner loop vs the generic SVE path.
Benchmark registration: spaces_sq8_fp16 is now wired into all relevant benchmark categories (benchmarks-all, benchmarks-default, bm-spaces-sq8-full, bm-spaces, and the new bm-spaces-sq8-fp16 category).

Which issues this PR fixes

MOD-14972

Main objects this PR modified

src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h — NEON_HP IP kernel (+ widening optimization)
src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h — SVE IP kernel
src/VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h — new dedicated SVE2 IP kernel
src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h — NEON_HP L2 kernel
src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h — SVE L2 kernel
src/VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h — new dedicated SVE2 L2 kernel
src/VecSim/spaces/functions/NEON_HP.{h,cpp}, SVE.{h,cpp}, SVE2.{h,cpp} — chooser symbols
src/VecSim/spaces/IP_space.cpp, L2_space.cpp — AArch64 dispatcher blocks
tests/unit/test_spaces.cpp — tier-walk tests for NEON_HP / SVE / SVE2
tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp — ARM microbench registrations
tests/benchmark/benchmarks.sh — benchmark category registration for spaces_sq8_fp16

Mark if applicable

This PR introduces API changes
This PR introduces serialization changes

Note

Medium Risk
Changes hot-path distance math on AArch64 with multiple ISA variants; wrong dispatch or numeric drift could affect search quality, though unit tests compare against scalar baselines.

Overview
Adds ARM64 SIMD for asymmetric SQ8 storage ↔ FP16 query distances (inner product, L2², cosine), complementing existing x86 paths.

New kernels cover NEON half-precision (asimdhp and optional FHM asimdfhm with widening FMA), SVE (FP16 loads via svld1uh_u32), and a dedicated SVE2 path using svmlalb_f32 / svmlalt_f32 for a denser FP16 dot loop. L2 reuses the IP core plus the usual sum-of-squares identity. Runtime dispatch in IP_space.cpp / L2_space.cpp prefers SVE2 → SVE → NEON_FHM → NEON_HP when dim ≥ 16.

Tests and benchmarks gain tier walk-down checks and register spaces_sq8_fp16 (including NEON_FHM) in the benchmark harness.

^{Reviewed by Cursor Bugbot for commit 3118914. Bugbot is set up for automated code reviews on this repo. Configure here.}

CLAassistant · 2026-05-31T13:02:02Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ lerman25
✅ dor-forer
❌ Ubuntu

Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

jit-ci · 2026-05-31T13:03:37Z

🛡️ Jit Security Scan Results

✅ No security findings were detected in this PR

^{Security scan by Jit}

codecov · 2026-05-31T13:22:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.09%. Comparing base (daf391f) to head (3118914).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #973      +/-   ##
==========================================
- Coverage   97.11%   97.09%   -0.03%     
==========================================
  Files         141      141              
  Lines        8110     8110              
==========================================
- Hits         7876     7874       -2     
- Misses        234      236       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing ARM TU compile flags. Scalar fallback already on main serves as reference. Bakes in PR #970 review lessons (assert(dim>=16), 4-accumulator ILP, formula anchor, load_unaligned<float> metadata, dispatcher-routed tier-walk tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md. Each task ends in a commit; assistant runs tests/ASan/benchmarks after the user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…OD-14972] The 9 ARM tier blocks (L2/IP/Cosine × SVE2/SVE/NEON_HP) were missing ASSERT_EQ(alignment, 0) after each ASSERT_NEAR, unlike the SQ8_FP32 sister blocks which assert it. Adds the assertions to lock the contract that ARM tiers leave the caller's alignment value untouched.

…D-14972] svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] ← FP16[2e]. The step function loaded chunk consecutive FP16 values into positions 0..chunk-1, then passed them directly to svcvt_f32_f16_x, which picked positions 0,2,4,... and silently skipped positions 1,3,5,... For chunk=4 (128-bit SVE), only 2 of 4 FP16 values per step were used, producing wrong dot products. Fix: svzip1_f16(q_h, zeros) spreads values to even positions [v0,0,v1,0,...] so FCVT correctly reads v[0],v[1],v[2],... Applied to both the full step helper and the partial-chunk path. Discovered and fixed during ARM host verification (Task 14, MOD-14972).

…D-14972] SVE hot loop: replace svzip1_f16+svdup_f16+svwhilelt_b16 (4 ops) with svld1uh_u32 (1 op) — zero-extends each FP16 halfword into a 32-bit lane so svcvt_f32_f16_x reads the correct bits directly. Same fix applied to the partial-chunk path, which also drops the now-redundant pg16_partial predicate. Accumulator combine changed from svadd_f32_x to svadd_f32_z to match the SQ8_FP32 SVE sister. NEON residual: replace the single 8-lane block + up-to-7 software-scalar iterations with three independent 4-lane sub-steps (r>=4, r>=8, r>=12), leaving at most 3 elements for scalar — mirrors the SQ8_FP32 NEON sister exactly. Eliminates expensive vecsim_types::FP16_to_FP32 calls for residuals 4..15 (previously up to 7 software conversions per call). Both IP headers: remove assert()+<cassert> (no sister kernel uses them). Both L2 headers: drop redundant float16.h include and using declarations (arrive transitively through the included IP header).

…MOD-14972] - Remove docs/superpowers/ design and plan files (~1550 lines); sister PR #970 removed its equivalent doc before merge. - Drop 5-line "No alignment write" prose comment from the three AArch64 NEON_HP dispatcher blocks; the sister SQ8_FP32 ARM dispatchers carry no such comment — the absent alignment write already encodes the intent. - Trim GetDistFuncSQ8FP16Asymmetric to a 7-line template-mapping check at dim=15, matching the shape of GetDistFuncSQ8Asymmetric (SQ8_FP32 sister). The scalar-fallback assertion it previously duplicated is already covered by the trailing block of SQ8_FP16_SpacesOptimizationTest.

lerman25 · 2026-06-01T19:57:31Z

SQ8↔FP16 ARM kernels: proposed optimizations + benchmark results

I profiled the ARM SQ8↔FP16 distance kernels from this PR and prototyped two optimizations, benchmarked head-to-head against this PR's code on a Graviton-4 (Neoverse V2) arm64 runner via bm-spaces.

To keep the comparison clean, both runs are identical except for the kernel changes, and both register spaces_sq8_fp16 in benchmarks.sh (it was built but never emitted by any CI setup, so the kernels had no benchmark coverage):

Baseline (control): Pr 973 bench baseline #974 — this PR + benchmark registration
Treatment (proposed opts): Pr 973 bench treatment #975 — baseline + NEON_HP widening + SVE2 FMLALB/FMLALT kernel

Speedup = baseline CPU-time ÷ treatment CPU-time (>1 = faster). 36 dimension points per tier.

Tier	Metric	Median speedup	Range	Median ns (base → treat)
SVE2	L2	1.83×	1.02–1.88	93.6 → 51.0
SVE2	IP	1.84×	1.00–1.89	93.6 → 50.9
SVE2	Cosine	1.84×	1.00–1.89	93.6 → 50.8
NEON_HP	L2	1.05×	1.02–1.15	103.0 → 97.7
NEON_HP	IP	1.05×	1.02–1.09	103.0 → 97.8
NEON_HP	Cosine	1.05×	1.02–1.10	103.0 → 97.7
SVE (untouched control)	all	1.00×	1.00–1.01	unchanged

What changed

SVE2 (~1.83–1.84×, near the theoretical 2×): a dedicated kernel using the FMLALB/FMLALT widening multiply-accumulate pair (svmlalb_f32/svmlalt_f32), keeping storage+query at 16 bits and processing svcnth lanes/step vs the base svcntw — doubling throughput and removing explicit query widening. By bucket: high_dim 1.86×, residual 1.85×, low_dim 1.55–1.58× (per-call overhead dilutes small dims), 512-bit chunks median 1.66× with min ~1.00× at Dimension:16 (single chunk → nothing to amortize, no regression).
NEON_HP (steady ~1.05×): widen SQ8 storage uint8 → fp16 → fp32 via vcvtq_f16_u16 (values 0..255 are exact in FP16), dropping two integer-widening ops per 16-element chunk with identical FP32 lane values.

Why this is trustworthy

The untouched base SVE tier reads 1.00×, confirming the comparison is clean (identical code → identical perf) — so the gains are attributable to the kernels, not run-to-run noise. No regressions anywhere (min speedup ≥ 1.00 in every tier).

Happy to fold these into this PR if you'd like — the diffs are small and isolated to the IP/L2 SVE2 headers + the NEON_HP step + the SVE2 chooser.

The bm_spaces_sq8_fp16 executable is built but was never emitted by benchmarks.sh, so no CI label (bm-spaces / benchmarks-all) would run it. Register it in bm-spaces, bm-spaces-sq8-full, benchmarks-all and benchmarks-default, and add a dedicated bm-spaces-sq8-fp16 case.

…MLALT kernel NEON_HP: widen SQ8 storage uint8->fp16->fp32 via vcvtq_f16_u16 (values 0..255 are exact in FP16), dropping two integer-widening ops per 16-element chunk with identical FP32 lane values. SVE2: dedicated kernel keeping storage+query at 16-bit and using the FMLALB/FMLALT widening multiply-accumulate pair (svmlalb_f32/svmlalt_f32). Processes svcnth() lanes/step (2x the base-SVE svcntw() granularity) and removes explicit query widening/conversion, roughly halving hot-loop loads and instructions. Wired into SVE2.cpp IP/Cosine/L2 choosers at svcnth granularity.

…-14972] Add an asimdfhm-gated NEON_FHM tier for SQ8<->FP16 IP / L2 / Cosine. Instead of widening both operands to FP32 and issuing vfmaq_f32 (the NEON_HP path), it uses vfmlalq_low/high_f16 (FMLAL/FMLAL2) to multiply the FP16 lanes directly into FP32 accumulators, removing all 8 vcvt_f32_f16 per 16 lanes. SQ8 storage is widened uint8->fp16 (exact for 0..255) and the FP16 query is consumed in place. FMLAL widens fp16->fp32 before the multiply, so accuracy matches the scalar baseline. Dispatchers prefer NEON_FHM over NEON_HP when features.asimdfhm is set. The IP core is templated on use_fhm so L2/Cosine and the residual tail are shared. Tier-walk unit tests and microbenchmarks cover the new path. Measured ~1.95x over NEON_HP at high/medium dims on asimdfhm hardware. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dor-forer added the benchmarks-all label May 31, 2026

dor-forer requested review from BenGoldberger, lerman25 and ofiryanai May 31, 2026 14:44

dor-forer force-pushed the dor-forer-sq8-fp16-arm-kernels-mod-14972 branch from 6f6ef26 to 4ac05ac Compare June 1, 2026 15:01

Base automatically changed from dor-forer-sq8-fp16-x86-kernels-mod-14954 to main June 1, 2026 15:10

dor-forer and others added 19 commits June 1, 2026 15:12

Add NEON_HP SQ8↔FP16 IP kernel header [MOD-14972]

4f0534c

Add NEON_HP SQ8↔FP16 L2 kernel header [MOD-14972]

d3c6415

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wire NEON_HP SQ8↔FP16 choosers [MOD-14972]

69cee3d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Dispatch SQ8↔FP16 to NEON_HP tier on AArch64 [MOD-14972]

1b36b38

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extend SQ8↔FP16 tier-walk tests with NEON_HP [MOD-14972]

1af4812

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add SVE SQ8↔FP16 IP kernel header [MOD-14972]

0ce0bce

Add SVE SQ8↔FP16 L2 kernel header [MOD-14972]

eb4952a

Wire SVE/SVE2 SQ8↔FP16 choosers [MOD-14972]

fcb01bb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Dispatch SQ8↔FP16 to SVE/SVE2 tiers on AArch64 [MOD-14972]

15fca69

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extend SQ8↔FP16 tier-walk tests with SVE/SVE2 [MOD-14972]

0fcd7d0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Register ARM SQ8↔FP16 microbenchmarks [MOD-14972]

6a783f8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Apply clang-format [MOD-14972]

10c03aa

Apply clang-format 18.1.8 (matches CI) [MOD-14972]

e1647dc

dor-forer force-pushed the dor-forer-sq8-fp16-arm-kernels-mod-14972 branch from 4ac05ac to e1647dc Compare June 1, 2026 15:23

lerman25 mentioned this pull request Jun 1, 2026

Pr 973 bench treatment #975

Draft

2 tasks

lerman25 added 3 commits June 2, 2026 07:19

style: clang-format SVE2.cpp

ed1860c

lerman25 mentioned this pull request Jun 2, 2026

Pr 973 bench FMLAL (FEAT_FHM) #976

Draft

lerman25 approved these changes Jun 2, 2026

View reviewed changes

dor-forer enabled auto-merge June 2, 2026 12:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]#973

Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]#973
dor-forer wants to merge 23 commits into
mainfrom
dor-forer-sq8-fp16-arm-kernels-mod-14972

dor-forer commented May 31, 2026 •

edited by cursor Bot

Loading

Uh oh!

CLAassistant commented May 31, 2026 •

edited

Loading

Uh oh!

jit-ci Bot commented May 31, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 31, 2026 •

edited

Loading

Uh oh!

lerman25 commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dor-forer commented May 31, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CLAassistant commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jit-ci Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🛡️ Jit Security Scan Results

Uh oh!

codecov Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

lerman25 commented Jun 1, 2026

SQ8↔FP16 ARM kernels: proposed optimizations + benchmark results

What changed

Why this is trustworthy

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dor-forer commented May 31, 2026 •

edited by cursor Bot

Loading

CLAassistant commented May 31, 2026 •

edited

Loading

jit-ci Bot commented May 31, 2026 •

edited

Loading

codecov Bot commented May 31, 2026 •

edited

Loading