Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]#973
Conversation
|
Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #973 +/- ##
==========================================
- Coverage 97.11% 97.09% -0.03%
==========================================
Files 141 141
Lines 8110 8110
==========================================
- Hits 7876 7874 -2
- Misses 234 236 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
6f6ef26 to
4ac05ac
Compare
Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing ARM TU compile flags. Scalar fallback already on main serves as reference. Bakes in PR #970 review lessons (assert(dim>=16), 4-accumulator ILP, formula anchor, load_unaligned<float> metadata, dispatcher-routed tier-walk tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md. Each task ends in a commit; assistant runs tests/ASan/benchmarks after the user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…OD-14972] The 9 ARM tier blocks (L2/IP/Cosine × SVE2/SVE/NEON_HP) were missing ASSERT_EQ(alignment, 0) after each ASSERT_NEAR, unlike the SQ8_FP32 sister blocks which assert it. Adds the assertions to lock the contract that ARM tiers leave the caller's alignment value untouched.
…D-14972] svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] ← FP16[2e]. The step function loaded chunk consecutive FP16 values into positions 0..chunk-1, then passed them directly to svcvt_f32_f16_x, which picked positions 0,2,4,... and silently skipped positions 1,3,5,... For chunk=4 (128-bit SVE), only 2 of 4 FP16 values per step were used, producing wrong dot products. Fix: svzip1_f16(q_h, zeros) spreads values to even positions [v0,0,v1,0,...] so FCVT correctly reads v[0],v[1],v[2],... Applied to both the full step helper and the partial-chunk path. Discovered and fixed during ARM host verification (Task 14, MOD-14972).
…D-14972] SVE hot loop: replace svzip1_f16+svdup_f16+svwhilelt_b16 (4 ops) with svld1uh_u32 (1 op) — zero-extends each FP16 halfword into a 32-bit lane so svcvt_f32_f16_x reads the correct bits directly. Same fix applied to the partial-chunk path, which also drops the now-redundant pg16_partial predicate. Accumulator combine changed from svadd_f32_x to svadd_f32_z to match the SQ8_FP32 SVE sister. NEON residual: replace the single 8-lane block + up-to-7 software-scalar iterations with three independent 4-lane sub-steps (r>=4, r>=8, r>=12), leaving at most 3 elements for scalar — mirrors the SQ8_FP32 NEON sister exactly. Eliminates expensive vecsim_types::FP16_to_FP32 calls for residuals 4..15 (previously up to 7 software conversions per call). Both IP headers: remove assert()+<cassert> (no sister kernel uses them). Both L2 headers: drop redundant float16.h include and using declarations (arrive transitively through the included IP header).
…MOD-14972] - Remove docs/superpowers/ design and plan files (~1550 lines); sister PR #970 removed its equivalent doc before merge. - Drop 5-line "No alignment write" prose comment from the three AArch64 NEON_HP dispatcher blocks; the sister SQ8_FP32 ARM dispatchers carry no such comment — the absent alignment write already encodes the intent. - Trim GetDistFuncSQ8FP16Asymmetric to a 7-line template-mapping check at dim=15, matching the shape of GetDistFuncSQ8Asymmetric (SQ8_FP32 sister). The scalar-fallback assertion it previously duplicated is already covered by the trailing block of SQ8_FP16_SpacesOptimizationTest.
4ac05ac to
e1647dc
Compare
SQ8↔FP16 ARM kernels: proposed optimizations + benchmark resultsI profiled the ARM SQ8↔FP16 distance kernels from this PR and prototyped two optimizations, benchmarked head-to-head against this PR's code on a Graviton-4 (Neoverse V2) arm64 runner via To keep the comparison clean, both runs are identical except for the kernel changes, and both register
Speedup = baseline CPU-time ÷ treatment CPU-time (>1 = faster). 36 dimension points per tier.
What changed
Why this is trustworthyThe untouched base SVE tier reads 1.00×, confirming the comparison is clean (identical code → identical perf) — so the gains are attributable to the kernels, not run-to-run noise. No regressions anywhere (min speedup ≥ 1.00 in every tier). Happy to fold these into this PR if you'd like — the diffs are small and isolated to the IP/L2 SVE2 headers + the NEON_HP step + the SVE2 chooser. |
The bm_spaces_sq8_fp16 executable is built but was never emitted by benchmarks.sh, so no CI label (bm-spaces / benchmarks-all) would run it. Register it in bm-spaces, bm-spaces-sq8-full, benchmarks-all and benchmarks-default, and add a dedicated bm-spaces-sq8-fp16 case.
…MLALT kernel NEON_HP: widen SQ8 storage uint8->fp16->fp32 via vcvtq_f16_u16 (values 0..255 are exact in FP16), dropping two integer-widening ops per 16-element chunk with identical FP32 lane values. SVE2: dedicated kernel keeping storage+query at 16-bit and using the FMLALB/FMLALT widening multiply-accumulate pair (svmlalb_f32/svmlalt_f32). Processes svcnth() lanes/step (2x the base-SVE svcntw() granularity) and removes explicit query widening/conversion, roughly halving hot-loop loads and instructions. Wired into SVE2.cpp IP/Cosine/L2 choosers at svcnth granularity.
…-14972] Add an asimdfhm-gated NEON_FHM tier for SQ8<->FP16 IP / L2 / Cosine. Instead of widening both operands to FP32 and issuing vfmaq_f32 (the NEON_HP path), it uses vfmlalq_low/high_f16 (FMLAL/FMLAL2) to multiply the FP16 lanes directly into FP32 accumulators, removing all 8 vcvt_f32_f16 per 16 lanes. SQ8 storage is widened uint8->fp16 (exact for 0..255) and the FP16 query is consumed in place. FMLAL widens fp16->fp32 before the multiply, so accuracy matches the scalar baseline. Dispatchers prefer NEON_FHM over NEON_HP when features.asimdfhm is set. The IP core is templated on use_fhm so L2/Cosine and the residual tail are shared. Tier-walk unit tests and microbenchmarks cover the new path. Measured ~1.95x over NEON_HP at high/medium dims on asimdfhm hardware. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Describe the changes in the pull request
Add asymmetric SQ8↔FP16 SIMD distance kernels (IP, L2, Cosine) for ARM tiers: NEON_HP, SVE, SVE2. Stacked on PR #970 (MOD-14954), which delivers the x86 equivalents.
The SVE hot loop uses
svld1uh_u32to zero-extend each FP16 halfword into a 32-bit lane, allowingsvcvt_f32_f16_xto read the correct bits directly. The NEON residual mirrors the SQ8_FP32 NEON sister: three independent 4-lane sub-steps (r>=4/8/12) leaving at most 3 elements for scalar, replacing the previous single 8-lane block + up-to-7 software conversions.Additional improvements (co-authored by @lerman25, originally from PR #975):
u8→u16→u32→f32path (twovmovl_u16+vcvtq_f32_u32) with the shorteru8→u16→f16→f32path (vcvtq_f16_u16+vcvt_f32_f16), reducing instruction count in the main loop.IP_SVE2_SQ8_FP16.h,L2_SVE2_SQ8_FP16.h): usesvmlalb_f32/svmlalt_f32(SVE2-only widening multiply-accumulate) for a tighter inner loop vs the generic SVE path.spaces_sq8_fp16is now wired into all relevant benchmark categories (benchmarks-all,benchmarks-default,bm-spaces-sq8-full,bm-spaces, and the newbm-spaces-sq8-fp16category).Which issues this PR fixes
Main objects this PR modified
src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h— NEON_HP IP kernel (+ widening optimization)src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h— SVE IP kernelsrc/VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h— new dedicated SVE2 IP kernelsrc/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h— NEON_HP L2 kernelsrc/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h— SVE L2 kernelsrc/VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h— new dedicated SVE2 L2 kernelsrc/VecSim/spaces/functions/NEON_HP.{h,cpp},SVE.{h,cpp},SVE2.{h,cpp}— chooser symbolssrc/VecSim/spaces/IP_space.cpp,L2_space.cpp— AArch64 dispatcher blockstests/unit/test_spaces.cpp— tier-walk tests for NEON_HP / SVE / SVE2tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp— ARM microbench registrationstests/benchmark/benchmarks.sh— benchmark category registration forspaces_sq8_fp16Mark if applicable
Note
Medium Risk
Changes hot-path distance math on AArch64 with multiple ISA variants; wrong dispatch or numeric drift could affect search quality, though unit tests compare against scalar baselines.
Overview
Adds ARM64 SIMD for asymmetric SQ8 storage ↔ FP16 query distances (inner product, L2², cosine), complementing existing x86 paths.
New kernels cover NEON half-precision (
asimdhpand optional FHMasimdfhmwith widening FMA), SVE (FP16 loads viasvld1uh_u32), and a dedicated SVE2 path usingsvmlalb_f32/svmlalt_f32for a denser FP16 dot loop. L2 reuses the IP core plus the usual sum-of-squares identity. Runtime dispatch inIP_space.cpp/L2_space.cppprefers SVE2 → SVE → NEON_FHM → NEON_HP whendim ≥ 16.Tests and benchmarks gain tier walk-down checks and register
spaces_sq8_fp16(including NEON_FHM) in the benchmark harness.Reviewed by Cursor Bugbot for commit 3118914. Bugbot is set up for automated code reviews on this repo. Configure here.