Skip to content

Pr 973 bench FMLAL 8-accumulator unroll#977

Draft
lerman25 wants to merge 25 commits into
mainfrom
pr-973-bench-fmlal-x2
Draft

Pr 973 bench FMLAL 8-accumulator unroll#977
lerman25 wants to merge 25 commits into
mainfrom
pr-973-bench-fmlal-x2

Conversation

@lerman25
Copy link
Copy Markdown
Collaborator

@lerman25 lerman25 commented Jun 2, 2026

SQ8↔FP16 NEON FMLAL — 8-accumulator unroll (benchmark treatment)

Draft for benchmarking only. Builds on the FMLAL tier in #976; this is an ILP experiment.

What changed

The FHM main loop previously processed 16 lanes/iter into 4 FP32 accumulators
(sum0..3), so only 4 independent dependency chains were in flight — FMLAL latency
(~4 cyc on Neoverse V2) was only partially hidden. This now processes two 16-lane
steps per iteration into 8 accumulators (sum0..7), with an odd-16-chunk tail, merging
sum4..7 back into sum0..3 before the shared residual/reduction. This matches the ILP the
SVE2 kernel already uses (4 steps × even/odd = 8 accumulators).

Verified the compiler keeps 8 distinct accumulator registers (v0–v7) and still emits
fmlal/fmlal2. Precision is unchanged (same per-lane products, only the partial-sum
grouping differs — within test tolerance). The NEON_HP (non-FHM) path is untouched.

Benchmarking

Labeled bm-spaces. The NEON_FHM numbers here are directly comparable to the NEON_FHM
numbers in #976 (same harness/hardware) — that's the A/B: new 8-acc FHM vs prior 4-acc FHM.


Co-authored by Augment Code

dor-forer and others added 25 commits June 1, 2026 15:12
Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure
onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing
ARM TU compile flags. Scalar fallback already on main serves as
reference. Bakes in PR #970 review lessons (assert(dim>=16),
4-accumulator ILP, formula anchor, load_unaligned<float> metadata,
dispatcher-routed tier-walk tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md.
Each task ends in a commit; assistant runs tests/ASan/benchmarks after the
user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…OD-14972]

The 9 ARM tier blocks (L2/IP/Cosine × SVE2/SVE/NEON_HP) were missing
ASSERT_EQ(alignment, 0) after each ASSERT_NEAR, unlike the SQ8_FP32
sister blocks which assert it. Adds the assertions to lock the contract
that ARM tiers leave the caller's alignment value untouched.
…D-14972]

svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] ← FP16[2e].
The step function loaded chunk consecutive FP16 values into positions 0..chunk-1,
then passed them directly to svcvt_f32_f16_x, which picked positions 0,2,4,...
and silently skipped positions 1,3,5,...  For chunk=4 (128-bit SVE), only 2 of
4 FP16 values per step were used, producing wrong dot products.

Fix: svzip1_f16(q_h, zeros) spreads values to even positions [v0,0,v1,0,...] so
FCVT correctly reads v[0],v[1],v[2],...  Applied to both the full step helper
and the partial-chunk path.

Discovered and fixed during ARM host verification (Task 14, MOD-14972).
…D-14972]

SVE hot loop: replace svzip1_f16+svdup_f16+svwhilelt_b16 (4 ops) with
svld1uh_u32 (1 op) — zero-extends each FP16 halfword into a 32-bit lane
so svcvt_f32_f16_x reads the correct bits directly. Same fix applied to
the partial-chunk path, which also drops the now-redundant pg16_partial
predicate. Accumulator combine changed from svadd_f32_x to svadd_f32_z
to match the SQ8_FP32 SVE sister.

NEON residual: replace the single 8-lane block + up-to-7 software-scalar
iterations with three independent 4-lane sub-steps (r>=4, r>=8, r>=12),
leaving at most 3 elements for scalar — mirrors the SQ8_FP32 NEON sister
exactly. Eliminates expensive vecsim_types::FP16_to_FP32 calls for
residuals 4..15 (previously up to 7 software conversions per call).

Both IP headers: remove assert()+<cassert> (no sister kernel uses them).
Both L2 headers: drop redundant float16.h include and using declarations
(arrive transitively through the included IP header).
…MOD-14972]

- Remove docs/superpowers/ design and plan files (~1550 lines); sister PR #970
  removed its equivalent doc before merge.
- Drop 5-line "No alignment write" prose comment from the three AArch64
  NEON_HP dispatcher blocks; the sister SQ8_FP32 ARM dispatchers carry no
  such comment — the absent alignment write already encodes the intent.
- Trim GetDistFuncSQ8FP16Asymmetric to a 7-line template-mapping check at
  dim=15, matching the shape of GetDistFuncSQ8Asymmetric (SQ8_FP32 sister).
  The scalar-fallback assertion it previously duplicated is already covered
  by the trailing block of SQ8_FP16_SpacesOptimizationTest.
The bm_spaces_sq8_fp16 executable is built but was never emitted by
benchmarks.sh, so no CI label (bm-spaces / benchmarks-all) would run it.
Register it in bm-spaces, bm-spaces-sq8-full, benchmarks-all and
benchmarks-default, and add a dedicated bm-spaces-sq8-fp16 case.
…MLALT kernel

NEON_HP: widen SQ8 storage uint8->fp16->fp32 via vcvtq_f16_u16 (values 0..255
are exact in FP16), dropping two integer-widening ops per 16-element chunk with
identical FP32 lane values.

SVE2: dedicated kernel keeping storage+query at 16-bit and using the FMLALB/FMLALT
widening multiply-accumulate pair (svmlalb_f32/svmlalt_f32). Processes svcnth()
lanes/step (2x the base-SVE svcntw() granularity) and removes explicit query
widening/conversion, roughly halving hot-loop loads and instructions. Wired into
SVE2.cpp IP/Cosine/L2 choosers at svcnth granularity.
Adds an asimdfhm-gated NEON_FHM tier for SQ8<->FP16 IP/L2/Cosine that uses
vfmlalq_low/high_f16 to multiply FP16 lanes directly into FP32 accumulators,
eliminating the explicit fp16->fp32 conversions of the NEON_HP path. Dispatchers
prefer NEON_FHM over NEON_HP when features.asimdfhm is set; benchmarks and
tier-walk unit tests register the new tier.
The FHM path processed 16 lanes/iter into 4 FP32 accumulators, leaving FMLAL
latency only partially hidden. Process two 16-lane steps per iteration into two
independent accumulator sets (8 total), with an odd-16-chunk tail, merging back
into sum0..3 before the shared residual/reduction. Matches the SVE2 kernel's ILP.
NEON_HP (non-FHM) path is unchanged.
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ lerman25
✅ dor-forer
❌ Ubuntu


Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@jit-ci
Copy link
Copy Markdown

jit-ci Bot commented Jun 2, 2026

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants