Add fast approximate reciprocal methods for float vectors by tomcur · Pull Request #204 · linebender/fearless_simd

tomcur · 2026-02-22T14:19:01Z

x86 and AArch64 have instructions to calculate fast approximate reciprocals, and these can speed up some algorithms quite nicely. For example, sprinkling this into Vello's flatten_simd.rs reduces flattening time for GhostScript Tiger by 4%, though actually landing that there requires a bit of thought about whether the lowered precision is acceptable, of course.

There is some detail here that this PR as-is doesn't attempt to solve. x86's rcp has about 12 bits of precision, AArch64's vrecpe about 8 bits. AArch64 has an additional instruction however, vrecps, to perform a Newton refinement step, which bumps the precision to 16 bits. That'd look something like the following.

let x0 = vrecpeq_f32(a);
// Calculates x0 * (2 - x0 * a), roughly doubling the precision of the `x0` estimate.
// NEON vector types have no `*` operator in Rust, so the product uses vmulq_f32.
vmulq_f32(x0, vrecpsq_f32(a, x0))
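
To see why one refinement step roughly doubles the number of correct bits, here is a minimal scalar sketch that runs on any target. It is not part of this PR: `coarse_recip` is a hypothetical stand-in for a hardware estimate (it just truncates the mantissa of the exact reciprocal), and `newton_step` and `relative_error` are illustrative names.

```rust
/// Hypothetical stand-in for a hardware estimate: the exact reciprocal with
/// its mantissa truncated to `bits` bits (`bits` must be at most 23).
fn coarse_recip(a: f32, bits: u32) -> f32 {
    let exact = 1.0 / a;
    let mask = !0u32 << (23 - bits);
    f32::from_bits(exact.to_bits() & mask)
}

/// One Newton-Raphson step for the reciprocal: x1 = x0 * (2 - a * x0).
/// If x0 = (1 - e) / a, then x1 = (1 - e * e) / a, so the error is squared.
fn newton_step(a: f32, x0: f32) -> f32 {
    x0 * (2.0 - a * x0)
}

/// Relative error of `x` as an approximation of 1 / a, computed in f64.
fn relative_error(a: f32, x: f32) -> f64 {
    let exact = 1.0 / a as f64;
    ((x as f64 - exact) / exact).abs()
}

fn main() {
    let a = 3.0_f32;
    let x0 = coarse_recip(a, 8);
    let x1 = newton_step(a, x0);
    println!("estimate: relative error {:e}", relative_error(a, x0));
    println!("refined:  relative error {:e}", relative_error(a, x1));
}
```

The real `vrecpsq_f32(a, x0)` computes the `2 - a * x0` factor in a single instruction, which is what makes the refinement cheap on AArch64.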

Then, AVX-512 introduces rcp14, which computes the reciprocal to 14 bits of precision with (I believe) the same performance as rcp, and extends support to f64.

In any case, this method does the simplest thing of just exposing the cheapest hardware estimate, similar to e.g. Highway's ApproximateReciprocal.
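
As a rough illustration of "exposing the cheapest hardware estimate", the sketch below uses the SSE `_mm_rcp_ps` intrinsic on x86_64 and falls back to a plain division elsewhere. The free function `approximate_recip` over `[f32; 4]` is a hypothetical simplification for this example and is not the fearless_simd API, which works on its own vector types and SIMD levels.

```rust
/// Hypothetical helper: approximate 1 / x per lane using the SSE estimate.
#[cfg(target_arch = "x86_64")]
fn approximate_recip(a: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::{_mm_loadu_ps, _mm_rcp_ps, _mm_storeu_ps};
    let mut out = [0.0_f32; 4];
    // SAFETY: SSE is part of the x86_64 baseline, and both pointers refer to
    // four valid f32 lanes; the unaligned load/store have no alignment needs.
    unsafe {
        let v = _mm_loadu_ps(a.as_ptr());
        _mm_storeu_ps(out.as_mut_ptr(), _mm_rcp_ps(v));
    }
    out
}

/// Fallback for other targets: an exact division, as in the fallback
/// implementation of this PR.
#[cfg(not(target_arch = "x86_64"))]
fn approximate_recip(a: [f32; 4]) -> [f32; 4] {
    a.map(|x| 1.0 / x)
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0, 10.0];
    let r = approximate_recip(a);
    for (x, y) in a.iter().zip(r.iter()) {
        println!("1/{x} ~ {y} (exact {})", 1.0 / x);
    }
}
```

Callers should only rely on the weakest precision across targets, which is why the results are compared against a loose tolerance rather than for equality.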

LaurenzV · 2026-02-23T13:45:58Z

    }
    #[inline(always)]
+    fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {
+        self.splat_f32x4(1.0) / a


I haven't tried it, but does division work without splatting? I think it works for multiplication at least.
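
For context, "working without splatting" comes down to whether a `Div` impl exists with the scalar on the left-hand side. A minimal sketch with a hypothetical `V4` type (not fearless_simd's actual vector types or operator impls):

```rust
use std::ops::Div;

/// Hypothetical four-lane vector, used only for this illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
struct V4([f32; 4]);

impl V4 {
    /// Broadcast one scalar to all lanes.
    fn splat(x: f32) -> Self {
        V4([x; 4])
    }
}

/// Lane-wise vector / vector division.
impl Div for V4 {
    type Output = V4;
    fn div(self, rhs: V4) -> V4 {
        let mut out = self.0;
        for (o, r) in out.iter_mut().zip(rhs.0) {
            *o /= r;
        }
        V4(out)
    }
}

/// "Implicit splatting": scalar / vector broadcasts the scalar first.
impl Div<V4> for f32 {
    type Output = V4;
    fn div(self, rhs: V4) -> V4 {
        V4::splat(self) / rhs
    }
}

fn main() {
    let a = V4([1.0, 2.0, 4.0, 8.0]);
    // Both spellings give the same lanes.
    assert_eq!(1.0_f32 / a, V4::splat(1.0) / a);
    println!("{:?}", 1.0_f32 / a);
}
```

Without the second impl, only the explicit `V4::splat(1.0) / a` form compiles.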

LaurenzV · 2026-02-23T13:49:13Z

        unsafe { _mm_sqrt_ps(a.into()).simd_into(self) }
    }
    #[inline(always)]
+    fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {


I'm wondering whether we should just spell reciprocal out. But should be fine this way!

Wondered the same thing, but decided to mirror e.g. f32::recip.


tomcur force-pushed the approximate-recip branch from 691c363 to 52520f7 Compare February 22, 2026 14:20

LaurenzV approved these changes Feb 23, 2026


tomcur added 2 commits May 23, 2026 13:36

Make tests consistent

3121cd8

tomcur force-pushed the approximate-recip branch from 6bdfe8d to 3121cd8 Compare May 23, 2026 11:36

tomcur added 3 commits May 23, 2026 13:40

Use implicit splatting

30f295f

Use explicit splatting for WASM (implicit doesn't work)

44b833b

Try using Div for WASM

4a329bd

tomcur force-pushed the approximate-recip branch from b1b9b65 to 4a329bd Compare May 23, 2026 11:52

tomcur added this pull request to the merge queue May 23, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026

tomcur added this pull request to the merge queue May 23, 2026

Merged via the queue into linebender:main with commit a065471 May 23, 2026
22 checks passed

tomcur deleted the approximate-recip branch May 23, 2026 11:59

tomcur merged 5 commits into linebender:main from tomcur:approximate-recip
