Add fast approximate reciprocal methods for float vectors #204

Merged: tomcur merged 5 commits into linebender:main from tomcur:approximate-recip on May 23, 2026

Member

@tomcur tomcur commented Feb 22, 2026

x86 and AArch64 have instructions to calculate fast approximate reciprocals, and these can speed up some algorithms quite nicely. For example, sprinkling this in Vello's flatten_simd.rs results in a 4% reduction in flattening timings for GhostScript Tiger. (Actually landing that there requires a bit of thought about whether the lowered precision is acceptable, of course!)

There is some detail here that this PR as-is doesn't attempt to solve. x86's rcp has about 12 bits of precision, while AArch64's vrecpe has about 8 bits. However, AArch64 has an additional instruction, vrecps, to perform a Newton-Raphson refinement step, which bumps the precision to roughly 16 bits. That'd look something like the following.

let x0 = vrecpeq_f32(a);
// `float32x4_t` has no `*` operator, so multiply with `vmulq_f32`.
let x1 = vmulq_f32(x0, vrecpsq_f32(a, x0)); // calculates x0 * (2 - a * x0), roughly doubling the precision of the `x0` estimate
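
To see the effect of the refinement step without NEON hardware, here is a minimal scalar sketch. The helper names (`crude_recip`, `refine`, `rel_err`) are hypothetical and not part of this PR; `crude_recip` only imitates a low-precision hardware estimate by truncating mantissa bits of the exact reciprocal.

```rust
/// Hypothetical stand-in for a hardware estimate such as `vrecpe`:
/// computes 1/a exactly, then keeps only the top `bits` mantissa bits.
/// `bits` must be in 0..=23.
fn crude_recip(a: f32, bits: u32) -> f32 {
    let exact = 1.0 / a;
    // Keep sign, exponent, and the top `bits` mantissa bits; clear the rest.
    let mask = !0u32 << (23 - bits);
    f32::from_bits(exact.to_bits() & mask)
}

/// One Newton-Raphson step for the reciprocal: x1 = x0 * (2 - a * x0).
/// This is the same formula the `vrecps` + multiply pair evaluates.
fn refine(a: f32, x0: f32) -> f32 {
    x0 * (2.0 - a * x0)
}

/// Relative error of `approx` as an estimate of 1/a, measured in f64.
fn rel_err(a: f32, approx: f32) -> f64 {
    let exact = 1.0 / a as f64;
    ((approx as f64 - exact) / exact).abs()
}
```

Calling `refine(a, crude_recip(a, 8))` and comparing `rel_err` before and after should show the error shrinking to roughly its square, until `f32` rounding takes over.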

Then, AVX512 introduces rcp14, which allows calculating to 14-bit precision with (I believe) the same performance as rcp, and extends support to f64.

In any case, this method does the simplest thing of just exposing the cheapest hardware estimate, similar to e.g. Highway's ApproximateReciprocal.
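
As a rough illustration of what "exposing the cheapest hardware estimate" means on x86-64, the sketch below wraps `_mm_rcp_ps` directly and falls back to plain division elsewhere. The free function `approximate_recip_x4` and its array-based signature are assumptions made for this example; the PR's actual methods live on the SIMD token and operate on `f32x4<Self>`.

```rust
/// Hypothetical helper: approximate 1/x for four lanes using the x86-64 `rcpps` estimate.
#[cfg(target_arch = "x86_64")]
fn approximate_recip_x4(a: [f32; 4]) -> [f32; 4] {
    use core::arch::x86_64::{_mm_loadu_ps, _mm_rcp_ps, _mm_storeu_ps};
    let mut out = [0.0f32; 4];
    // SAFETY: SSE is part of the x86-64 baseline, and both pointers
    // refer to four valid, writable/readable `f32` values.
    unsafe {
        let v = _mm_loadu_ps(a.as_ptr());
        _mm_storeu_ps(out.as_mut_ptr(), _mm_rcp_ps(v));
    }
    out
}

/// Fallback for other targets: exact division, mirroring a `1.0 / a` fallback.
#[cfg(not(target_arch = "x86_64"))]
fn approximate_recip_x4(a: [f32; 4]) -> [f32; 4] {
    a.map(|x| 1.0 / x)
}
```

Note that results differ between targets (estimate on x86-64, exact division elsewhere), which is the precision caveat discussed above.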

Comment thread fearless_simd/src/generated/fallback.rs Outdated
}
#[inline(always)]
fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {
self.splat_f32x4(1.0) / a
Collaborator

I haven't tried it, but does division work without splatting? I think it works for multiplication, at least.

Member Author

Done!
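
For readers unfamiliar with the pattern: division with a scalar on the left, as in `1.0 / a`, generally works in Rust when the scalar type implements `Div` with the vector as the right-hand side. The sketch below uses a hypothetical `Vec4` type to show the idea; it is not fearless_simd's actual implementation.

```rust
use std::ops::Div;

/// Hypothetical four-lane vector type, standing in for a SIMD vector.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec4([f32; 4]);

/// Vector / vector: lane-wise division.
impl Div for Vec4 {
    type Output = Vec4;
    fn div(self, rhs: Vec4) -> Vec4 {
        let mut out = self.0;
        for (o, r) in out.iter_mut().zip(rhs.0) {
            *o /= r;
        }
        Vec4(out)
    }
}

/// Scalar / vector: the scalar is broadcast inside the impl,
/// so callers can write `1.0 / v` without an explicit splat.
impl Div<Vec4> for f32 {
    type Output = Vec4;
    fn div(self, rhs: Vec4) -> Vec4 {
        Vec4([self; 4]) / rhs
    }
}
```

With these impls, `1.0_f32 / Vec4([1.0, 2.0, 4.0, 8.0])` divides lane-wise after the implicit broadcast.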

Comment thread fearless_simd/src/generated/wasm.rs Outdated
unsafe { _mm_sqrt_ps(a.into()).simd_into(self) }
}
#[inline(always)]
fn approximate_recip_f32x4(self, a: f32x4<Self>) -> f32x4<Self> {
Collaborator

I'm wondering whether we should just spell out reciprocal. But it should be fine this way!

Member Author

Wondered the same thing, but decided to mirror e.g. f32::recip.

Collaborator

Fair!

tomcur added 2 commits May 23, 2026 13:36
@tomcur tomcur force-pushed the approximate-recip branch from 6bdfe8d to 3121cd8 Compare May 23, 2026 11:36
@tomcur tomcur force-pushed the approximate-recip branch from b1b9b65 to 4a329bd Compare May 23, 2026 11:52
@tomcur tomcur added this pull request to the merge queue May 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026
@tomcur tomcur added this pull request to the merge queue May 23, 2026
Merged via the queue into linebender:main with commit a065471 May 23, 2026
22 checks passed
@tomcur tomcur deleted the approximate-recip branch May 23, 2026 11:59
2 participants