Add more mask methods mirroring `std::simd` API by Shnatsel · Pull Request #226 · linebender/fearless_simd

Shnatsel · 2026-05-23T16:17:50Z

Follow-up to #218

Adds to/from_bitmask, mirroring std::simd. This required substantial complexity, and is covered with exhaustive roundtrip tests for smaller sizes and tests for interesting patterns on larger sizes.
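As a reading aid, the bitmask semantics can be modeled on plain arrays. This is a minimal scalar sketch, not the crate's implementation: `ModelMask32x4` is a hypothetical stand-in, and it assumes a set lane is stored as all ones (`-1`) and a clear lane as `0`.

```rust
/// Hypothetical scalar model of a 4-lane mask with 32-bit lanes.
/// Assumption: a set lane is all ones (-1), a clear lane is 0.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ModelMask32x4(pub [i32; 4]);

impl ModelMask32x4 {
    /// Bit `i` of `bits` selects lane `i`; lane 0 is the least significant bit.
    pub fn from_bitmask(bits: u64) -> Self {
        let mut lanes = [0i32; 4];
        for (i, lane) in lanes.iter_mut().enumerate() {
            if (bits >> i) & 1 == 1 {
                *lane = -1;
            }
        }
        ModelMask32x4(lanes)
    }

    /// Packs one bit per lane into the low 4 bits of a `u64`.
    pub fn to_bitmask(self) -> u64 {
        let mut bits = 0u64;
        for (i, lane) in self.0.iter().enumerate() {
            if *lane != 0 {
                bits |= 1u64 << i;
            }
        }
        bits
    }
}
```

With only 4 lanes there are 16 possible bitmasks, so a roundtrip check over all of them is cheap, which is the kind of exhaustive test that is practical for the smaller sizes.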

Also re-introduces APIs to get/set a single bit. The API mirrors std::simd instead of the previous Index trait. The implementation has minimal complexity; it reuses to_bitmask(), which is why both changes are in the same PR.

Not included in this PR: better docs on mask types, including the cost trade-offs for various ways of creating them. I'd like to add that in a follow-up.

Performance

from_bitmask

to/from_bitmask are lowered to intrinsics for each backend.

Our codegen for from_bitmask() is on par with or better than std::simd on both x86 and NEON. Making it fast on NEON was quite easy - std::simd performs scalar bit extraction for many widths, which is slow, so on NEON we often win by a landslide. It's so bad that I reported this upstream. x86 required more effort for parity, but we're on-par-or-better there too, dramatically so for some sizes.
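For context on why scalar bit extraction is slow, here is one common way to expand a bitmask using only whole-vector steps (broadcast, AND, compare). It is modeled on plain arrays and is an assumption for illustration; it is not necessarily the lowering this PR emits on any backend, and `expand_bitmask_16x8` is a hypothetical name.

```rust
/// Expands the low 8 bits of `bits` into eight 16-bit mask lanes.
/// Each loop iteration stands in for one lane of a single vector operation;
/// no bit is shifted out and inspected individually.
pub fn expand_bitmask_16x8(bits: u64) -> [i16; 8] {
    // Per-lane selector: lane `i` holds only bit `i`.
    const SELECT: [u16; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    // Broadcast the bitmask to every lane.
    let broadcast = [bits as u16; 8];
    let mut out = [0i16; 8];
    for i in 0..8 {
        // Vector AND followed by a vector compare-equal.
        let picked = broadcast[i] & SELECT[i];
        out[i] = if picked == SELECT[i] { -1 } else { 0 };
    }
    out
}
```

The per-lane alternative needs a shift, a mask and an insert for every lane, so its cost grows with the lane count.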

from_bitmask performance vs std::simd

from_bitmask sweep with llvm-mca-14 -mcpu=znver3 (-mcpu=neoverse-n1 for NEON), commit 9b7a6c5 vs nightly std::simd on 1.97.0-nightly (14196dbfa 2026-04-12)

x86 numbers sanity-checked with perf on zen4 and line up with the simulation

NEON

mask	uops ours/std	RT ours/std	read
`mask8x16`	`20/55`	`6.7/18.3`	ours much better
`mask16x8`	`11/31`	`3.7/10.3`	ours much better
`mask32x4`	`11/22`	`3.7/7.3`	ours better
`mask64x2`	`11/16`	`3.7/5.3`	ours better
`mask8x32`	`37/102`	`12.3/34.0`	ours much better
`mask16x16`	`19/62`	`6.3/20.7`	ours much better
`mask32x8`	`19/38`	`6.3/12.7`	ours better
`mask64x4`	`19/26`	`6.3/8.7`	ours better
`mask8x64`	`72/204`	`24.0/68.0`	ours much better
`mask16x32`	`36/117`	`12.0/39.0`	ours much better
`mask32x16`	`36/71`	`12.0/23.7`	ours better
`mask64x8`	`36/47`	`12.0/15.7`	ours better

SSE4.2

mask	uops ours/std	RT ours/std	read
`mask8x16`	`7/7`	`2.5/2.5`	tie
`mask16x8`	`8/8`	`2.0/2.0`	tie
`mask32x4`	`7/25`	`2.0/4.5`	ours much better
`mask64x2`	`7/14`	`2.0/2.5`	ours better
`mask8x32`	`12/14`	`3.5/2.5`	mixed, std RT better
`mask16x16`	`12/13`	`3.0/3.0`	ours slight
`mask32x8`	`12/12`	`2.5/3.0`	ours slight
`mask64x4`	`13/56`	`3.5/9.3`	ours much better
`mask8x64`	`24/24`	`4.5/4.5`	tie
`mask16x32`	`22/22`	`5.0/4.0`	std RT better
`mask32x16`	`22/22`	`4.0/5.0`	ours RT better
`mask64x8`	`23/22`	`4.5/5.0`	mixed, ours RT better

AVX2

mask	uops ours/std	RT ours/std	read
`mask8x16`	`7/7`	`2.5/2.5`	tie
`mask16x8`	`7/7`	`2.0/2.0`	tie
`mask32x4`	`7/21`	`2.0/4.5`	ours much better
`mask64x2`	`7/12`	`2.0/2.5`	ours better
`mask8x32`	`9/9`	`2.5/2.5`	tie
`mask16x16`	`8/8`	`2.0/2.0`	tie
`mask32x8`	`8/8`	`2.0/2.0`	tie
`mask64x4`	`8/24`	`2.0/4.5`	ours much better
`mask8x64`	`13/13`	`3.5/3.5`	tie
`mask16x32`	`14/15`	`3.5/3.5`	ours slight
`mask32x16`	`12/12`	`3.0/3.0`	tie
`mask64x8`	`12/12`	`3.0/3.0`	tie

to_bitmask

to_bitmask() was mostly straightforward, but required special handling for 16-bit masks on x86. We're on par with or better than std::simd, except for the mask16x32 case on AVX2, where std gets the ideal vpacksswb-style lowering and we only get vpmovmskb + pext. I couldn't get rustc to emit vpacksswb.

to_bitmask performance vs std::simd

to_bitmask sweep with llvm-mca-14 -mcpu=znver3 (-mcpu=neoverse-n1 for NEON), commit 9b7a6c5 vs nightly std::simd on 1.97.0-nightly (14196dbfa 2026-04-12)

x86 numbers sanity-checked with perf on zen4 and line up with the simulation

NEON

mask	uops ours/std	RT ours/std	read
`mask8x16`	`12/14`	`4.0/4.7`	ours better
`mask16x8`	`9/10`	`3.0/3.3`	ours slight
`mask32x4`	`8/13`	`3.0/4.3`	ours better
`mask64x2`	`7/15`	`3.0/5.0`	ours better
`mask8x32`	`23/25`	`7.7/8.3`	ours slight
`mask16x16`	`17/22`	`5.7/7.3`	ours better
`mask32x8`	`15/18`	`5.0/6.0`	ours better
`mask64x4`	`13/19`	`5.0/6.3`	ours better
`mask8x64`	`44/48`	`14.7/16.0`	ours slight
`mask16x32`	`32/38`	`10.7/12.7`	ours better
`mask32x16`	`28/33`	`9.3/11.0`	ours better
`mask64x8`	`24/29`	`9.0/9.7`	ours slight

SSE4.2

mask	uops ours/std	RT ours/std	read
`mask8x16`	`3/3`	`1.0/1.0`	tie
`mask16x8`	`5/5`	`1.0/1.0`	tie
`mask32x4`	`3/7`	`1.0/1.2`	ours better
`mask64x2`	`3/9`	`1.0/1.5`	ours better
`mask8x32`	`7/7`	`2.0/2.0`	tie
`mask16x16`	`4/4`	`1.0/1.0`	tie
`mask32x8`	`6/6`	`1.0/1.0`	tie
`mask64x4`	`4/11`	`1.0/1.8`	ours better
`mask8x64`	`15/15`	`4.0/4.0`	tie
`mask16x32`	`9/9`	`2.0/2.0`	tie
`mask32x16`	`7/7`	`2.0/2.0`	tie
`mask64x8`	`15/15`	`2.5/2.5`	tie

AVX2

mask	uops ours/std	RT ours/std	read
`mask8x16`	`3/3`	`1.0/1.0`	tie
`mask16x8`	`5/5`	`1.0/1.0`	tie
`mask32x4`	`3/4`	`1.0/1.0`	ours slight
`mask64x2`	`3/8`	`1.0/1.3`	ours better
`mask8x32`	`4/4`	`1.0/1.0`	tie
`mask16x16`	`4/4`	`1.0/1.0`	tie
`mask32x8`	`4/4`	`1.0/1.0`	tie
`mask64x4`	`4/8`	`1.0/1.3`	ours better
`mask8x64`	`8/8`	`2.0/2.0`	tie
`mask16x32`	`11/7`	`2.0/1.2`	std better
`mask32x16`	`10/10`	`1.7/1.7`	tie
`mask64x8`	`7/7`	`1.2/1.2`	tie

set

test/set are implemented with generic codepaths without messing with intrinsics, since they're less performance-critical and the generic codegen is already good.

set() is implemented by converting the vector to an array and flipping the one value with a scalar instruction. This matches std::simd assembly. On x86 a much more complex path that keeps values in registers is possible, but it's only 5-10% faster and doesn't seem to be worth the effort. It can be added in a follow-up PR if desired.
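The array roundtrip described above can be sketched as follows. `ArrayMask32x4` is a hypothetical stand-in for a real mask type, assuming lanes are stored as signed integers with `-1` for set and `0` for clear; the real method is generated per mask type and may differ in detail.

```rust
/// Hypothetical stand-in for a 4-lane mask stored as signed integer lanes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ArrayMask32x4(pub [i32; 4]);

impl ArrayMask32x4 {
    /// Sets or clears lane `index` by writing a single array element.
    pub fn set(&mut self, index: usize, value: bool) {
        assert!(
            index < 4,
            "mask lane index {} is out of bounds for {} lanes",
            index,
            4
        );
        let mut lanes = self.0; // vector -> array
        lanes[index] = if value { -1 } else { 0 }; // one scalar store
        self.0 = lanes; // array -> vector
    }
}
```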

test

test() is implemented via to_bitmask() which deviates from the assembly produced by std::simd. It is faster than roundtripping through an array if your vector is already in registers, e.g. via a.simd_cmp(b).test(n) (8 cycles instead of 12 on llvm-mca). However it is slower if the vector is already spilled to the stack. I've chosen to err on the side of not forcing stack spills. Reasonable people can disagree about the trade-offs here, but I don't think it really matters since this function is not performance-critical anyway.
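A scalar sketch of the bitmask-based `test()` follows. `BitTestMask16x8` and its array-based `to_bitmask` are hypothetical stand-ins; in the PR the conversion is lowered to backend intrinsics, which is what keeps the value in registers.

```rust
/// Hypothetical stand-in for an 8-lane mask with 16-bit lanes.
#[derive(Clone, Copy, Debug)]
pub struct BitTestMask16x8(pub [i16; 8]);

impl BitTestMask16x8 {
    /// Array-based placeholder for the intrinsic-backed conversion.
    pub fn to_bitmask(self) -> u64 {
        let mut bits = 0u64;
        for (i, lane) in self.0.iter().enumerate() {
            if *lane != 0 {
                bits |= 1u64 << i;
            }
        }
        bits
    }

    /// Tests one lane by extracting one bit from the compact bitmask.
    pub fn test(self, index: usize) -> bool {
        assert!(
            index < 8,
            "mask lane index {} is out of bounds for {} lanes",
            index,
            8
        );
        (self.to_bitmask() >> index) & 1 == 1
    }
}
```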


LaurenzV

I have checked everything except for the x86 backend (I might just rubberstamp it in the end, let's see 😅). I do have some other concerns, though.

LaurenzV · 2026-05-23T18:15:57Z

    clippy::cast_possible_truncation,
    clippy::unseparated_literal_suffix,
    clippy::use_self,
+    clippy::wrong_self_convention,


Why is this needed?

Clippy complains about functions such as fn from_bitmask_mask32x8(self, bits: u64) -> mask32x8<Self>;, and its objection is that

methods called from_* usually take no self

But it's on a trait where self is Simd and from_bitmask is a method implemented on the mask trait.

The only "fix" for this lint is to break the existing naming convention, so into the suppression list it goes.
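A reduced illustration of the pattern the lint objects to. The trait and types here (`SimdToken`, `Fallback`, `Mask32x4`) are hypothetical simplifications of the crate's real ones, and the `allow` is scoped to the trait only for this sketch.

```rust
/// Hypothetical, simplified mask type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Mask32x4 {
    pub lanes: [i32; 4],
}

/// Hypothetical, simplified SIMD token trait.
/// `self` is the capability token, not the value being converted, so the
/// `from_*` method legitimately takes `self`.
#[allow(clippy::wrong_self_convention)]
pub trait SimdToken: Copy {
    fn from_bitmask_mask32x4(self, bits: u64) -> Mask32x4;
}

/// Hypothetical scalar fallback token.
#[derive(Clone, Copy)]
pub struct Fallback;

impl SimdToken for Fallback {
    fn from_bitmask_mask32x4(self, bits: u64) -> Mask32x4 {
        let mut lanes = [0i32; 4];
        for (i, lane) in lanes.iter_mut().enumerate() {
            if (bits >> i) & 1 == 1 {
                *lane = -1;
            }
        }
        Mask32x4 { lanes }
    }
}
```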

LaurenzV · 2026-05-23T18:22:06Z

+            }
+            OpSig::MaskToBitmask => {
+                let arg0 = &arg_names[0];
+                quote! { (self, #arg0: #ty<Self>) -> u64 }


Would there be any benefit in using u8/u16/u32 for the bitmask wherever possible instead of always using u64?

Not really. I started out doing that, but found that it complicates things for basically no gain, makes the API a lot harder to use in a generic context, and breaks compatibility with the std::simd API. So I switched over to u64, just like std::simd.

We still have unused bits anyway e.g. for 32x4 mask where we only use 4 bits out of a u8, so it's not like we can get rid of unused bits either way.
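To illustrate the generic-context point: with a single `u64` return type, helpers can be written once for every mask width. The trait `BitmaskLike` and the types below are hypothetical stand-ins, not the crate's API.

```rust
/// Hypothetical trait: every mask width returns the same bitmask type.
pub trait BitmaskLike: Copy {
    fn to_bitmask(self) -> u64;
}

/// Packs one bit per lane, lane 0 in the least significant bit.
fn pack(lanes: impl Iterator<Item = bool>) -> u64 {
    lanes
        .enumerate()
        .fold(0u64, |acc, (i, set)| acc | ((set as u64) << i))
}

#[derive(Clone, Copy)]
pub struct M32x4(pub [i32; 4]);
#[derive(Clone, Copy)]
pub struct M8x16(pub [i8; 16]);

impl BitmaskLike for M32x4 {
    fn to_bitmask(self) -> u64 {
        pack(self.0.iter().map(|l| *l != 0))
    }
}

impl BitmaskLike for M8x16 {
    fn to_bitmask(self) -> u64 {
        pack(self.0.iter().map(|l| *l != 0))
    }
}

/// Works for any mask width without an associated bitmask type.
pub fn count_set_lanes<M: BitmaskLike>(mask: M) -> u32 {
    mask.to_bitmask().count_ones()
}

/// Index of the lowest set lane, if any.
pub fn first_set_lane<M: BitmaskLike>(mask: M) -> Option<usize> {
    let bits = mask.to_bitmask();
    if bits == 0 {
        None
    } else {
        Some(bits.trailing_zeros() as usize)
    }
}
```

If each mask returned `u8`, `u16` or `u32` instead, these helpers would need an associated type plus trait bounds for `count_ones` and `trailing_zeros`.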

LaurenzV · 2026-05-23T18:31:42Z

Looking at the current test structure, I just noticed that it seems a bit random. 😄 But not for this PR to figure out.

LaurenzV · 2026-05-23T18:50:11Z

+                assert!(
+                    index < #len,
+                    "mask lane index {index} is out of bounds for {} lanes",
+                    #len
+                );


Would it make sense to turn this into a debug assert and just document that setting a larger index is undefined behavior?

Well, we certainly don't want undefined behavior in safe code. We could go for unspecified behavior.

std::simd provides _unchecked variants of these methods, which are unsafe fn. I don't really want to do that since these methods really aren't performance-critical: if you are poking bits in a SIMD mask one by one and expect that to be fast, you're doing something wrong anyway.

If you really want to do something about this, we can either make the assert! emit fewer instructions along the lines of fast_assert (shameless plug) or add index % lanes in there somewhere, either instead of the panic or as a separate _unchecked function variant. But I don't think it's worth doing for the reasons mentioned above.
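The two safe options can be compared on a plain bitmask. Both functions are hypothetical sketches rather than code from this PR, and both assume `lanes` is between 1 and 64.

```rust
/// Panics on an out-of-range index, like the `assert!` in this PR.
pub fn test_checked(bits: u64, lanes: usize, index: usize) -> bool {
    assert!(
        index < lanes,
        "mask lane index {} is out of bounds for {} lanes",
        index,
        lanes
    );
    (bits >> index) & 1 == 1
}

/// Wraps an out-of-range index instead of panicking. The result for such
/// an index is unspecified from the caller's point of view, but it is
/// never undefined behavior.
pub fn test_wrapping(bits: u64, lanes: usize, index: usize) -> bool {
    (bits >> (index % lanes)) & 1 == 1
}
```

For in-range indices the two agree; they differ only in what happens to a caller bug.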

LaurenzV · 2026-05-23T18:53:44Z

        }
    }

+    // Current backends store masks as signed integer lanes, so `set` uses a generic


For test this is not necessary?

That is one possible way to implement test() and that's what std::simd does, but this PR does it differently. We convert to a u64 bitmask instead of an array, for performance. See the PR description details, under "test" header.

LaurenzV · 2026-05-23T19:06:16Z

+            /// Create a mask from a compact bitmask.
+            ///
+            /// Bit `i` maps to lane `i`, with lane 0 in the least significant bit. Bits above
+            /// [`Self::N`] are ignored.


That sentence sounds a bit ambiguous, I presume what is meant is that any lane >= Self::N is ignored? Same for from_bitmask.

Yes. The mask only stores N bits so anything that doesn't fit is ignored.
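Put differently, only the low `N` bits of the input take part. A small sketch of that truncation, with `lane_bits` as a hypothetical helper name:

```rust
/// Keeps only the bits that correspond to real lanes.
/// The `lanes >= 64` branch avoids `1u64 << 64`, which would overflow.
pub fn lane_bits(bits: u64, lanes: u32) -> u64 {
    if lanes >= 64 {
        bits
    } else {
        bits & ((1u64 << lanes) - 1)
    }
}
```

Under this reading, a `from_bitmask(bits)` followed by `to_bitmask()` on an `N`-lane mask would be expected to return `lane_bits(bits, N)` rather than `bits` itself.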

LaurenzV · 2026-05-23T19:07:10Z

+
+            /// Test whether one logical lane is set.
+            ///
+            /// Panics if `index` is greater than or equal to the number of lanes in the mask.


Would it make sense to make the assertion a debug assertion instead and just say it's undefined behavior?

See #226 (comment)

LaurenzV · 2026-05-23T19:24:55Z

Also, thanks for the careful benchmarking you did!


Shnatsel added 13 commits May 23, 2026 14:27

Add optimized to/from_bitmask on masks, and an initial version of test/set, along with tests for them

d31ed07

Implement set() via array roundtrip instead, matching std::simd assembly

da174ba

Optimize to_bitmask on NEON

9b0cfb0

optimize from_bitmask on NEON

4612f8c

Expand to/from_bitmask roundtrip tests

5dae6cc

Further optimize from_bitmask for x86, allow more granular fall-through to the generic implementation to make it possible

9b7a6c5

Optimize the 16-bit cases of to_bitmask() on x86

b61d4e4

Move specialization exception checks closer to where they're used

8f53a16

Optimize from_bitmask for WASM

ebce9f0

Optimize to_bitmask for WASM

8dc5eaa

Placate clippy in a straightforward way

95cb174

Make mask constants less hideous

7d73a6a

Fix doc link

b4899e3


Shnatsel added 2 commits May 23, 2026 23:57

Replace a complex exhaustive test for masks with a more verbose but simpler form with all interesting values written out

258da0a

Replace lists of interesting values with exhaustive loops for smaller mask sizes in tests

5235116
