Add more mask methods mirroring std::simd API #226

Open
Shnatsel wants to merge 15 commits into linebender:main from Shnatsel:more-mask-methods-final

Contributor

@Shnatsel Shnatsel commented May 23, 2026

Follow-up to #218

Adds to_bitmask() and from_bitmask(), mirroring std::simd. The implementation is substantially complex, so it is covered with exhaustive roundtrip tests for smaller sizes and tests for interesting patterns on larger sizes.
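To make the bitmask semantics and the roundtrip property concrete, here is a minimal scalar sketch. It is not the code in this PR: the helper names `to_bitmask_ref`, `from_bitmask_ref` and `roundtrip_all_8_lane_patterns` are hypothetical, masks are modelled as plain arrays whose lanes are 0 (false) or -1 (true), and at most 64 lanes are assumed.

```rust
/// Hypothetical scalar model: pack lane `i` into bit `i`, lane 0 in the least significant bit.
/// Assumes N <= 64 so every lane fits in a u64.
fn to_bitmask_ref<const N: usize>(lanes: [i8; N]) -> u64 {
    let mut bits = 0u64;
    for (i, lane) in lanes.iter().enumerate() {
        if *lane != 0 {
            bits |= 1u64 << i;
        }
    }
    bits
}

/// Hypothetical scalar model: expand bit `i` into lane `i` (-1 for set, 0 for clear).
/// Assumes N <= 64.
fn from_bitmask_ref<const N: usize>(bits: u64) -> [i8; N] {
    let mut lanes = [0i8; N];
    for (i, lane) in lanes.iter_mut().enumerate() {
        if (bits >> i) & 1 != 0 {
            *lane = -1;
        }
    }
    lanes
}

/// Exhaustive roundtrip over all 256 patterns of an 8-lane mask.
fn roundtrip_all_8_lane_patterns() -> bool {
    (0u64..256).all(|bits| to_bitmask_ref(from_bitmask_ref::<8>(bits)) == bits)
}
```

An exhaustive sweep like this is only practical for small lane counts, which is presumably why larger sizes are tested with selected patterns instead.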

Also re-introduces APIs to get/set a single bit. The API mirrors std::simd instead of the previous Index trait. The implementation has minimal complexity; it reuses to_bitmask(), which is why both changes are in the same PR.

Not included in this PR: better docs on mask types, including the cost trade-offs for various ways of creating them. I'd like to add that in a follow-up.

Performance

from_bitmask

to/from_bitmask are lowered to intrinsics for each backend.

Our codegen for from_bitmask() is on par with or better than std::simd on both x86 and NEON. Making it fast on NEON was quite easy: std::simd performs scalar bit extraction for many widths, which is slow, so on NEON we often win by a landslide. It's bad enough that I reported it upstream. x86 required more effort to reach parity, but we're on par or better there too, dramatically so for some sizes.
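For readers unfamiliar with the difference, the sketch below contrasts per-lane scalar bit extraction with a lane-parallel shape (broadcast the bits, AND with a per-lane selector, compare). It is written with plain arrays and is only an illustration of the general idea; it is not the intrinsic sequence used by this PR or by std::simd, and the names `from_bitmask_scalar`, `from_bitmask_lanewise`, `LANES` and `SELECT` are hypothetical.

```rust
const LANES: usize = 4;
/// One selector bit per lane: lane i looks at bit i.
const SELECT: [i32; LANES] = [1, 2, 4, 8];

/// Scalar extraction: shift and test one bit per lane, one lane at a time.
fn from_bitmask_scalar(bits: u64) -> [i32; LANES] {
    let mut out = [0i32; LANES];
    for (i, lane) in out.iter_mut().enumerate() {
        *lane = if (bits >> i) & 1 != 0 { -1 } else { 0 };
    }
    out
}

/// Lane-parallel shape: every lane performs the same AND and compare,
/// which is the kind of work a vector unit can do for all lanes at once.
fn from_bitmask_lanewise(bits: u64) -> [i32; LANES] {
    let broadcast = [bits as i32; LANES];
    let mut out = [0i32; LANES];
    for ((lane, b), sel) in out.iter_mut().zip(broadcast).zip(SELECT) {
        *lane = if (b & sel) == sel { -1 } else { 0 };
    }
    out
}
```

Both functions compute the same result; the second one simply has no per-lane shift amounts, so each step maps onto a single vector operation.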

from_bitmask performance vs std::simd

from_bitmask sweep with llvm-mca-14 -mcpu=znver3 (-mcpu=neoverse-n1 for NEON), commit 9b7a6c5 vs nightly std::simd on 1.97.0-nightly (14196dbfa 2026-04-12)

x86 numbers sanity-checked with perf on zen4 and line up with the simulation

NEON

| mask | uops ours/std | RT ours/std | read |
|---|---|---|---|
| mask8x16 | 20/55 | 6.7/18.3 | ours much better |
| mask16x8 | 11/31 | 3.7/10.3 | ours much better |
| mask32x4 | 11/22 | 3.7/7.3 | ours better |
| mask64x2 | 11/16 | 3.7/5.3 | ours better |
| mask8x32 | 37/102 | 12.3/34.0 | ours much better |
| mask16x16 | 19/62 | 6.3/20.7 | ours much better |
| mask32x8 | 19/38 | 6.3/12.7 | ours better |
| mask64x4 | 19/26 | 6.3/8.7 | ours better |
| mask8x64 | 72/204 | 24.0/68.0 | ours much better |
| mask16x32 | 36/117 | 12.0/39.0 | ours much better |
| mask32x16 | 36/71 | 12.0/23.7 | ours better |
| mask64x8 | 36/47 | 12.0/15.7 | ours better |

SSE4.2

| mask | uops ours/std | RT ours/std | read |
|---|---|---|---|
| mask8x16 | 7/7 | 2.5/2.5 | tie |
| mask16x8 | 8/8 | 2.0/2.0 | tie |
| mask32x4 | 7/25 | 2.0/4.5 | ours much better |
| mask64x2 | 7/14 | 2.0/2.5 | ours better |
| mask8x32 | 12/14 | 3.5/2.5 | mixed, std RT better |
| mask16x16 | 12/13 | 3.0/3.0 | ours slight |
| mask32x8 | 12/12 | 2.5/3.0 | ours slight |
| mask64x4 | 13/56 | 3.5/9.3 | ours much better |
| mask8x64 | 24/24 | 4.5/4.5 | tie |
| mask16x32 | 22/22 | 5.0/4.0 | std RT better |
| mask32x16 | 22/22 | 4.0/5.0 | ours RT better |
| mask64x8 | 23/22 | 4.5/5.0 | mixed, ours RT better |

AVX2

| mask | uops ours/std | RT ours/std | read |
|---|---|---|---|
| mask8x16 | 7/7 | 2.5/2.5 | tie |
| mask16x8 | 7/7 | 2.0/2.0 | tie |
| mask32x4 | 7/21 | 2.0/4.5 | ours much better |
| mask64x2 | 7/12 | 2.0/2.5 | ours better |
| mask8x32 | 9/9 | 2.5/2.5 | tie |
| mask16x16 | 8/8 | 2.0/2.0 | tie |
| mask32x8 | 8/8 | 2.0/2.0 | tie |
| mask64x4 | 8/24 | 2.0/4.5 | ours much better |
| mask8x64 | 13/13 | 3.5/3.5 | tie |
| mask16x32 | 14/15 | 3.5/3.5 | ours slight |
| mask32x16 | 12/12 | 3.0/3.0 | tie |
| mask64x8 | 12/12 | 3.0/3.0 | tie |

to_bitmask

to_bitmask() was mostly straightforward, but required special handling for 16-bit masks on x86. We're on par with or better than std::simd, except for the mask16x32 case on AVX2, where std gets the ideal vpacksswb-style lowering and we only get vpmovmskb + pext. I couldn't get rustc to emit vpacksswb.
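As an illustration of what a pack-then-movemask lowering looks like, here is a sketch for a single 128-bit mask16x8 using the SSE2 intrinsics `_mm_packs_epi16` and `_mm_movemask_epi8` directly. It is not the code generated by this PR, it covers the 128-bit case rather than the AVX2 mask16x32 case discussed above, and the function name `mask16x8_to_bitmask` plus the array-based signature are hypothetical. A scalar fallback is included so the sketch compiles on other targets.

```rust
/// Pack 16-bit mask lanes (0 or -1) down to bytes, then collect the byte sign bits.
#[cfg(target_arch = "x86_64")]
fn mask16x8_to_bitmask(lanes: [i16; 8]) -> u64 {
    use std::arch::x86_64::*;
    // SAFETY: SSE2 is part of the x86_64 baseline, and `_mm_loadu_si128`
    // accepts unaligned pointers; `lanes` is exactly 16 bytes.
    unsafe {
        let v = _mm_loadu_si128(lanes.as_ptr() as *const __m128i);
        // Signed saturation maps -1 to 0xFF and 0 to 0x00; the upper 8 bytes are zero.
        let packed = _mm_packs_epi16(v, _mm_setzero_si128());
        // One bit per byte; only the low 8 bits correspond to mask lanes.
        (_mm_movemask_epi8(packed) as u32 as u64) & 0xFF
    }
}

/// Scalar fallback with the same semantics for non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
fn mask16x8_to_bitmask(lanes: [i16; 8]) -> u64 {
    let mut bits = 0u64;
    for (i, lane) in lanes.iter().enumerate() {
        if *lane != 0 {
            bits |= 1u64 << i;
        }
    }
    bits
}
```

Applying a byte movemask directly to 16-bit lanes would yield two bits per lane, which is why narrowing (or a later bit-compaction step such as pext) is needed for 16-bit masks.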

to_bitmask performance vs std::simd

to_bitmask sweep with llvm-mca-14 -mcpu=znver3 (-mcpu=neoverse-n1 for NEON), commit 9b7a6c5 vs nightly std::simd on 1.97.0-nightly (14196dbfa 2026-04-12)

x86 numbers sanity-checked with perf on zen4 and line up with the simulation

NEON

| mask | uops ours/std | RT ours/std | read |
|---|---|---|---|
| mask8x16 | 12/14 | 4.0/4.7 | ours better |
| mask16x8 | 9/10 | 3.0/3.3 | ours slight |
| mask32x4 | 8/13 | 3.0/4.3 | ours better |
| mask64x2 | 7/15 | 3.0/5.0 | ours better |
| mask8x32 | 23/25 | 7.7/8.3 | ours slight |
| mask16x16 | 17/22 | 5.7/7.3 | ours better |
| mask32x8 | 15/18 | 5.0/6.0 | ours better |
| mask64x4 | 13/19 | 5.0/6.3 | ours better |
| mask8x64 | 44/48 | 14.7/16.0 | ours slight |
| mask16x32 | 32/38 | 10.7/12.7 | ours better |
| mask32x16 | 28/33 | 9.3/11.0 | ours better |
| mask64x8 | 24/29 | 9.0/9.7 | ours slight |

SSE4.2

| mask | uops ours/std | RT ours/std | read |
|---|---|---|---|
| mask8x16 | 3/3 | 1.0/1.0 | tie |
| mask16x8 | 5/5 | 1.0/1.0 | tie |
| mask32x4 | 3/7 | 1.0/1.2 | ours better |
| mask64x2 | 3/9 | 1.0/1.5 | ours better |
| mask8x32 | 7/7 | 2.0/2.0 | tie |
| mask16x16 | 4/4 | 1.0/1.0 | tie |
| mask32x8 | 6/6 | 1.0/1.0 | tie |
| mask64x4 | 4/11 | 1.0/1.8 | ours better |
| mask8x64 | 15/15 | 4.0/4.0 | tie |
| mask16x32 | 9/9 | 2.0/2.0 | tie |
| mask32x16 | 7/7 | 2.0/2.0 | tie |
| mask64x8 | 15/15 | 2.5/2.5 | tie |

AVX2

| mask | uops ours/std | RT ours/std | read |
|---|---|---|---|
| mask8x16 | 3/3 | 1.0/1.0 | tie |
| mask16x8 | 5/5 | 1.0/1.0 | tie |
| mask32x4 | 3/4 | 1.0/1.0 | ours slight |
| mask64x2 | 3/8 | 1.0/1.3 | ours better |
| mask8x32 | 4/4 | 1.0/1.0 | tie |
| mask16x16 | 4/4 | 1.0/1.0 | tie |
| mask32x8 | 4/4 | 1.0/1.0 | tie |
| mask64x4 | 4/8 | 1.0/1.3 | ours better |
| mask8x64 | 8/8 | 2.0/2.0 | tie |
| mask16x32 | 11/7 | 2.0/1.2 | std better |
| mask32x16 | 10/10 | 1.7/1.7 | tie |
| mask64x8 | 7/7 | 1.2/1.2 | tie |

set

test/set are implemented with generic codepaths without messing with intrinsics, since they're less performance-critical and the generic codegen is already good.

set() is implemented by converting the vector to an array and flipping the one value with a scalar instruction. This matches std::simd assembly. On x86 a much more complex path that keeps values in registers is possible, but it's only 5-10% faster and doesn't seem to be worth the effort. It can be added in a follow-up PR if desired.
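The array-based approach can be sketched as follows. `Mask32x4` here is a hypothetical stand-in that wraps a plain array rather than a real SIMD register, so the "convert to array" and "convert back" steps are only simulated by copies.

```rust
/// Hypothetical stand-in for a 4-lane mask with 32-bit lanes (0 = false, -1 = true).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Mask32x4([i32; 4]);

impl Mask32x4 {
    /// Set one logical lane. Panics if `index` is out of bounds.
    fn set(&mut self, index: usize, value: bool) {
        assert!(
            index < 4,
            "mask lane index {index} is out of bounds for {} lanes",
            4
        );
        let mut lanes = self.0; // stands in for the vector-to-array conversion
        lanes[index] = if value { -1 } else { 0 }; // single scalar store
        self.0 = lanes; // stands in for the array-to-vector conversion
    }
}
```

The only lane-specific work is one scalar write, which keeps the implementation small at the cost of a trip through memory.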

test

test() is implemented via to_bitmask(), which deviates from the assembly produced by std::simd. It is faster than roundtripping through an array if your vector is already in registers, e.g. via a.simd_cmp(b).test(n) (8 cycles instead of 12 on llvm-mca). However, it is slower if the vector is already spilled to the stack. I've chosen to err on the side of not forcing stack spills. Reasonable people can disagree about the trade-offs here, but I don't think it really matters since this function is not performance-critical anyway.
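In scalar terms, the bitmask-based approach amounts to the sketch below. The `to_bitmask` helper is a hypothetical array-based stand-in for the backend-specific lowering, and `test_lane` is a hypothetical name for the single-lane query.

```rust
/// Hypothetical stand-in for the backend's to_bitmask(): lane i becomes bit i.
fn to_bitmask(lanes: [i32; 8]) -> u64 {
    let mut bits = 0u64;
    for (i, lane) in lanes.iter().enumerate() {
        if *lane != 0 {
            bits |= 1u64 << i;
        }
    }
    bits
}

/// Test one lane by extracting a single bit from the bitmask,
/// instead of storing the whole vector to an array and indexing it.
fn test_lane(lanes: [i32; 8], index: usize) -> bool {
    assert!(
        index < 8,
        "mask lane index {index} is out of bounds for {} lanes",
        8
    );
    (to_bitmask(lanes) >> index) & 1 != 0
}
```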

Collaborator

@LaurenzV LaurenzV left a comment

I have checked everything except the x86 backend (I might just rubber-stamp it in the end, let's see 😅). I do have some other concerns, though.

clippy::cast_possible_truncation,
clippy::unseparated_literal_suffix,
clippy::use_self,
clippy::wrong_self_convention,
Collaborator

Why is this needed?

Contributor Author

Clippy complains about functions such as fn from_bitmask_mask32x8(self, bits: u64) -> mask32x8<Self>;, and its objection is that

methods called from_* usually take no self

But it's on a trait where self is Simd and from_bitmask is a method implemented on the mask trait.

The only "fix" for this lint is to break the existing naming convention, so into the suppression list it goes.
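A reduced sketch of the situation, assuming a simplified backend trait: `Backend`, `Scalar` and the array-backed `Mask32x8` are hypothetical names, and the real signature returns `mask32x8<Self>` rather than a plain struct. The `from_*` method takes `self` because `self` is the backend token, not the value being converted.

```rust
/// Hypothetical backend token; stands in for the SIMD level type.
#[derive(Clone, Copy)]
struct Scalar;

/// Hypothetical array-backed mask (0 = false, -1 = true).
#[derive(Debug, PartialEq)]
struct Mask32x8([i32; 8]);

// `from_*` methods normally take no `self`, so Clippy flags this signature.
// Here `self` is the backend, so the lint is suppressed rather than renaming the method.
#[allow(clippy::wrong_self_convention)]
trait Backend: Copy {
    fn from_bitmask_mask32x8(self, bits: u64) -> Mask32x8;
}

impl Backend for Scalar {
    fn from_bitmask_mask32x8(self, bits: u64) -> Mask32x8 {
        let mut lanes = [0i32; 8];
        for (i, lane) in lanes.iter_mut().enumerate() {
            if (bits >> i) & 1 != 0 {
                *lane = -1;
            }
        }
        Mask32x8(lanes)
    }
}
```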

}
OpSig::MaskToBitmask => {
let arg0 = &arg_names[0];
quote! { (self, #arg0: #ty<Self>) -> u64 }
Collaborator

Would there be any benefit in using u8/u16/u32 for the bitmask wherever possible instead of always using u64?

Contributor Author

Not really. I started out doing that, but found that it complicated things for basically no gain, made this a lot harder to use in a generic context, and broke compatibility with the std::simd API. So I switched over to u64, just like std::simd.

We still have unused bits anyway, e.g. for a 32x4 mask, where we only use 4 bits out of a u8, so it's not like we can get rid of unused bits either way.

Collaborator

Looking at the current test structure, I just noticed that it seems a bit random. 😄 But not for this PR to figure out.

Comment thread fearless_simd_tests/tests/mask_methods.rs Outdated
Comment on lines +355 to +359
assert!(
index < #len,
"mask lane index {index} is out of bounds for {} lanes",
#len
);
Collaborator

Would it make sense to turn this into a debug assert and just document that setting a larger index is undefined behavior?

Contributor Author

@Shnatsel Shnatsel May 23, 2026

Well, we certainly don't want undefined behavior in safe code. We could go for unspecified behavior.

std::simd provides _unchecked variants of these methods, which are unsafe fn. I don't really want to do that since these methods really aren't performance-critical: if you are poking bits in a SIMD mask one by one and expect that to be fast, you're doing something wrong anyway.

If you really want to do something about this, we can either make the assert! emit fewer instructions along the lines of fast_assert (shameless plug) or add index % lanes in there somewhere, either instead of the panic or as a separate _unchecked function variant. But I don't think it's worth doing for the reasons mentioned above.
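To illustrate the `index % lanes` alternative mentioned above, here is a sketch that contrasts a panicking lookup with a wrapping one. Both operate on an already-computed u64 bitmask, and the names `test_checked`, `test_wrapping` and `LANES` are hypothetical; neither function is part of this PR.

```rust
const LANES: usize = 8;

/// Panicking variant: out-of-bounds indices are rejected.
fn test_checked(bits: u64, index: usize) -> bool {
    assert!(
        index < LANES,
        "mask lane index {index} is out of bounds for {} lanes",
        LANES
    );
    (bits >> index) & 1 != 0
}

/// Wrapping variant: out-of-bounds indices are reduced modulo the lane count.
/// The result is well defined but silently refers to a different lane.
fn test_wrapping(bits: u64, index: usize) -> bool {
    (bits >> (index % LANES)) & 1 != 0
}
```

The wrapping variant never panics and stays in safe code, but it hides indexing bugs: with 8 lanes, index 9 quietly reads lane 1.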

}
}

// Current backends store masks as signed integer lanes, so `set` uses a generic
Collaborator

For test this is not necessary?

Contributor Author

@Shnatsel Shnatsel May 23, 2026

That is one possible way to implement test() and that's what std::simd does, but this PR does it differently. We convert to a u64 bitmask instead of an array, for performance. See the PR description for details, under the "test" header.

/// Create a mask from a compact bitmask.
///
/// Bit `i` maps to lane `i`, with lane 0 in the least significant bit. Bits above
/// [`Self::N`] are ignored.
Collaborator

That sentence sounds a bit ambiguous, I presume what is meant is that any lane >= Self::N is ignored? Same for from_bitmask.

Contributor Author

Yes. The mask only stores N bits so anything that doesn't fit is ignored.
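A small sketch of that behavior, using a hypothetical array-backed model with at most 64 lanes (`lane_bits_mask` and this `from_bitmask` are illustrative helpers, not the PR's functions): bits at positions N and above have no lane to land in, so two inputs that differ only in those bits produce the same mask.

```rust
/// Mask with the low `n` bits set (all bits for n >= 64).
fn lane_bits_mask(n: usize) -> u64 {
    if n >= 64 { u64::MAX } else { (1u64 << n) - 1 }
}

/// Hypothetical model of from_bitmask for an N-lane mask, N <= 64.
/// Bits at positions >= N are discarded before expansion.
fn from_bitmask<const N: usize>(bits: u64) -> [i32; N] {
    assert!(N <= 64, "this sketch only models masks of up to 64 lanes");
    let kept = bits & lane_bits_mask(N);
    let mut lanes = [0i32; N];
    for (i, lane) in lanes.iter_mut().enumerate() {
        if (kept >> i) & 1 != 0 {
            *lane = -1;
        }
    }
    lanes
}
```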


/// Test whether one logical lane is set.
///
/// Panics if `index` is greater than or equal to the number of lanes in the mask.
Collaborator

Would it make sense to make the assertion a debug assertion instead and just say it's undefined behavior?

@LaurenzV
Collaborator

Also, thanks for the careful benchmarking you did!
