Make mask types opaque#218
Conversation
daee6e0 to
8dead9c
Compare
|
The test failure from a completely unrelated part is spooky. I checked it for UB under miri with various CPU feature combinations, but it passes there, as well as on real hardware (locally and in CI) and on Intel's emulator on CI. I'm pinning it on a Rosetta bug, since real hardwware and 2 other emulators all pass. When running x86 code on the macos-latest image, it runs through Rosetta. We need to switch over to |
|
Porting vello was trivial: https://github.com/linebender/vello/compare/main...Shnatsel:opaque-masks?expand=1 So this is ready for review. |
LaurenzV
left a comment
There was a problem hiding this comment.
Haven't finished looking through it, just some first comments.
| fn as_array_mask8x16(self, a: mask8x16<Self>) -> [i8; 16usize] { | ||
| unsafe { core::mem::transmute::<__m128i, [i8; 16usize]>(a.val.0) } | ||
| } |
There was a problem hiding this comment.
Why can we still keep this method? Wouldn't this also expect that the inner representation is backed by actual integers?
There was a problem hiding this comment.
And similarly, how would load_array_mask8x16 be implemented for AVX512 if it takes the array as an integer array?
There was a problem hiding this comment.
We can't keep Deref into an integer slice/array, but explicit conversion into an owned instance is perfectly fine and still useful. So conversions to/from the integer array representation are left intact in the API on purpose. They're the cheapest way to create a mask right now, and are decently cheap even on AVX-512 (a single instruction).
| @@ -392,36 +396,12 @@ pub trait Simd: | |||
| fn widen_u8x16(self, a: u8x16<Self>) -> u16x16<Self>; | |||
| #[doc = "Reinterpret the bits of this vector as a vector of `u32` elements.\n\nThe total bit width is preserved; the number of elements changes accordingly."] | |||
| fn reinterpret_u32_u8x16(self, a: u8x16<Self>) -> u32x4<Self>; | |||
| #[doc = "Create a SIMD vector with all elements set to the given value."] | |||
| #[doc = "Create a SIMD mask with all lanes set from the given signed integer mask value."] | |||
| fn splat_mask8x16(self, val: i8) -> mask8x16<Self>; | |||
There was a problem hiding this comment.
How would splat_mask be implemented for AVX-512 since the representation of a single entry is always just a single bit?
I guess more fundamentally, the question is how we fundamentally want to allow construction of masks while preventing hidden footguns due to the exact internals of the SIMD level. For example, I imagine a from_bitmask method would be fast for AVX-512 since its the native representation but slow on NEON since you need to expand the bitmask manually, while a from_array, as it is now, is the best for NEON but would require manually encoding the mask into a single bitmask. I'm not sure what the best thing to do is here, but probably we can take inspiration from portable SIMD?
There was a problem hiding this comment.
It would check if the value is 0, and treat any other value as 1.
I've actually started this work by taking inspiration from std::simd, and I have a local branch that is a much more complete mirror of std::simd APIs. This PR is a stripped down, MVP version of it.
std::simd only implements construction of masks from bits represented as a u64 and from arrays of booleans; not from [i32; 4]. Their trade-off is greater portability but worse performance. I implemented that at first, then rolled back the removal of integer-based APIs because the assembly for bool arrays looked gnarly.
Now that you point it out, I agree it's probably better to mirror std::simd and make splat operate on booleans. I guess writing == 0 is not too much to ask of the users. I'll make that change to better align with std::simd in this case.
There was a problem hiding this comment.
I've experimented with this and looked at the assembly; it does add a neg instruction in the hot path for non-constant inputs on x86, so it's technically less efficient, but that's probably worth it for the nicer API.
…g std::simd API. Adds a `neg` on the hot path on x86 for non-constant inputs, but seems cheap enough and worth it for the nicer API.
… actual integer values involved, now that we have boolean APIs and the conversions are relevant
…re not clashing with std::simd API of the same name
|
I'd like to add to_bitmask()/from_bitmask() as well as |
|
Will try to review the rest after RustWeek! |
LaurenzV
left a comment
There was a problem hiding this comment.
LGTM, I haven't tried to understand the new codegen logic very thoroughly, but the new generated code seems good to me and I agree that making the mask representation opaque is the right move towards getting AVX-512 working. It's a bit unfortunate that there is now more code duplication in the codegen code, but not a big deal.
| "A SIMD mask of {len} {scalar_bits}-bit elements.\n\n\ | ||
| When created from a comparison operation, and as it should be used in a [`Self::select`] operation, each element will be all ones if it's \"true\", and all zeroes if it's \"false\".", | ||
| "A SIMD mask of {len} logical lanes corresponding to {scalar_bits}-bit vector elements.\n\n\ | ||
| The storage representation of this type is intentionally opaque. For compatibility with existing APIs, it may be converted to and from signed integer lanes where false is encoded as all zeroes (integer value 0) and true is encoded as all ones (integer value -1).", |
There was a problem hiding this comment.
Maybe we should add a note that the conversion to/from signed integers might not be zero-cost for all levels though?
There was a problem hiding this comment.
I'll describe the efficiency in a follow-up where I add other conversion paths
| "Behavior on mask elements that are not all zeroes or all ones is unspecified. It may vary depending on architecture, feature level, the mask elements' width, the mask vector's width, or library version.\n\n\ | ||
| The behavior is also not guaranteed to be logically consistent if mask elements are not all zeroes or all ones. `any_true` may not return the same result as `!all_false`, and `all_true` may not return the same result as `!any_false`.\n\n\ | ||
| The [`select`](crate::Select::select) operation also has unspecified behavior for mask elements that are not all zeroes or all ones. That behavior may not match the behavior of this operation." | ||
| "Masks may be converted to and from signed integer lane arrays for compatibility with older APIs. For those conversions, false is encoded as all zeroes (integer value 0) and true is encoded as all ones (integer value -1).\n\n\ |
|
It seems like you might have to rebase or open a new PR to make CI work correctly again. |
`macos-latest` is ARM, and runs x86 binaries through rosetta. We need to specify the Intel runner explicitly to get x86 hardware. Motivation: linebender#218 (comment) Our test suite is now so thorough that it caught a Rosetta bug :sunglasses:
|
I've reduced the duplication in the generation of splat(). There is more refactoring we could do, but I haven't found the right cut points for shared logic yet. The conditions end up being more awkward and hard to follow when they're not defined at the leaf. |
This is necessary for eventual AVX-512 support. Part of #179.
This does not add any AVX-512 stuff yet, just lays the groundwork by abstracting away the internal representation.
Contrary to what the description of #196 said, we don't actually need i64 vectors in the public API so long as we're not providing direct conversions between mask types and integer vector types, which aren't available even on main.
Summary of changes:
SimdMask<S>trait independent fromSimdBase<S>.DerefBytesSimdSplit/SimdCombineslide/slide_within_blocks& | ^ !but notmask & -1.This was extricated from a larger changeset I had locally that also added some std::simd API compatibility functions, but that was getting too complicated to review. I'm happy to add more APIs if there's desire and review capacity for them.
A port of vello to this API can be found here.