fix: caseless match for arm64 and proper check for win32 #97
Merged
Code Review
The changes in CMakeLists.txt improve the robustness of the Windows build configuration. Specifically, the processor check for ARM64 is now case-insensitive by converting the processor string to uppercase, and the library extension logic has been updated to use WIN32 instead of MSVC to ensure correct naming across different Windows compilers. I have no feedback to provide.
Contributor
Author
@MasterJH5574 Mind giving this a review? We have the Windows build failing with non-MSVC compilers (e.g. clang). Thank you!
MasterJH5574
approved these changes
May 19, 2026
Member
@vinovo Thanks for pinging me. The PR looks good, thank you!
Summary
Two small CMake fixes that make tokenizers-cpp build robustly on Windows
across different compilers, generators, and toolchain files. The previous
logic happened to work for the common MSVC + Visual Studio generator case
but was not robust to other valid Windows configurations.
Changes
1. Pick .lib vs .a based on the OS, not the host compiler
rustc on Windows always emits tokenizers_c.lib regardless of which
C/C++ compiler CMake is configured with; the file name is chosen by Rust,
not by the host compiler. The previous gate if(MSVC) is only true for
cl.exe and compilers that emulate its command line (such as clang-cl), so
with any other Windows compiler (where MSVC is FALSE) the build was
expecting libtokenizers_c.a. The custom command's OUTPUT was therefore never
produced and the post-build copy step failed.
Switching the gate to if(WIN32) matches what Rust actually produces and
makes the rule depend on the target OS rather than on which host compiler
happens to be in use.
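To illustrate the shape of this change, here is a minimal CMake sketch of gating the library name on the target OS. The variable name TOKENIZERS_RUST_LIB_NAME is hypothetical and is not taken from the project's CMakeLists.txt; only the if(WIN32) gate and the two file names come from the description above.

```cmake
# Sketch only: TOKENIZERS_RUST_LIB_NAME is a hypothetical variable name.
# WIN32 is true whenever the target system is Windows, whichever
# C/C++ compiler CMake was configured with.
if(WIN32)
  # File name Rust produces on Windows, per the description above.
  set(TOKENIZERS_RUST_LIB_NAME "tokenizers_c.lib")
else()
  # Conventional static-library name elsewhere.
  set(TOKENIZERS_RUST_LIB_NAME "libtokenizers_c.a")
endif()

message(STATUS "Expecting Rust static library: ${TOKENIZERS_RUST_LIB_NAME}")
```

With a gate like this, the name used as the custom command's OUTPUT and by the post-build copy step follows the target OS, so it stays the same when the host compiler changes.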
2. Case-insensitive CMAKE_SYSTEM_PROCESSOR match on Windows
The Windows branch only accepted the exact spellings ARM64 (uppercase)
or aarch64 (lowercase). The uppercase form is filled in by some
generators (e.g. Visual Studio invoked with -A ARM64); other valid
configurations such as Ninja + a toolchain file may use mixed/lowercase
arm64, which silently fell through to the else branch and selected
x86_64-pc-windows-msvc. Rust then cross-built the wrong architecture
and the resulting .lib could not link against the rest of the arm64
build.
Normalizing CMAKE_SYSTEM_PROCESSOR with string(TOUPPER ...) before
comparing makes the detection robust to whichever spelling the generator
or toolchain file happens to use.
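A minimal sketch of the normalization, for illustration. The variable names SYSTEM_PROCESSOR_UPPER and RUST_TARGET are hypothetical, and the aarch64-pc-windows-msvc triple is an assumption about what the arm64 branch selects; only string(TOUPPER ...) and x86_64-pc-windows-msvc appear in the description above.

```cmake
# Sketch only: SYSTEM_PROCESSOR_UPPER and RUST_TARGET are hypothetical names.
# Upper-case the processor string once so that "arm64", "Arm64", "ARM64",
# and "aarch64" are all handled by the same comparison.
string(TOUPPER "${CMAKE_SYSTEM_PROCESSOR}" SYSTEM_PROCESSOR_UPPER)

if(WIN32)
  if(SYSTEM_PROCESSOR_UPPER STREQUAL "ARM64" OR
     SYSTEM_PROCESSOR_UPPER STREQUAL "AARCH64")
    # Assumed Rust target triple for Windows on arm64.
    set(RUST_TARGET "aarch64-pc-windows-msvc")
  else()
    set(RUST_TARGET "x86_64-pc-windows-msvc")
  endif()
  message(STATUS "Rust target: ${RUST_TARGET}")
endif()
```

Comparing against the upper-cased copy leaves CMAKE_SYSTEM_PROCESSOR itself untouched, so any other logic that reads the variable keeps seeing the value the generator or toolchain file set.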
Bonus: sentencepiece submodule bump
This PR also bumps the sentencepiece submodule from 11051e3 to a899e9a, picking up upstream's migration from the bundled abseil-compatibility shim to the official Abseil library (LTS 20260107.1, fetched via FetchContent at configure time). The old third_party/absl/* shim has been removed upstream; the build now pulls real Abseil and creates a sentencepiece/third_party/absl -> abseil-cpp/absl symlink (file(CREATE_LINK ... SYMBOLIC)).
The public SentencePieceProcessor API surface that src/sentencepiece_tokenizer.cc depends on (LoadFromSerializedProto, Encode, Decode, GetPieceSize, IdToPiece, PieceToId) is unchanged, so no source changes are required in tokenizers-cpp. The default SPM_ABSL_PROVIDER is now "module" (previously "internal"); this is left at the upstream default.
Verification
Verified end-to-end on Windows arm64 (MSVC 19.50, VS 2026 generator, with the two CMake fixes above):
Configure: FetchContent clones abseil-cpp at tag 20260107.1; symlink creation under sentencepiece/third_party/absl succeeds.
Build: Abseil + sentencepiece + Rust tokenizers-c + tokenizers_cpp.lib all compile and link cleanly.
Runtime: ran example/example.exe against the four tokenizer fixtures used by build_and_run.sh:
tokenizer.model (Vicuna 7B): [1724, 338, 278, 29871, 7483, 310, 7400, 29973], decode round-trip ✅, vocab 32000
tokenizer.json (RedPajama-3B)
vocab.json + merges.txt (Qwen2.5-3B)
tokenizer_model
The SentencePiece IDs match the expected LLaMA SP tokenization (including the 29871 whitespace marker for the doubled space), IdToPiece / PieceToId round-trip on the sampled IDs, and the decode assertion in TestTokenizer passes, confirming the abseil migration is behavior-preserving for the API used by this project.