Add APIs for case folding to the standard library #154742

Open
Jules-Bertholet wants to merge 5 commits into rust-lang:main from Jules-Bertholet:casefold

@Jules-Bertholet
Contributor

@Jules-Bertholet Jules-Bertholet commented Apr 3, 2026

Libs-api requested these, so here they are.

New public API (gated behind #![feature(casefold)]):

impl char {
    pub fn to_casefold(self) -> ToCasefold;
}

impl str {
    pub fn to_casefold(&self) -> String;
    pub fn eq_ignore_case(&self, other: &str) -> bool;
}

pub struct ToCasefold { ... }
impl Iterator for ToCasefold { type Item = char; ... }
impl DoubleEndedIterator for ToCasefold { ... }
impl FusedIterator for ToCasefold { }
impl ExactSizeIterator for ToCasefold { ... }
impl fmt::Display for ToCasefold { ... }
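
For readers unfamiliar with this trait set: the existing, stable `ToUppercase` iterator returned by `char::to_uppercase` implements the same traits, so it may serve as a rough analogy for how `ToCasefold` could be used. The sketch below uses only stable APIs; `summarize` is a hypothetical helper, and nothing here is taken from the PR's implementation.

```rust
/// Hypothetical helper: exercises the trait set that `ToUppercase` already
/// provides and that the proposed `ToCasefold` is listed as implementing.
fn summarize(c: char) -> (usize, String, Option<char>) {
    let mut upper = c.to_uppercase();
    let len = upper.len(); // ExactSizeIterator
    let text = upper.to_string(); // fmt::Display (borrows, does not consume)
    let last = upper.next_back(); // DoubleEndedIterator
    (len, text, last)
}

fn main() {
    // 'ß' uppercases to the two characters "SS".
    assert_eq!(summarize('ß'), (2, "SS".to_string(), Some('S')));
    assert_eq!(summarize('a'), (1, "A".to_string(), Some('A')));
}
```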

Notes

  • This only adds a negligible amount of static data to core::unicode. To accomplish that, we compute the case-folding for most characters as the lowercase of their uppercase; this double mapping adds some complexity to the implementation.
  • No normalization (e.g. NFC) is performed, so visually and semantically equivalent strings can compare unequal.
  • I have not put any effort into optimizing eq_ignore_case(); there may be a more performant implementation.
  • char::eq_ignore_case() is left to future work—it's a potential footgun, so we may want to think more deeply about how to expose and document that API.
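
The "lowercase of the uppercase" idea from the first note can be approximated with stable standard-library calls. This is a minimal sketch, not the PR's code: `fold_chars`, `approx_casefold`, and `approx_eq_ignore_case` are hypothetical names, and the sketch omits the small exception table that the note implies, so some characters will fold differently from true Unicode case folding.

```rust
/// Hypothetical helper (not from the PR): maps each char to the lowercase
/// of its uppercase, which matches case folding for most characters.
fn fold_chars(s: &str) -> impl Iterator<Item = char> + '_ {
    s.chars()
        .flat_map(char::to_uppercase)
        .flat_map(char::to_lowercase)
}

/// Hypothetical helper: collects the approximate fold into a `String`.
fn approx_casefold(s: &str) -> String {
    fold_chars(s).collect()
}

/// Hypothetical helper: compares the approximate folds lazily,
/// without allocating intermediate strings.
fn approx_eq_ignore_case(a: &str, b: &str) -> bool {
    fold_chars(a).eq(fold_chars(b))
}

fn main() {
    // 'ß' uppercases to "SS", which lowercases to "ss".
    assert_eq!(approx_casefold("Straße"), "strasse");
    assert!(approx_eq_ignore_case("Straße", "STRASSE"));
    assert!(!approx_eq_ignore_case("Straße", "Strasbourg"));
}
```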

@rustbot label T-libs-api A-unicode

@rustbot
Collaborator

rustbot commented Apr 3, 2026

These commits modify the Cargo.lock file. Unintentional changes to Cargo.lock can be introduced when switching branches and rebasing PRs.

If this was unintentional then you should revert the changes before this PR is merged.
Otherwise, you can ignore this comment.

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 3, 2026
@rustbot
Collaborator

rustbot commented Apr 3, 2026

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Why was this reviewer chosen?

The reviewer was selected based on:

  • Owners of files modified in this PR: @scottmcm, libs
  • @scottmcm, libs expanded to 8 candidates
  • Random selection from Mark-Simulacrum, jhpratt, scottmcm

@rustbot rustbot added A-Unicode Area: Unicode T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Apr 3, 2026
@Jules-Bertholet Jules-Bertholet force-pushed the casefold branch 2 times, most recently from 5b5e617 to bf4ee7c Compare April 3, 2026 13:25
@scottmcm
Member

@rustbot reroll

@rustbot rustbot assigned jhpratt and unassigned scottmcm Apr 16, 2026
@rustbot
Collaborator

rustbot commented Apr 18, 2026

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

With an unoptimized, non-`const` implementation for now.
@jhpratt
Member

jhpratt commented Apr 18, 2026

LGTM. r=me once CI passes.

@Jules-Bertholet
Contributor Author

r? libs-api

@rustbot rustbot assigned the8472 and unassigned jhpratt Apr 18, 2026
@jhpratt
Member

jhpratt commented Apr 18, 2026

I don't mind, but any particular reason for the reassign? I thought it was good to go.

@Jules-Bertholet
Contributor Author

Jules-Bertholet commented Apr 18, 2026

The API needs libs-API approval, I believe. They expressed interest in something like this, but there was never an ACP. (I also need to add a tracking issue after I get that.)

@clarfonthey clarfonthey added the I-libs-api-nominated Nominated for discussion during a libs-api team meeting. label May 18, 2026
@clarfonthey
Contributor

clarfonthey commented May 18, 2026

Figure the easiest way to get a decision on this would be to just nominate it for a meeting, but a few notes here:

  1. I opened an ACP covering generally whether we want Unicode casing data in libcore: "ACP: Expose more Unicode casing data in libcore" (libs-team#530)
  2. Since then, a more-scoped ACP covering title-case APIs was added: "Add titlecase APIs to char" (libs-team#354), plus "Tracking Issue for titlecase handling in core::char" (#153892)
  3. My ACP was stalled since Manish was deferred to as the Unicode expert, and the expert opinion was ¯\_(ツ)_/¯

So, Jules, if you'd like to write up a bit more on the motivation behind this + the situation and put that in an ACP, that would probably be easiest to review, but otherwise it feels more just like something needs to be decided on this. Basically, it ultimately boils down to whether case-folding is too niche to include in the standard library.

I think that the main benefit is so that the case-folding tables can be deduplicated across crates that need them, but that's just a vague vibe-check and no idea how many people want this in practice.

@Jules-Bertholet
Contributor Author

Jules-Bertholet commented May 18, 2026

@clarfonthey Again, this PR is a response to a request from libs-API, who wanted to see if this could be done with low implementation burden. So I don't think I'm in the best position to speak to the broader motivation for why case folding is useful. What I can speak to is the practical considerations of putting it in the standard library.

On that front, I see one major argument in favor, and one against:

  • In favor: Putting this in the stdlib is vastly more efficient for binary sizes, compared to an external crate. It's not just that we can de-duplicate across crates; the real win is being able to de-duplicate with the other casing tables already in the standard library. Almost all characters case-fold to the lowercase of their uppercase, and this PR uses that to make the case-fold data table only 32 bytes. External crates like unicase cannot take advantage of this, because they can't guarantee that the Unicode version they were built with matches that of the standard library's casing tables.
  • Against: The implementation in this PR does not perform any form of normalization (e.g. NFC/NFKC). In practice, many use-cases for Unicode-aware case folding also want normalization, so not doing it is arguably a footgun. Adding NFC/NFKC to the standard library would be a much larger project. (Then again, unicase doesn't do normalization either, and still gets plenty of users.)
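
To make the normalization concern concrete, here is a small sketch using only stable APIs. It assumes a hypothetical `approx_eq_ignore_case` helper that folds each character as the lowercase of its uppercase (an approximation, not the PR's implementation) and, like the PR, performs no normalization. Two spellings of "École" that render identically then compare unequal.

```rust
/// Hypothetical helper (not from the PR): approximate per-character fold,
/// with no NFC/NFKC normalization applied.
fn fold_chars(s: &str) -> impl Iterator<Item = char> + '_ {
    s.chars()
        .flat_map(char::to_uppercase)
        .flat_map(char::to_lowercase)
}

/// Hypothetical helper: case-insensitive comparison without normalization.
fn approx_eq_ignore_case(a: &str, b: &str) -> bool {
    fold_chars(a).eq(fold_chars(b))
}

fn main() {
    let precomposed = "\u{C9}cole"; // "École" using U+00C9
    let decomposed = "E\u{301}cole"; // "École" as 'E' + U+0301 combining acute

    // Case differences alone are handled.
    assert!(approx_eq_ignore_case(precomposed, "\u{E9}cole"));
    // Different normalization forms are not: the code points still differ.
    assert!(!approx_eq_ignore_case(precomposed, decomposed));
}
```

A caller that needs these to match would have to normalize both inputs first, which requires an external crate today.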

@clarfonthey
Contributor

clarfonthey commented May 18, 2026

Based upon my interpretation of the comment, I saw it as libs-api being more comfortable adding an API like this, but since there still needs to be libs-api approval on this specific flavour of API, just doing it at the next meeting (probably in a week, since this week is Rust Week) is the easiest way to get approval. There's loads of bikeshedding we could do on e.g. the name, but I figure that's something that can be figured out later; it's really more about whether the team is comfortable committing to the way this is done right now.
