Add group capture support for Replace normalizer in Rust and Python #2080
Open
ander-db wants to merge 1 commit into
…alizer

Add support for backreferences like $1, $2 in the Replace normalizer replacement string, enabling pattern-based reordering and transformation.

- Add replace_regex method to NormalizedString with expansion support using regex::Regex::replacen for efficient capture group expansion
- Store expansion_re: Option<regex::Regex> in Replace struct, compiled from the regex pattern and falling back to None for lookahead patterns
- Update decode_chain to use expansion_re.replace_all when available
- The original replace() method is left completely unchanged
- onig.rs and fancy.rs are completely unchanged
- Add Rust unit tests for group expansion in normalize and decode
- Add Python bindings tests for $1, $0 and no-capture cases

Closes huggingface#1760
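To make the intended `$N` semantics concrete, here is a minimal sketch in plain Python using only the standard `re` module. It is not the PR's Rust implementation and does no alignment tracking. The helper names `expand` and `replace_with_groups` are hypothetical, and only numeric references (`$0`, `$1`, ...) are handled. Python's `re` uses `\1` natively, so the sketch translates `$N` by hand. References to groups that do not exist expand to the empty string, which is how the regex crate documents its own expansion.

```python
import re

# Matches numeric group references such as $0, $1, $12 in a template.
_GROUP_REF = re.compile(r"\$(\d+)")


def expand(template: str, match: re.Match) -> str:
    """Substitute $0, $1, ... in `template` with the groups of `match`."""

    def group_text(ref: re.Match) -> str:
        index = int(ref.group(1))
        # Unknown group index: expand to nothing instead of raising.
        if index > match.re.groups:
            return ""
        # A group that did not participate in the match is None in Python.
        return match.group(index) or ""

    return _GROUP_REF.sub(group_text, template)


def replace_with_groups(pattern: str, template: str, text: str) -> str:
    """Replace every match of `pattern` in `text`, expanding $N references."""
    return re.sub(pattern, lambda m: expand(template, m), text)


if __name__ == "__main__":
    # Reordering with $1 / $2.
    print(replace_with_groups(r"(\w+) (\w+)", "$2 $1", "hello world"))
    # $0 refers to the whole match.
    print(replace_with_groups(r"\d+", "[$0]", "a1b22"))
    # No group references: behaves like a plain replacement.
    print(replace_with_groups("a", "b", "banana"))
```

The three calls correspond loosely to the cases the commit message says are tested ($1, $0, and no capture); the inputs here are illustrative and not taken from the PR's test suite.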
Closes #1760
This PR adds support for backreferences (e.g., $1, $2, $0) in the replacement string of the Replace normalizer, enabling pattern-based reordering and transformation during normalization.

Background

PR #1788 looked abandoned, so I picked up the work. This implementation leverages the regex crate's native replacen / replace_all for $N expansion instead of re-running the regex engine on each match.

Key changes

- NormalizedString::replace_regex: new method (normalizer.rs:648) that accepts both a SysRegex (for finding matches; supports fancy-regex/onig features like lookahead) and an optional regex::Regex (for $1 expansion via replacen). Falls back to literal replacement when the pattern uses features regex::Regex can't compile (lookahead, etc.).
- Replace struct (replace.rs): stores expansion_re: Option<regex::Regex> compiled from the raw regex pattern.
  - normalize() passes it to replace_regex for alignment-tracked replacement.
  - decode_chain() uses regex::Regex::replace_all when available, otherwise falls back to the literal find_matches loop.
- replace method: left completely unchanged (zero diff).
- onig.rs / fancy.rs: unmodified from original.

Backwards compatibility

expansion_re is #[serde(skip)]).

Tests
—
test_replace_with_groups, test_replace_with_group_zero, test_replace_no_capture_unchanged all ✅

Note
My knowledge of Rust is limited, and this PR was developed with AI assistance. I've done my best to ensure correctness, but a careful review from someone more familiar with the codebase would be appreciated.