Add group capture support for Replace normalizer in Rust and Python #2080

Open
ander-db wants to merge 1 commit into huggingface:main from ander-db:feat/normalizer-regex-groups

Closes #1760

This PR adds support for backreferences (e.g., $1, $2, $0) in the replacement string of the Replace normalizer, enabling pattern-based reordering and transformation during normalization.
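
To make the intended semantics concrete, here is a minimal sketch of `$N` expansion written with Python's standard `re` module. It is not the code in this PR: the helper names `expand` and `replace_with_groups` are hypothetical, only numeric references and `$$` are handled, and references to missing groups expand to an empty string, which is my understanding of how the regex crate behaves.

```python
import re

# Matches "$$" (an escaped dollar sign) or "$" followed by digits.
_REF = re.compile(r"\$(\$|\d+)")


def expand(template: str, match: re.Match) -> str:
    """Expand $0, $1, ... in `template` using the groups of `match`."""

    def substitute(ref: re.Match) -> str:
        token = ref.group(1)
        if token == "$":
            return "$"  # "$$" becomes a literal dollar sign
        index = int(token)
        if index > match.re.groups:
            return ""  # assumption: unknown groups expand to nothing
        return match.group(index) or ""  # unmatched optional group -> ""

    return _REF.sub(substitute, template)


def replace_with_groups(pattern: str, template: str, text: str) -> str:
    """Replace every match of `pattern`, expanding $N in `template`."""
    return re.sub(pattern, lambda m: expand(template, m), text)


# Reordering two captured groups.
print(replace_with_groups(r"(\w+)@(\w+)", "$2 at $1", "user@example"))
# $0 refers to the whole match.
print(replace_with_groups(r"\d+", "[$0]", "a 12 b 3"))
```

The sketch passes a callable to `re.sub` so that Python's own `\1` template syntax never interprets the replacement string.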

Background

PR #1788 looked abandoned, so I picked up the work. This implementation leverages the regex crate's native replacen/replace_all for $N expansion instead of re-running the regex engine on each match.

Key changes

  • NormalizedString::replace_regex — new method (normalizer.rs:648) that accepts both a SysRegex (for finding matches — supports fancy-regex/onig features like lookahead) and an optional
    regex::Regex (for $1 expansion via replacen). Falls back to literal replacement when the pattern uses features regex::Regex can't compile (lookahead, etc.).
  • Replace struct (replace.rs) — stores expansion_re: Option<regex::Regex> compiled from the raw regex
    pattern. normalize() passes it to replace_regex for alignment-tracked replacement. decode_chain() uses regex::Regex::replace_all when available, otherwise falls back to the literal find_matches loop.
  • replace method — left completely unchanged (zero diff).
  • onig.rs / fancy.rs — unmodified from original.
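
The two-regex design described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of the Rust code: `ReplaceSketch`, `compile_expansion_re`, and `_expand` are hypothetical names, Python's `re` plays both the matching engine and the expansion engine, and the lookaround check is a stand-in for regex::Regex failing to compile such patterns.

```python
import re
from typing import Optional

# Stand-in for "features the expansion engine cannot compile".
_LOOKAROUND = ("(?=", "(?!", "(?<=", "(?<!")
_REF = re.compile(r"\$(\$|\d+)")


def compile_expansion_re(pattern: str) -> Optional[re.Pattern]:
    """Return a compiled pattern, or None when expansion is unsupported."""
    if any(token in pattern for token in _LOOKAROUND):
        return None
    try:
        return re.compile(pattern)
    except re.error:
        return None


def _expand(template: str, match: re.Match) -> str:
    """Expand $N references; unknown groups become empty strings."""

    def substitute(ref: re.Match) -> str:
        token = ref.group(1)
        if token == "$":
            return "$"
        index = int(token)
        return (match.group(index) or "") if index <= match.re.groups else ""

    return _REF.sub(substitute, template)


class ReplaceSketch:
    def __init__(self, pattern: str, content: str) -> None:
        self.finder = re.compile(pattern)  # always used to find matches
        self.expansion_re = compile_expansion_re(pattern)  # optional
        self.content = content

    def normalize_str(self, text: str) -> str:
        if self.expansion_re is None:
            # Fallback path: insert the replacement string literally.
            return self.finder.sub(lambda m: self.content, text)
        return self.expansion_re.sub(lambda m: _expand(self.content, m), text)


print(ReplaceSketch(r"(\d+)-(\d+)", "$2-$1").normalize_str("10-20"))
print(ReplaceSketch(r"(a)(?=b)", "$1!").normalize_str("ab"))
```

Keeping the expansion regex optional means a pattern that only the matching engine understands still normalizes, at the cost of `$1` being emitted verbatim for that pattern.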

Backwards compatibility

  • Patterns that have no capture groups, or that use lookahead/lookbehind, continue to work exactly as before (literal replacement).
  • Serialization format is unchanged (expansion_re is #[serde(skip)]).

Tests

  • Rust: 254 passed (202 unit + 21 doc + 31 integration), 0 failed
  • Python: 198 passed, 3 skipped (pre-existing), 0 failures related to changes
    test_replace_with_groups, test_replace_with_group_zero, test_replace_no_capture_unchanged all ✅

Note

My knowledge of Rust is limited, and this PR was developed with AI assistance. I've done my best to ensure correctness, but a careful review from someone more familiar with the codebase would be appreciated.

…alizer

Add support for backreferences like $1, $2 in the Replace normalizer
replacement string, enabling pattern-based reordering and transformation.

- Add replace_regex method to NormalizedString with expansion support
  using regex::Regex::replacen for efficient capture group expansion
- Store expansion_re: Option<regex::Regex> in Replace struct, compiled
  from the regex pattern and falling back to None for lookahead patterns
- Update decode_chain to use expansion_re.replace_all when available
- The original replace() method is left completely unchanged
- onig.rs and fancy.rs are completely unchanged
- Add Rust unit tests for group expansion in normalize and decode
- Add Python bindings tests for $1, $0 and no-capture cases

Closes huggingface#1760

Development

Successfully merging this pull request may close these issues.

normalizers.Replace able to support regex group capture

1 participant