Apply type_ids and sequence_id to overflow encodings in post-processors#2055
Open
1fanwang wants to merge 1 commit into
When truncation produces overflowing encodings before post-processing runs, the post-processor only updated `type_ids` / `sequence_id` on the main encoding, leaving overflows with stale values inherited from `encode_single_sequence`. Downstream code that reads the overflows back (token classification, span extraction, sliding-window inference) saw mismatched `type_ids` for the second sequence.

Three spots needed the same fix:

- `PostProcessor::process` default impl set `type_ids` on the main encoding but only `sequence_id` on overflows.
- `TemplateProcessing::apply_template` set `type_ids` / `sequence_id` from the template piece but never iterated overflows.
- `RobertaProcessing::process_encodings` forced `type_id=0` on the main encoding only, so pair-sequence overflows kept `type_id=1`.

The pre-existing `template_processing_overflowing` test had stale expected `type_ids` that reflected the buggy output; updated to the corrected values. Added focused regression tests in both `processors/template.rs` and `processors/roberta.rs`.

Closes huggingface#1908.
Closes #1908. Picks up where #1965 (closed unmerged for inactivity) left off — same idea, with regression tests so it doesn't slip again.
What

Three post-processors set `type_ids` / `sequence_id` on the main encoding but skip its overflows:

- `PostProcessor::process` default impl: `tokenizers/src/tokenizer/mod.rs:107`
- `TemplateProcessing::apply_template` for `Piece::Sequence`: `tokenizers/src/processors/template.rs:554`
- `RobertaProcessing::process_encodings`: `tokenizers/src/processors/roberta.rs:82`

So a sentence-pair tokenization that overflows (`truncation` + long inputs) emits the main encoding with correct `type_ids = [0, 0, ..., 1, 1, ...]` but every overflow has `type_ids = [0, 0, ...]`. Downstream span / NER / pair-classification code that reads overflows then sees the wrong sequence ownership.

Bert is structurally fine: it reads `encoding.get_type_ids()` per overflow when wrapping CLS/SEP, so once the default impl propagates `type_ids` to overflows, Bert composes them correctly.

Fix

At each of the three sites, iterate `get_overflowing_mut()` and apply the same `set_type_ids` (and `set_sequence_id` where applicable) the main encoding gets.

The pre-existing `template_processing_overflowing` test had stale `type_ids` baked in that matched the buggy output; those expectations were the bug. Updated four `[0, 0, 0, 0, 1]` lines to `[0, 0, 0, 1, 1]` and one `[0, 0, 0, 0, 0, 1]` to `[0, 0, 0, 0, 1, 1]`.

Tests

Two new regressions:

- `processors::template::tests::template_processing_overflow_type_ids`: sentence pair with non-zero `$B:1` type_id and pre-existing overflows on both A and B sides; asserts each overflow's `type_ids` matches its source sequence.
- `processors::roberta::tests::roberta_overflow_type_ids`: overflow starting with `type_id=1` must be coerced to 0 like the main encoding.
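
For reviewers who want the shape of the change without opening the diff, here is a minimal, self-contained sketch of the pattern. It is not the code in this PR: `Encoding` below is a stripped-down stand-in for the crate's type (only `ids`, `type_ids`, and the overflow list are modelled), and `set_type_id_with_overflows`, `demo_pair_second`, and `demo_roberta` are hypothetical names introduced purely for illustration. The method names on the stand-in mirror the ones mentioned above (`get_overflowing_mut`, `set_type_ids`, `get_type_ids`).

```rust
/// Simplified stand-in for the crate's `Encoding` (assumption: only the
/// fields needed to illustrate overflow handling are modelled).
#[derive(Debug, Clone)]
struct Encoding {
    ids: Vec<u32>,
    type_ids: Vec<u32>,
    overflowing: Vec<Encoding>,
}

impl Encoding {
    fn new(ids: Vec<u32>, type_ids: Vec<u32>, overflowing: Vec<Encoding>) -> Self {
        Encoding { ids, type_ids, overflowing }
    }
    fn len(&self) -> usize {
        self.ids.len()
    }
    fn get_type_ids(&self) -> &[u32] {
        &self.type_ids
    }
    fn set_type_ids(&mut self, type_ids: Vec<u32>) {
        self.type_ids = type_ids;
    }
    fn get_overflowing(&self) -> &[Encoding] {
        &self.overflowing
    }
    fn get_overflowing_mut(&mut self) -> &mut Vec<Encoding> {
        &mut self.overflowing
    }
}

/// Hypothetical helper: stamp one `type_id` on the main encoding and on
/// every overflow, instead of on the main encoding only.
fn set_type_id_with_overflows(encoding: &mut Encoding, type_id: u32) {
    let len = encoding.len();
    encoding.set_type_ids(vec![type_id; len]);
    for overflow in encoding.get_overflowing_mut().iter_mut() {
        let len = overflow.len();
        overflow.set_type_ids(vec![type_id; len]);
    }
}

/// Second sequence of a pair: the overflow starts with stale zeros and
/// should end up with type_id 1, like the main encoding.
fn demo_pair_second() -> Encoding {
    let overflow = Encoding::new(vec![12, 13], vec![0, 0], vec![]);
    let mut second = Encoding::new(vec![10, 11], vec![0, 0], vec![overflow]);
    set_type_id_with_overflows(&mut second, 1);
    second
}

/// Roberta-style case: the overflow starts with type_id 1 and should be
/// coerced to 0, like the main encoding.
fn demo_roberta() -> Encoding {
    let overflow = Encoding::new(vec![22, 23], vec![1, 1], vec![]);
    let mut encoding = Encoding::new(vec![20, 21], vec![1, 1], vec![overflow]);
    set_type_id_with_overflows(&mut encoding, 0);
    encoding
}

fn main() {
    let second = demo_pair_second();
    println!("pair main:        {:?}", second.get_type_ids());
    println!("pair overflow:    {:?}", second.get_overflowing()[0].get_type_ids());

    let roberta = demo_roberta();
    println!("roberta main:     {:?}", roberta.get_type_ids());
    println!("roberta overflow: {:?}", roberta.get_overflowing()[0].get_type_ids());
}
```

The sketch only walks one level of overflows, which matches the "iterate `get_overflowing_mut()`" description above; whether nested overflows can occur in the real crate is not something this example assumes either way.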