Apply type_ids and sequence_id to overflow encodings in post-processors by 1fanwang · Pull Request #2055 · huggingface/tokenizers

1fanwang · 2026-05-12T10:47:07Z

Closes #1908. Picks up where #1965 (closed unmerged for inactivity) left off — same idea, with regression tests so it doesn't slip again.

What

Three post-processors set type_ids / sequence_id on the main encoding but skip its overflows:

PostProcessor::process default impl — tokenizers/src/tokenizer/mod.rs:107
TemplateProcessing::apply_template for Piece::Sequence — tokenizers/src/processors/template.rs:554
RobertaProcessing::process_encodings — tokenizers/src/processors/roberta.rs:82

So a sentence-pair tokenization that overflows (truncation + long inputs) emits the main encoding with correct type_ids = [0, 0, ..., 1, 1, ...] but every overflow has type_ids = [0, 0, ...]. Downstream span / NER / pair-classification code that reads overflows then sees the wrong sequence ownership.
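
To make the symptom concrete, here is a minimal sketch of how a downstream consumer ends up with the wrong ownership. `ToyEncoding`, `stamp_main_only` and `count_type_id` are hypothetical stand-ins written for this illustration; they are not the `tokenizers` types or the actual code at the three sites.

```rust
/// Toy stand-in for an encoding that carries overflows.
/// Assumption: this is NOT `tokenizers::Encoding`, only a minimal model of it.
#[derive(Debug, Clone)]
struct ToyEncoding {
    type_ids: Vec<u32>,
    overflowing: Vec<ToyEncoding>,
}

impl ToyEncoding {
    /// Build an encoding of `len` tokens whose type_ids all start at 0.
    fn with_len(len: usize) -> Self {
        ToyEncoding {
            type_ids: vec![0; len],
            overflowing: Vec::new(),
        }
    }
}

/// Models the pre-fix behaviour: only the main encoding is stamped,
/// the overflows keep whatever type_ids they already had.
fn stamp_main_only(enc: &mut ToyEncoding, type_id: u32) {
    let len = enc.type_ids.len();
    enc.type_ids = vec![type_id; len];
}

/// Hypothetical downstream consumer: counts the tokens attributed to
/// `type_id` across the main encoding and all of its overflows.
fn count_type_id(enc: &ToyEncoding, type_id: u32) -> usize {
    std::iter::once(enc)
        .chain(enc.overflowing.iter())
        .map(|e| e.type_ids.iter().filter(|&&t| t == type_id).count())
        .sum()
}
```

With a second-sequence encoding of 3 tokens plus one overflow of 2 tokens, `stamp_main_only(&mut enc, 1)` followed by `count_type_id(&enc, 1)` yields 3 in this model, not 5: the two overflow tokens are still attributed to the first sequence.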

Bert is structurally fine — it reads encoding.get_type_ids() per overflow when wrapping CLS/SEP — so once the default impl propagates type_ids to overflows, Bert composes them correctly.

Fix

At each of the three sites, iterate get_overflowing_mut() and apply the same set_type_ids (and set_sequence_id where applicable) the main encoding gets.
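
The shape of the change can be sketched roughly as follows. This is a simplified model, not the diff itself: `ToyEncoding` is a hypothetical struct whose method names mirror the ones mentioned above (`get_overflowing_mut`, `set_type_ids`, `set_sequence_id`), and storing the sequence id as a plain `Option<usize>` is an assumption made to keep the example self-contained.

```rust
/// Toy stand-in for an encoding with overflows.
/// Assumption: a simplified model, NOT `tokenizers::Encoding`.
#[derive(Debug, Clone)]
struct ToyEncoding {
    ids: Vec<u32>,
    type_ids: Vec<u32>,
    sequence_id: Option<usize>,
    overflowing: Vec<ToyEncoding>,
}

impl ToyEncoding {
    /// New encoding with all type_ids at 0 and no sequence id yet.
    fn new(ids: Vec<u32>) -> Self {
        let type_ids = vec![0; ids.len()];
        ToyEncoding { ids, type_ids, sequence_id: None, overflowing: Vec::new() }
    }

    fn len(&self) -> usize {
        self.ids.len()
    }

    fn set_type_ids(&mut self, type_ids: Vec<u32>) {
        self.type_ids = type_ids;
    }

    fn set_sequence_id(&mut self, sequence_id: usize) {
        self.sequence_id = Some(sequence_id);
    }

    fn get_overflowing_mut(&mut self) -> &mut Vec<ToyEncoding> {
        &mut self.overflowing
    }
}

/// Stamp the main encoding, then give every overflow the same treatment.
fn stamp_with_overflows(enc: &mut ToyEncoding, type_id: u32, sequence_id: usize) {
    let len = enc.len();
    enc.set_type_ids(vec![type_id; len]);
    enc.set_sequence_id(sequence_id);

    for overflow in enc.get_overflowing_mut().iter_mut() {
        // Each overflow has its own length, so size the vector per overflow.
        let len = overflow.len();
        overflow.set_type_ids(vec![type_id; len]);
        overflow.set_sequence_id(sequence_id);
    }
}
```

The point of the sketch is only that the overflow loop applies exactly what the main encoding receives, sized to each overflow's own length.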

The pre-existing template_processing_overflowing test had stale type_ids baked in that matched the buggy output — those expectations were the bug. Updated four [0, 0, 0, 0, 1] lines to [0, 0, 0, 1, 1] and one [0, 0, 0, 0, 0, 1] to [0, 0, 0, 0, 1, 1].

Tests

Two new regressions:

processors::template::tests::template_processing_overflow_type_ids — sentence pair with non-zero $B:1 type_id and pre-existing overflows on both A and B sides, asserts each overflow's type_ids matches its source sequence.
processors::roberta::tests::roberta_overflow_type_ids — overflow starting with type_id=1 must be coerced to 0 like the main encoding.
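
For readers who want the Roberta case in miniature, the property under test looks roughly like this. The sketch uses a hypothetical `ToyEncoding` and helper names of my own; it is not the test added in `processors/roberta.rs`.

```rust
/// Toy stand-in for an encoding with overflows.
/// Assumption: NOT `tokenizers::Encoding`, only a minimal model.
#[derive(Debug, Clone)]
struct ToyEncoding {
    type_ids: Vec<u32>,
    overflowing: Vec<ToyEncoding>,
}

/// Force every type_id to 0 on the main encoding and on each overflow.
fn zero_type_ids(enc: &mut ToyEncoding) {
    enc.type_ids.iter_mut().for_each(|t| *t = 0);
    for overflow in enc.overflowing.iter_mut() {
        overflow.type_ids.iter_mut().for_each(|t| *t = 0);
    }
}

/// True when the main encoding and all overflows carry only type_id 0.
fn all_type_ids_zero(enc: &ToyEncoding) -> bool {
    enc.type_ids.iter().all(|&t| t == 0)
        && enc
            .overflowing
            .iter()
            .all(|o| o.type_ids.iter().all(|&t| t == 0))
}
```

An encoding whose overflow starts with `type_ids = [1, 1]` fails `all_type_ids_zero` before `zero_type_ids` runs and passes afterwards.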

When truncation produces overflowing encodings before post-processing runs, the post-processor only updated `type_ids` / `sequence_id` on the main encoding, leaving overflows with stale values inherited from `encode_single_sequence`. Downstream code that reads the overflows back (token classification, span extraction, sliding-window inference) saw mismatched type_ids for the second sequence.

Three spots needed the same fix:

- `PostProcessor::process` default impl set `type_ids` on the main encoding but only `sequence_id` on overflows.
- `TemplateProcessing::apply_template` set `type_ids` / `sequence_id` from the template piece but never iterated overflows.
- `RobertaProcessing::process_encodings` forced `type_id=0` on the main encoding only, so pair-sequence overflows kept `type_id=1`.

The pre-existing `template_processing_overflowing` test had stale expected `type_ids` that reflected the buggy output; updated to the corrected values. Added focused regression tests in both `processors/template.rs` and `processors/roberta.rs`.

Closes huggingface#1908.

1fanwang wants to merge 1 commit into huggingface:main from 1fanwang:fix/post-processor-overflow-type-ids