
Apply type_ids and sequence_id to overflow encodings in post-processors #2055

Open: 1fanwang wants to merge 1 commit into huggingface:main from 1fanwang:fix/post-processor-overflow-type-ids


1fanwang commented May 12, 2026

Closes #1908. Picks up where #1965 (closed unmerged for inactivity) left off — same idea, with regression tests so it doesn't slip again.

What

Three post-processors set type_ids / sequence_id on the main encoding but skip its overflows:

  1. PostProcessor::process default impl (tokenizers/src/tokenizer/mod.rs:107)
  2. TemplateProcessing::apply_template for Piece::Sequence (tokenizers/src/processors/template.rs:554)
  3. RobertaProcessing::process_encodings (tokenizers/src/processors/roberta.rs:82)

So a sentence-pair tokenization that overflows (truncation + long inputs) emits the main encoding with correct type_ids = [0, 0, ..., 1, 1, ...] but every overflow has type_ids = [0, 0, ...]. Downstream span / NER / pair-classification code that reads overflows then sees the wrong sequence ownership.

Bert is structurally fine — it reads encoding.get_type_ids() per overflow when wrapping CLS/SEP — so once the default impl propagates type_ids to overflows, Bert composes them correctly.
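To illustrate why Bert only needs correct inputs, here is a minimal sketch of wrapping per-encoding type_ids with special tokens. It is not the crate's code: the helper names `wrap_first` and `wrap_pair` are hypothetical, and the assumption that the pair side gets a single trailing separator with type id 1 is mine.

```rust
/// Hypothetical helper: type ids for `[CLS] A... [SEP]`.
/// The existing type ids are read from the encoding and kept as-is.
fn wrap_first(type_ids: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(type_ids.len() + 2);
    out.push(0); // [CLS]
    out.extend_from_slice(type_ids);
    out.push(0); // [SEP]
    out
}

/// Hypothetical helper: type ids for `B... [SEP]` on the pair side.
/// Assumption: the trailing separator carries type id 1.
fn wrap_pair(type_ids: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(type_ids.len() + 1);
    out.extend_from_slice(type_ids);
    out.push(1); // [SEP]
    out
}
```

Under these assumptions, a pair-side overflow that already carries `[1, 1]` wraps to `[1, 1, 1]`, while one left at the stale `[0, 0]` wraps to `[0, 0, 1]`. The wrapping step is correct either way; it simply preserves whatever the overflow was given upstream.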

Fix

At each of the three sites, iterate get_overflowing_mut() and apply the same set_type_ids (and set_sequence_id where applicable) the main encoding gets.
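The shape of the change can be sketched with a toy stand-in for `Encoding`. This is an illustration only: the struct below is hypothetical and keeps just the fields the sketch needs, its method names mirror the ones mentioned above, and `tag_with_overflows` is a made-up helper rather than anything in the crate.

```rust
// Toy stand-in for the crate's `Encoding` (hypothetical, heavily reduced).
#[derive(Debug, Clone, PartialEq)]
struct Encoding {
    ids: Vec<u32>,
    type_ids: Vec<u32>,
    sequence_id: Option<usize>,
    overflowing: Vec<Encoding>,
}

impl Encoding {
    fn new(ids: Vec<u32>) -> Self {
        // Fresh encodings start with type id 0 everywhere.
        let type_ids = vec![0; ids.len()];
        Encoding { ids, type_ids, sequence_id: None, overflowing: Vec::new() }
    }
    fn len(&self) -> usize {
        self.ids.len()
    }
    fn set_type_ids(&mut self, type_ids: Vec<u32>) {
        self.type_ids = type_ids;
    }
    fn set_sequence_id(&mut self, sequence_id: usize) {
        self.sequence_id = Some(sequence_id);
    }
    fn get_overflowing_mut(&mut self) -> &mut Vec<Encoding> {
        &mut self.overflowing
    }
}

/// Hypothetical helper: give the main encoding and each of its overflows
/// the same type id and sequence id.
fn tag_with_overflows(encoding: &mut Encoding, type_id: u32, sequence_id: usize) {
    let n = encoding.len();
    encoding.set_type_ids(vec![type_id; n]);
    encoding.set_sequence_id(sequence_id);
    // The step the three sites were missing: visit the overflows too.
    for overflow in encoding.get_overflowing_mut().iter_mut() {
        let n = overflow.len();
        overflow.set_type_ids(vec![type_id; n]);
        overflow.set_sequence_id(sequence_id);
    }
}
```

For example, calling `tag_with_overflows(&mut b, 1, 1)` on a second-sequence encoding `b` with one overflow leaves both `b` and that overflow with type id 1 and sequence id 1.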

The pre-existing template_processing_overflowing test had stale type_ids baked in that matched the buggy output — those expectations were the bug. Updated four [0, 0, 0, 0, 1] lines to [0, 0, 0, 1, 1] and one [0, 0, 0, 0, 0, 1] to [0, 0, 0, 0, 1, 1].

Tests

Two new regressions:

  • processors::template::tests::template_processing_overflow_type_ids — sentence pair with non-zero $B:1 type_id and pre-existing overflows on both A and B sides, asserts each overflow's type_ids matches its source sequence.
  • processors::roberta::tests::roberta_overflow_type_ids — overflow starting with type_id=1 must be coerced to 0 like the main encoding.
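The property the Roberta regression checks can be sketched on a reduced model. This is not the test in the PR: the `Encoding` struct here is a hypothetical two-field stand-in, and `zero_type_ids` is an illustrative helper name.

```rust
// Hypothetical reduced model: only type ids and overflows.
#[derive(Debug, Clone, PartialEq)]
struct Encoding {
    type_ids: Vec<u32>,
    overflowing: Vec<Encoding>,
}

/// Illustrative helper: force type id 0 on the main encoding
/// and on each of its overflows.
fn zero_type_ids(encoding: &mut Encoding) {
    encoding.type_ids.iter_mut().for_each(|t| *t = 0);
    for overflow in encoding.overflowing.iter_mut() {
        overflow.type_ids.iter_mut().for_each(|t| *t = 0);
    }
}
```

A check in the spirit of the regression builds an encoding whose overflow starts with `type_ids = [1, 1]`, applies `zero_type_ids`, and asserts that both the main encoding and the overflow end up all zeros.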

When truncation produces overflowing encodings before post-processing
runs, the post-processor only updated `type_ids` / `sequence_id` on the
main encoding, leaving overflows with stale values inherited from
`encode_single_sequence`. Downstream code that reads the overflows back
(token classification, span extraction, sliding-window inference) saw
mismatched type_ids for the second sequence.

Three spots needed the same fix:

- `PostProcessor::process` default impl set `type_ids` on the main
  encoding but only `sequence_id` on overflows.
- `TemplateProcessing::apply_template` set `type_ids` / `sequence_id`
  from the template piece but never iterated overflows.
- `RobertaProcessing::process_encodings` forced `type_id=0` on the main
  encoding only, so pair-sequence overflows kept `type_id=1`.

The pre-existing `template_processing_overflowing` test had stale
expected `type_ids` that reflected the buggy output; updated to the
corrected values. Added focused regression tests in both
`processors/template.rs` and `processors/roberta.rs`.

Closes huggingface#1908.

Development

Successfully merging this pull request may close these issues.

TemplateProcessing does not apply type_id to overflow encodings
