ci: pin benchmark jobs to a larger dedicated runner by ArthurZucker · Pull Request #2084 · huggingface/tokenizers

ArthurZucker · 2026-06-02T19:45:38Z

Why

The Benchmarks workflow runs on the shared ubuntu-latest (2 vCPU) runner. Those timings drift ~10–35% run-to-run, which pollutes the baselines stored on hf-internal-testing/tokenizers-bench.

Concretely, the baseline at #2074 (Avoid full vocab clone in get_vocab_size()) showed +8–17% "regressions" across the batch and serialization benches, while single-encode paths got faster in the same run. That split (encode down, everything else up) is a runner-noise fingerprint, not a real regression:

- The entire diff of #2074 is get_vocab_size() (plus the Node binding), and get_vocab_size is never called inside any encode / decode / serialization loop, only during add_tokens/training setup.
- A same-machine local A/B of #2074 vs its parent put nearly everything within noise. The few flagged benches (bpe-from-file, from-file-llama3, encode-char-offsets) exercise code paths that are byte-for-byte identical between the two commits, so they cannot have regressed from this change.
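
The reasoning above can be sketched as a small check: compare each benchmark's percentage delta against a run-to-run noise band, and look for the mixed-sign pattern (some benches slower, others faster). This is only an illustration, not part of the workflow or the benchmark tooling. The function names, the benchmark labels, and the -5% figure are hypothetical; only the +8% and +17% endpoints and the 10–35% drift range come from the description above.

```python
def classify_deltas(deltas_pct, noise_pct):
    """Label each benchmark delta relative to a run-to-run noise band (percent)."""
    labels = {}
    for name, delta in deltas_pct.items():
        if abs(delta) <= noise_pct:
            labels[name] = "within noise"
        elif delta > 0:
            labels[name] = "slower"
        else:
            labels[name] = "faster"
    return labels


def has_mixed_signs(deltas_pct):
    """True when some benchmarks got slower while others got faster."""
    values = list(deltas_pct.values())
    return any(d > 0 for d in values) and any(d < 0 for d in values)


# Illustrative deltas only: +8 and +17 are the ends of the range quoted above;
# the -5 for single-encode is a made-up placeholder (the PR gives no figure).
example_deltas = {
    "batch (hypothetical)": 8.0,
    "serialization (hypothetical)": 17.0,
    "single-encode (hypothetical)": -5.0,
}

if __name__ == "__main__":
    # Try both ends of the quoted 10-35% drift as the noise band.
    for noise in (10.0, 35.0):
        print(noise, classify_deltas(example_deltas, noise))
    print("mixed signs:", has_mixed_signs(example_deltas))
```

With the upper end of the quoted drift as the band, every delta in this example falls within noise. With the lower end, only the largest one stands out, which is why the code-path argument in the two bullets is still needed to rule out a real regression.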

What

Pin both the Rust (benchmark) and Python (benchmark-python) jobs to ubuntu-latest-8-cores for stable, comparable baselines. The orchestration-only benchmark-trigger job stays on ubuntu-latest.

⚠️ Before merge

The ubuntu-latest-8-cores label must be provisioned in the org's Settings → Actions → Runners; otherwise the jobs will queue indefinitely. If the available larger-runner label is named differently, change the two runs-on: lines accordingly.

The Benchmarks workflow ran on the shared `ubuntu-latest` (2 vCPU) runner, whose timings vary run-to-run by ~10-35%. That noise produced spurious "regressions" in the stored baselines on hf-internal-testing/tokenizers-bench. For example, #2074 showed +8-17% across batch/serialization benches while single-encode got faster: a runner artifact, not a code change, since that PR only touches get_vocab_size, which is off every measured hot path.

Pin both the Rust and Python benchmark jobs to `ubuntu-latest-8-cores` for stable, comparable baselines.

NOTE: the `ubuntu-latest-8-cores` label must be provisioned in the org Actions runner settings; adjust if the available label differs.

HuggingFaceDocBuilderDev · 2026-06-02T19:48:56Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

LGTM

ArthurZucker · 2026-06-02T19:52:05Z

/benchmark

ArthurZucker wants to merge 1 commit into main from ci/pin-bench-larger-runner
