ci: pin benchmark jobs to a larger dedicated runner #2084

Open
ArthurZucker wants to merge 1 commit into main from ci/pin-bench-larger-runner

@ArthurZucker
Collaborator

Why

The Benchmarks workflow runs on the shared ubuntu-latest (2 vCPU) runner. Timings on that runner drift ~10–35% run to run, which pollutes the baselines stored on hf-internal-testing/tokenizers-bench.

Concretely, the baseline at #2074 (Avoid full vocab clone in get_vocab_size()) showed +8–17% "regressions" across the batch and serialization benches, while single-encode paths simultaneously got faster. That split (encode down, everything else up) is a runner-noise fingerprint, not a real regression:

  • The entire diff of #2074 is get_vocab_size() (+ the node binding), and get_vocab_size is never called inside any encode / decode / serialization loop, only during add_tokens/training setup.
  • A same-machine local A/B of #2074 vs its parent put nearly everything within noise; the few flagged benches (bpe-from-file, from-file-llama3, encode-char-offsets) exercise code paths that are byte-for-byte identical between the two commits, so they cannot have regressed from this change.

What

Pin both the Rust (benchmark) and Python (benchmark-python) jobs to ubuntu-latest-8-cores for stable, comparable baselines. The orchestration-only benchmark-trigger job stays on ubuntu-latest.
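
As a rough sketch of what the pinning amounts to, the fragment below shows only the `runs-on:` keys of the three jobs named above. The job names and runner labels come from this description; the surrounding structure is an assumption, and triggers, steps, and all other keys are omitted, so this is not the actual workflow file.

```yaml
# Illustrative fragment only: triggers, steps, and other job keys are omitted.
jobs:
  benchmark-trigger:
    runs-on: ubuntu-latest          # orchestration only, stays on the shared runner
  benchmark:
    runs-on: ubuntu-latest-8-cores  # Rust benchmarks, pinned to the larger runner
  benchmark-python:
    runs-on: ubuntu-latest-8-cores  # Python benchmarks, pinned to the larger runner
```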

⚠️ Before merge

The ubuntu-latest-8-cores label must be provisioned in the org's Settings → Actions → Runners. If your available larger-runner label is named differently, change the two runs-on: lines accordingly — otherwise the jobs will queue indefinitely.

The Benchmarks workflow ran on the shared `ubuntu-latest` (2 vCPU)
runner, whose timings vary run-to-run by ~10-35%. That noise produced
spurious "regressions" in the stored baselines on
hf-internal-testing/tokenizers-bench (e.g. #2074 showed +8-17% across
batch/serialization benches while single-encode got faster — a runner
artifact, not a code change, since that PR only touches get_vocab_size
which is off every measured hot path).

Pin both the Rust and Python benchmark jobs to `ubuntu-latest-8-cores`
for stable, comparable baselines.

NOTE: the `ubuntu-latest-8-cores` label must be provisioned in the org
Actions runner settings; adjust if the available label differs.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator Author

@ArthurZucker left a comment

LGTM

@ArthurZucker
Collaborator Author

/benchmark
