ci: pin benchmark jobs to a larger dedicated runner #2084
Open
ArthurZucker wants to merge 1 commit into
The Benchmarks workflow ran on the shared `ubuntu-latest` (2 vCPU) runner, whose timings vary run-to-run by ~10-35%. That noise produced spurious "regressions" in the stored baselines on hf-internal-testing/tokenizers-bench (e.g. #2074 showed +8-17% across batch/serialization benches while single-encode got faster — a runner artifact, not a code change, since that PR only touches get_vocab_size which is off every measured hot path). Pin both the Rust and Python benchmark jobs to `ubuntu-latest-8-cores` for stable, comparable baselines. NOTE: the `ubuntu-latest-8-cores` label must be provisioned in the org Actions runner settings; adjust if the available label differs.
Collaborator
Author
/benchmark
Why

The `Benchmarks` workflow runs on the shared `ubuntu-latest` (2 vCPU) runner. Those timings drift ~10–35% run-to-run, which pollutes the baselines stored on `hf-internal-testing/tokenizers-bench`.

Concretely, the baseline at #2074 (Avoid full vocab clone in `get_vocab_size()`) showed +8–17% "regressions" across the batch and serialization benches, while single-encode paths simultaneously got faster. That split (encode down, everything else up) is a runner-noise fingerprint, not a real regression:

- #2074 only touches `get_vocab_size()` (+ the node binding), and `get_vocab_size` is never called inside any encode / decode / serialization loop, only during `add_tokens` / training setup.
- Benches flagged as regressed (`bpe-from-file`, `from-file-llama3`, `encode-char-offsets`) exercise code paths that are byte-for-byte identical between the two commits, so they cannot have regressed from this change.

What

Pin both the Rust (`benchmark`) and Python (`benchmark-python`) jobs to `ubuntu-latest-8-cores` for stable, comparable baselines. The orchestration-only `benchmark-trigger` job stays on `ubuntu-latest`.

The `ubuntu-latest-8-cores` label must be provisioned in the org's Settings → Actions → Runners. If your available larger-runner label is named differently, change the two `runs-on:` lines accordingly; otherwise the jobs will queue indefinitely.
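
For reviewers who want to see the shape of the change without opening the diff, the job layout described above would look roughly like the fragment below. This is a sketch, not the actual workflow file: the job names and runner labels come from this description, while the triggers and step lists are elided and assumed to be unchanged.

```yaml
# Illustrative fragment only: triggers and steps are omitted (assumed unchanged).
jobs:
  benchmark-trigger:
    # Orchestration only, so timing noise does not matter here.
    runs-on: ubuntu-latest

  benchmark:
    # Rust benchmarks: pinned to the larger runner.
    # The label must exist under the org's Settings → Actions → Runners.
    runs-on: ubuntu-latest-8-cores

  benchmark-python:
    # Python benchmarks: same runner label, so both baselines come from the same runner type.
    runs-on: ubuntu-latest-8-cores
```

If the org exposes its larger runner under a different label, the two `ubuntu-latest-8-cores` values are the only lines that would need to change.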