ci: integrate CodSpeed continuous benchmarking #9975

Draft
adriangb wants to merge 1 commit into apache:main from pydantic:codspeed-integration

@adriangb
Contributor

Summary

Wires the existing criterion benches in this workspace into CodSpeed for continuous performance tracking. CodSpeed runs benches under CPU simulation in CI and posts per-PR comparison reports against the most recent run on the base branch (main).

PR benchmarking is opt-in once activated: the PR workflow only runs when a maintainer adds a bench:* label, so external contributors don't burn CI capacity by default. The main-push workflow keeps the baseline current.

The integration has been validated end-to-end on a fork (pydantic/arrow-rs): 3031 benchmarks captured from a single main run, PR runs produce clean comparison comments (e.g. "Merging this PR will not alter performance — ✅ 7 untouched benchmarks, ⏩ 3024 skipped benchmarks, comparing codspeed-smoke-test (5b1320a) with main (fcbe248)"). Public dashboard: https://codspeed.io/pydantic/arrow-rs

Design

Drop-in shim, no bench source changes

The criterion workspace dependency is renamed to codspeed-criterion-compat using the package key that Cargo provides for renaming dependencies. This is a CodSpeed-maintained passthrough: when not running under cargo codspeed, it forwards to real criterion, so cargo bench locally is unchanged and every existing use criterion::* in every bench source file compiles unmodified.

# Cargo.toml (workspace)
criterion = { package = "codspeed-criterion-compat", version = "4.6", default-features = false }

Sharded one job per [[bench]] target

Required for two reasons:

  1. The full workspace produces well over 1000 individual benchmarks (3031 in the fork validation run, because criterion parameterizes heavily), which exceeds CodSpeed's 1000-benchmark per-upload limit.
  2. Even the parquet crate alone exceeds 1000 — per-crate sharding wasn't fine enough.

Jobs within a single workflow are auto-aggregated by CodSpeed into one unified report.

Build once, run many

setup ─┐
       ├──→ bench (matrix, ~78 shards)
build ─┘
  • setup parses every workspace member's Cargo.toml for [[bench]] entries (awk + jq), emits a JSON {crate, bench} array; new bench targets are picked up automatically.
  • build runs the full-workspace cargo codspeed build exactly once, packs target/codspeed/ into a tarball (tar preserves the +x bit; actions/upload-artifact strips it otherwise), uploads as a 1-day artifact.
  • Each bench shard downloads the artifact, unpacks it, runs cargo codspeed run -p <crate> --bench <bench>. No per-shard rebuild — CI cost scales with N × ~2 min instead of N × full build.

Label-gated PRs

codspeed-pr.yml fires on pull_request: [labeled, synchronize, opened, reopened] and only runs when the PR has at least one bench:* label:

| Label | Effect |
| --- | --- |
| bench:all | Every [[bench]] in the workspace |
| bench:<crate> | Every [[bench]] in that crate |
| bench:<crate-a> bench:<crate-b> | Union |

Label suffixes are validated against ^[a-z][a-z0-9_-]*$. Authorization is implicit: only users with triage access or higher can add labels.

While the label is attached, every push to the PR re-runs the suite (synchronize event); re-runs cancel in-progress shards via concurrency: cancel-in-progress: true.
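
The label rules above can be sketched as a small selection function. This is an illustration under assumptions, not the workflow's code: the name `select_shards`, the bench names in `MATRIX`, and the choice to reject an invalid suffix with an error are all hypothetical. Only the `bench:` prefix, the `all` suffix, the union semantics, and the suffix pattern come from the description.

```python
import re

# Same pattern as in the description; fullmatch() supplies the ^...$ anchoring.
LABEL_SUFFIX = re.compile(r"[a-z][a-z0-9_-]*")


def select_shards(labels, matrix):
    """Return the matrix entries selected by the PR's bench:* labels."""
    suffixes = set()
    for label in labels:
        if not label.startswith("bench:"):
            continue  # unrelated label, ignored
        suffix = label[len("bench:"):]
        if LABEL_SUFFIX.fullmatch(suffix) is None:
            # Assumption: an invalid suffix is rejected outright.
            raise ValueError(f"invalid bench label: {label!r}")
        suffixes.add(suffix)
    if "all" in suffixes:
        return list(matrix)
    # Several crate labels select the union of their benches.
    return [entry for entry in matrix if entry["crate"] in suffixes]


# Hypothetical matrix entries for illustration only.
MATRIX = [
    {"crate": "arrow", "bench": "bench_a"},
    {"crate": "arrow-cast", "bench": "bench_b"},
    {"crate": "parquet", "bench": "bench_c"},
]

selected = select_shards(["bug", "bench:parquet", "bench:arrow-cast"], MATRIX)
print([entry["crate"] for entry in selected])
```

With no `bench:*` label the selection is empty, which corresponds to the workflow not running any shards.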

OIDC auth

Public repo, so no CODSPEED_TOKEN secret is required: the id-token: write permission lets the workflow request a GitHub OIDC token, and that token is what CodSpeed verifies. Workflows are repo-agnostic.

Exclusions

Ten bench targets currently fail at runtime in this workspace — pre-existing issues in the bench targets themselves, not the integration. They're listed in an EXCLUDED_BENCHES env in both workflows so the remaining ~78 shards run clean. Each excluded target should be fixed (or removed) and dropped from the list one by one:

| Target (crate / bench) | Observed failure mode |
| --- | --- |
| arrow / merge_kernels | panics at arrow-data/src/transform/primitive.rs:31:43 |
| arrow / buffer_bit_ops | runtime error |
| arrow / buffer_create | runtime error |
| arrow / sort_kernel | runtime error |
| arrow / string_run_builder | runtime error |
| arrow / primitive_run_accessor | runtime error |
| arrow-array / union_array | runtime error |
| arrow-cast / parse_date | runtime error |
| parquet / row_selection_cursor | runtime error |
| parquet-variant-compute / variant_kernels | intermittent |

I'm happy to file separate upstream issues for each if helpful — or to drop the exclusion list entirely if maintainers prefer to investigate them all at once. The same merge_kernels exclusion was added by the official CodSpeed wizard's auto-generated PR (https://codspeed.io/docs/get-started/wizard), so this is consistent prior art.
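
The exclusion step can be sketched as a filter over the bench matrix. The description does not state the format of EXCLUDED_BENCHES, so the whitespace-separated `crate/bench` entries below are an assumption, as are the function name `apply_exclusions` and the `parse_time` target.

```python
def apply_exclusions(matrix, excluded_benches):
    """Drop matrix entries listed in an EXCLUDED_BENCHES-style string.

    Assumed format: whitespace-separated "crate/bench" entries.
    """
    excluded = set(excluded_benches.split())
    return [
        entry
        for entry in matrix
        if f'{entry["crate"]}/{entry["bench"]}' not in excluded
    ]


# Two entries taken from the exclusion table above, in the assumed format.
EXCLUDED_BENCHES = """
arrow/merge_kernels
arrow-cast/parse_date
"""

# Hypothetical matrix: two excluded targets plus one that should survive.
MATRIX = [
    {"crate": "arrow", "bench": "merge_kernels"},
    {"crate": "arrow-cast", "bench": "parse_date"},
    {"crate": "arrow-cast", "bench": "parse_time"},
]

remaining = apply_exclusions(MATRIX, EXCLUDED_BENCHES)
print(remaining)
```

Keying on the crate and bench pair rather than the bench name alone avoids excluding an unrelated target that happens to share a name in another crate.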

Prerequisites for activation

This PR adds the workflow files but they're inert until two repo-admin actions land:

  1. Install the CodSpeed GitHub App on apache/arrow-rs. This is what posts the PR comparison comment + status check.
  2. Enroll the repository at https://codspeed.io. OIDC is automatic for public repos — no secret token configuration required.

Once both are done, the first push to main will populate the baseline and PRs labeled bench:* will receive automated comparison comments.

CI cost notes

  • Main-push workflow: 1 build + 78 shards. Build job dominates wallclock (~10 min); shards run in parallel and download from one artifact, ~2 min each.
  • PR workflow: same build, but only the bench shards for the labeled crates. A typical bench:arrow-cast run is build + 3 shards.
  • Per-target bench binaries are bundled in one ~1-2 GB artifact (well under GitHub's 5 GB free-tier limit).
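
As a rough check on these figures, the runner-minute totals can be worked out directly. This sketch uses only the approximate numbers quoted above (about 10 minutes for the build, about 2 minutes per shard) and ignores the setup job and artifact transfer time. The per-shard rebuild baseline is an assumption made for comparison, not a measured configuration.

```python
BUILD_MIN = 10      # approximate full-workspace build time quoted above
SHARD_RUN_MIN = 2   # approximate per-shard run time quoted above


def build_once_minutes(shards: int) -> int:
    """Total runner-minutes when one build job feeds every shard."""
    return BUILD_MIN + shards * SHARD_RUN_MIN


def rebuild_per_shard_minutes(shards: int) -> int:
    """Total runner-minutes if every shard rebuilt the workspace itself (assumed baseline)."""
    return shards * (BUILD_MIN + SHARD_RUN_MIN)


main_push = build_once_minutes(78)         # 10 + 78 * 2 = 166
naive = rebuild_per_shard_minutes(78)      # 78 * 12 = 936
pr_arrow_cast = build_once_minutes(3)      # 10 + 3 * 2 = 16

print(main_push, naive, pr_arrow_cast)
```

Under these assumptions a main-push run costs about 166 runner-minutes instead of about 936, and a typical bench:arrow-cast PR run costs about 16.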

Test plan

  • cargo check --workspace --benches --features arrow/test_utils,arrow-schema/ffi,parquet/test_common,parquet/experimental,parquet/async,parquet/object_store passes against this branch
  • End-to-end validation on pydantic/arrow-rs: main baseline run captured 3031 benchmarks; PR run posts comparison comment correctly; per-shard sharding stays under the 1000-benchmark limit
  • After merge and CodSpeed-App install on apache/arrow-rs, first main run populates baseline at https://codspeed.io/apache/arrow-rs
  • Create the bench:all and per-crate bench:<crate> labels in repo settings
  • Add bench:<crate> to a real PR; confirm comparison comment + status check appear

🤖 Generated with Claude Code

Wire the existing criterion benches into CodSpeed (https://codspeed.io)
for continuous performance tracking. CodSpeed runs benches under CPU
simulation in CI and posts per-PR comparison reports vs. the base
branch's latest main run.

Highlights
==========

- `criterion` workspace dependency renamed to
  `codspeed-criterion-compat`: a drop-in passthrough that wraps real
  criterion when running outside cargo-codspeed, so bench source code
  needs no changes (`use criterion::*` keeps working) and `cargo bench`
  locally is unaffected.

- Two workflows:

  - `.github/workflows/codspeed.yml` runs on every push to main and
    populates the base-branch baseline.

  - `.github/workflows/codspeed-pr.yml` runs on PRs when a `bench:*`
    label is attached, so external contributors don't blindly burn
    CI capacity. Labels are namespaced per crate:

        bench:all                       # whole workspace
        bench:arrow                     # all of arrow's benches
        bench:parquet bench:arrow-cast  # union

- Sharded one job per `[[bench]]` target (~78 shards after exclusions).
  Required because (a) the full workspace produces >1000 individual
  benchmarks per upload, and (b) the parquet crate alone produces >1000
  due to heavy criterion parameterization, both of which exceed
  CodSpeed's per-upload limit. Jobs in the same workflow are
  auto-aggregated by CodSpeed into a single report.
  Ref: https://codspeed.io/docs/features/sharded-benchmarks

- Build-once / run-many topology:

      setup ─┐
             ├──→ bench (matrix, N shards)
      build ─┘

  `build` does the full-workspace `cargo codspeed build` exactly once
  and uploads `target/codspeed/` as a tar artifact (tar preserves the
  +x bit, which `actions/upload-artifact` strips otherwise). Each
  bench shard downloads the artifact and invokes
  `cargo codspeed run -p <crate> --bench <bench>`. No rebuild per
  shard, so CI cost scales with N shards × ~2 min instead of ×10 min.

- Dynamic matrix: `setup` parses every workspace member's Cargo.toml
  for `[[bench]]` entries with awk + jq and emits a JSON `{crate,
  bench}` array, so new bench targets are picked up automatically
  without touching the workflow.

- Auth: GitHub OIDC. No `CODSPEED_TOKEN` secret needed for the public
  repo; the `id-token: write` permission lets the workflow request an
  OIDC token, which is what CodSpeed verifies.

Exclusions
==========

Ten bench targets currently fail at runtime (e.g. `merge_kernels`
panics in `arrow-data/src/transform/primitive.rs:31`); these are
pre-existing issues in the benches themselves, not the integration.
They're listed in an `EXCLUDED_BENCHES` env in both workflows so the
remaining ~78 shards run clean. Each excluded target should be fixed
and removed from the list one by one.

Prerequisites for activation
============================

- Install the CodSpeed GitHub App on `apache/arrow-rs`:
  https://github.com/apps/codspeed
- Enroll the repository at https://codspeed.io
  (the OIDC integration is automatic for public repos; no secret
  token configuration required)

Once both are done, the first push to main will populate the baseline
and PRs labeled `bench:*` will receive automated CodSpeed comparison
comments.