Aron/delegation token propagation #223

Closed
aron-muon wants to merge 20 commits into buildbarn:main from aron-muon:aron/delegation-token-propagation

@aron-muon
Contributor

No description provided.

aron-muon and others added 20 commits May 13, 2026 11:09
Add a maximum_background field to FUSEMountConfiguration. When set, it
caps the FUSE asynchronous request queue (FUSE_INIT max_background); when
zero, go-fuse's default of 12 is used.

The default of 12 suits bb_clientd, but is too low for bb_worker, where
a single mount serves many concurrent actions and the kernel
CongestionThreshold (max_background * 3/4) is reached almost as soon as
multiple actions begin reading. Reads beyond the threshold queue in the
kernel in uninterruptible sleep (D state), so the worker appears wedged
while still reporting its actions as executing.

Recommended value for bb_worker is 1024.
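
The arithmetic behind that recommendation can be sketched as follows. This is a minimal illustration, assuming the 3/4 ratio and the default of 12 quoted above; the function and constant names are hypothetical and are not taken from go-fuse or Buildbarn.

```go
package main

import "fmt"

// goFuseDefaultMaxBackground is the default that the commit message
// attributes to go-fuse. The constant name is hypothetical.
const goFuseDefaultMaxBackground = 12

// effectiveMaxBackground mirrors the described semantics of the new
// field: zero means "fall back to the default".
func effectiveMaxBackground(configured int) int {
	if configured == 0 {
		return goFuseDefaultMaxBackground
	}
	return configured
}

// congestionThreshold applies the max_background * 3/4 ratio quoted in
// the commit message, using integer arithmetic.
func congestionThreshold(maxBackground int) int {
	return maxBackground * 3 / 4
}

func main() {
	for _, configured := range []int{0, 1024} {
		mb := effectiveMaxBackground(configured)
		fmt.Printf("maximum_background=%d: max_background=%d, congestion threshold=%d\n",
			configured, mb, congestionThreshold(mb))
	}
}
```

Under these assumptions the default leaves room for only 9 background requests before congestion, while 1024 raises that to 768.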
Redirect bb_worker_container_push from container_push_official's
ghcr.io/buildbarn/ pin to ghcr.io/aron-muon/bb-worker so we can publish
images built from this fork without colliding with upstream artifacts.

Not for upstream merge — this commit lives on the aron-muon fork only.
Picks up upstream commit d0c6f26 ("Forward termination signals to child
process", PR #332). Without it, bb_worker and bb_runner running as PID 1
in containers ignore SIGTERM, so the kubelet has to send SIGKILL after the
grace period, which kills in-flight actions and leaves Failed pods behind.

Validated locally: a Linux build of cmd/bb_worker now contains the
"Failed to forward signal %#v to child process" log message, confirming
the fix is linked into the binary.

Bumps both go.mod (gazelle's go_deps source) and the git_override in
MODULE.bazel (which actually controls bzlmod resolution).
Pulls in 55de026c372 ("program: exit cleanly when signal-to-self races")
from aron-muon/bb-storage. Without that, bb_worker pods exit with
code 1 on every clean SIGTERM shutdown: the time.Sleep(5) fallback
in terminateWithSignal evaluates to 5 nanoseconds and fires before
the signal-to-self can be delivered.
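
The unit pitfall described above can be reproduced in isolation. This sketch only demonstrates how Go interprets an untyped constant passed to time.Sleep; it is not the bb-storage code, and the constant names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// bareFive is what time.Sleep(5) receives: the untyped constant 5 is
// converted to time.Duration(5), and a Duration counts nanoseconds.
const bareFive time.Duration = 5

// fiveSeconds shows the explicit-unit form needed for a delay of
// five seconds.
const fiveSeconds = 5 * time.Second

func main() {
	fmt.Println("time.Sleep(5) sleeps for:", bareFive)
	fmt.Println("time.Sleep(5 * time.Second) sleeps for:", fiveSeconds)
}
```

Because the bare literal compiles without any warning, this kind of bug tends to surface only as a timing race at runtime.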

Revert this back to the upstream HEAD once
https://github.com/buildbarn/bb-storage/pull/<TBD> lands and is released.
Redirect bb_worker_container_push from container_push_official's
ghcr.io/buildbarn/ pin to ghcr.io/aron-muon/bb-worker so we can publish
images built from this fork without colliding with upstream artifacts.

Not for upstream merge — this commit lives on the aron-muon fork only.
The upstream script gated all stamp values behind GITHUB_ACTIONS=true
and used GNU date's --date "@<unix_ts>" syntax. That meant local
builds on macOS produced no BUILD_SCM_* values, so image_push targets
that template-expand them (e.g. //cmd/bb_worker:bb_worker_container_push)
failed with `function "BUILD_SCM_TIMESTAMP" not defined`.

Drop the GITHUB_ACTIONS guard and detect GNU vs BSD date so the same
script produces the same `${TIMESTAMP}-${SHA}` tag whether it runs
on a Linux CI runner or a macOS dev machine. No behavior change in
CI; macOS builds now get a usable stamp instead of an opaque template
error.
# Conflicts:
#	MODULE.bazel
#	pkg/filesystem/virtual/configuration/fuse_mount_enabled.go
#	pkg/proto/configuration/filesystem/virtual/virtual.pb.go
#	pkg/proto/configuration/filesystem/virtual/virtual.proto
rules_img's should_stamp() only checks templates.values() for {{...}}
placeholders, not tag_file content. With our tag template living in
stamped_tags.txt (via tag_file = ...), stamp files weren't being passed
to expand-template, causing 'function "BUILD_SCM_TIMESTAMP" not
defined' build failures. Set stamp = "enabled" to force inclusion.
… fix)

Real fix this time: drops the signal-raise dance entirely instead of
just retiming the fallback. The previous bump (55de026) didn't help
because Go's runtime.dieFromSignal exits 2 before our fallback
runs — confirmed via strace on staging.

Container exit code goes from 2 (Failed/Error) to 0 (Succeeded).
…bb-runner-installer

bb_runner_installer embeds the bb_runner binary, which links our
hotfixed bb-storage. Without a parallel push target the upstream
ghcr.io/buildbarn/bb-runner-installer is used, meaning the
bb-runner sidecar still exits 1 on graceful SIGTERM (the original
5-ns sleep race in terminateWithSignal) even though bb-worker now
exits 0. Mirrors the cmd/bb_worker/BUILD.bazel pattern.
…ment

Add a configurable channel for propagating per-build identity tokens
(e.g., bb-credential-broker delegation JWTs) from the Bazel client
through the scheduler to the worker's action environment, without
contaminating the Action digest or CAS.

Scheduler (P1): new `forward_request_headers` config field. When set,
the scheduler extracts the named gRPC metadata headers from incoming
Execute() calls, wraps each value in a ForwardedRequestHeader Any
message, and appends it to the existing AuxiliaryMetadata slice that
ships in DesiredState_Executing.

Worker (P2): new `forward_auxiliary_metadata_to_environment` config
field on RunnerConfiguration. When true, the worker decodes
ForwardedRequestHeader messages from inbound AuxiliaryMetadata and
injects them into the action's environment variables. Command-proto
variables (part of the Action digest) take precedence, so a client
cannot shadow digest-committed variables.

Both features default to off — behaviour is identical to upstream
when unconfigured.
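
The two halves described above can be sketched with plain maps. This is a simplified illustration, not the code in this PR: the real change wraps values in ForwardedRequestHeader messages inside AuxiliaryMetadata, whereas here the function names, the header-to-variable mapping, and the choice to use only the first header value are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// extractForwardedHeaders picks configured headers out of request
// metadata and renames them to environment variable names. The
// headerToEnv mapping is hypothetical; metadata keys are lowercased
// because gRPC metadata keys are lowercase.
func extractForwardedHeaders(md map[string][]string, headerToEnv map[string]string) map[string]string {
	forwarded := make(map[string]string)
	for header, envName := range headerToEnv {
		if values := md[strings.ToLower(header)]; len(values) > 0 {
			forwarded[envName] = values[0] // assumption: first value wins
		}
	}
	return forwarded
}

// mergeIntoEnvironment returns a new environment in which variables
// from the Command proto always win over forwarded ones, so a client
// cannot shadow digest-committed variables.
func mergeIntoEnvironment(commandEnv, forwarded map[string]string) map[string]string {
	merged := make(map[string]string, len(commandEnv)+len(forwarded))
	for name, value := range forwarded {
		merged[name] = value
	}
	for name, value := range commandEnv {
		merged[name] = value // overwrites any forwarded value
	}
	return merged
}

func main() {
	md := map[string][]string{
		"bb-delegation-jwt": {"token-from-client"},
		"x-path-override":   {"/attacker/bin"},
	}
	headerToEnv := map[string]string{
		"bb-delegation-jwt": "BB_DELEGATION_JWT",
		"x-path-override":   "PATH",
	}
	commandEnv := map[string]string{"PATH": "/usr/bin"}

	merged := mergeIntoEnvironment(commandEnv, extractForwardedHeaders(md, headerToEnv))
	fmt.Println("PATH =", merged["PATH"])
	fmt.Println("BB_DELEGATION_JWT =", merged["BB_DELEGATION_JWT"])
}
```

Copying the forwarded values first and the Command variables second keeps the precedence rule in one place and leaves both input maps untouched.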

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arn#222)

* fuse: expose maximum_background in FUSEMountConfiguration

Add a maximum_background field to FUSEMountConfiguration. When set, it
caps the FUSE asynchronous request queue (FUSE_INIT max_background); when
zero, go-fuse's default of 12 is used.

The default of 12 suits bb_clientd, but is too low for bb_worker, where
a single mount serves many concurrent actions and the kernel
CongestionThreshold (max_background * 3/4) is reached almost as soon as
multiple actions begin reading. Reads beyond the threshold queue in the
kernel in uninterruptible sleep (D state), so the worker appears wedged
while still reporting its actions as executing.

Recommended value for bb_worker is 1024.

* deps: bump bb-storage to upstream/main HEAD

Picks up commit d0c6f26 ("Forward termination signals to child process",
PR #332). Without that fix, processes running as PID 1 in containers
silently ignore SIGTERM and the kubelet has to fall back to SIGKILL
once the grace period expires. For bb_worker that means in-flight
actions are killed and the scheduler has to retry them on another worker.

Bumps both go.mod (gazelle's go_deps source) and the git_override
in MODULE.bazel that actually controls bzlmod resolution; they need
to stay in sync.

* proto: load proto_library from @protobuf//bazel

The @rules_proto//proto:defs.bzl source has been deprecated in favour
of @protobuf//bazel:proto_library.bzl in newer protobuf module versions.
This commit switches all proto_library load statements over.
- Regenerate bb_scheduler.pb.go, bb_worker.pb.go for new proto fields
- Check in generated forwarded_request_header.pb.go and gRPC stub
- Fix trailing double-space on forwardAuxiliaryMetadataToEnvironment
- Combine consecutive bool params in NewLocalBuildExecutor signature
- Alphabetize BUILD.bazel load statements in cmd/bb_worker

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…duler

Replace upstream's container_push_official (ghcr.io/buildbarn/) with a
fork-specific image_push target pointing at ghcr.io/aron-muon/bb-scheduler.
Same pattern as cmd/bb_worker. The target name is unchanged so the
existing workflow step works without modification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Regenerate virtual.pb.go (import reorder from protoc-gen-go)
- Fix 755→644 permissions on forwarded_request_header pb.go files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a new bb_runner decorator that exchanges a delegation JWT for real
upstream credentials via bb-credential-broker's /token endpoint, then
writes credential files (.netrc for MVP) into the action's input root
before delegating to the base LocalRunner.

The decorator reads BB_DELEGATION_JWT (configurable) from the action's
environment variables, removes it before the action spawns, and calls
the broker for each configured destination. If no JWT is present, the
request passes through unmodified (mixed-pool compatibility).
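
The read-and-strip step described above can be sketched as follows. This is an illustration under stated assumptions rather than the decorator itself: the environment is modelled as a plain map, the function name is hypothetical, and the broker call is omitted.

```go
package main

import "fmt"

// takeDelegationToken looks up the configured variable and returns its
// value together with a copy of the environment that no longer
// contains it. When the variable is absent, found is false and the
// copy is equivalent to the input (the pass-through case).
func takeDelegationToken(env map[string]string, name string) (token string, rest map[string]string, found bool) {
	rest = make(map[string]string, len(env))
	for key, value := range env {
		if key == name {
			token, found = value, true
			continue // never hand the JWT to the spawned action
		}
		rest[key] = value
	}
	return token, rest, found
}

func main() {
	env := map[string]string{
		"PATH":              "/usr/bin",
		"BB_DELEGATION_JWT": "example-token",
	}
	token, rest, found := takeDelegationToken(env, "BB_DELEGATION_JWT")
	fmt.Println("found:", found, "token:", token)
	fmt.Println("variables passed to the action:", len(rest))
}
```

Returning a copy instead of deleting in place keeps the caller's request untouched, which makes the pass-through path easy to reason about.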

New config proto: CredentialInjectionConfiguration on
ApplicationConfiguration (field 15) with destinations, broker_url,
and credential file specs. When unset, behaviour is identical to
upstream.

Credential files are written atomically (write-to-temp + rename) to
avoid partial reads by concurrent processes reading .netrc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aron-muon aron-muon closed this May 14, 2026