fix(runner): use K8s SA token for backend credential fetches (#1218)

markturansky wants to merge 47 commits into `main` from
## Conversation
## Summary
- Introduces end-to-end MPP OpenShift integration: ambient-api-server + ambient-control-plane (gRPC fan-out multiplexer), SDK (Go/TS/Python), CLI, runner, MCP server, and frontend changes
- Adds deployment manifests: base CP service, RBAC triad, mpp-openshift overlay, openshift-dev overlay, production image entries
- Adds spec/guide/context documentation system (They Write The Right Stuff process model) and Claude skills docs for api-server and gRPC dev

## Test plan
- [ ] `acpctl session create` + `acpctl session messages -f` against MPP integration environment
- [ ] `acpctl session events <id>` streams AG-UI events via gRPC fan-out
- [ ] `kustomize build components/manifests/overlays/mpp-openshift/` renders cleanly
- [ ] `kustomize build components/manifests/overlays/openshift-dev/` renders cleanly
- [ ] Runner tests: `cd components/runners/ambient-runner && python -m pytest tests/`
- [ ] CLI tests: `cd components/ambient-cli && make test`

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
- `build-control-plane` was defined in the Makefile but missing from the `build-all` target, causing it to be silently skipped during full builds and CI runs
- Adds `build-control-plane` to the `build-all` dependency chain alongside `build-mcp`

## Jira
RHOAIENG-55811

## Test plan
- [ ] `make build-all` completes and includes the control-plane image

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
…te RHOAIENG-55826 (#1094)

## Summary
- Remove deprecated `ProjectAgent`, `ProjectDocument`, and `Ignite` API surface across ambient-api-server, ambient-sdk (Go/Python/TypeScript), ambient-cli, and ambient-mcp
- Rename `ignite` → `start`, `ProjectAgent` → `Agent` throughout
- Remove `e_openshift_dev` environment and `openapi.projectAgents.yaml` spec

## Components Changed

| Component | Changes |
|-----------|---------|
| ambient-api-server | Remove ProjectAgent* models, IgniteRequest/Response, projectAgents OpenAPI spec; update agents/inbox/projects/roles/sessions plugins; remove e_openshift_dev |
| ambient-sdk (Go) | Delete ProjectAgentAPI, ProjectDocumentAPI, ignite types; update AgentAPI, client, inbox, session_messages |
| ambient-sdk (Python) | Delete _project_agent_api, _project_document_api, _session_check_in_api, _a_g_u_i_event_api, _agent_message_api |
| ambient-sdk (TypeScript) | Delete project_agent, project_document, session_check_in, a_g_u_i_event, agent_message modules |
| ambient-cli | Remove probe command, ag_ui.go; update agent/create/get/project/session/start |
| ambient-mcp | Update server.go |

## Jira
[RHOAIENG-55826](https://redhat.atlassian.net/browse/RHOAIENG-55826)

## Test plan
- [ ] ambient-api-server: `make test` passes
- [ ] ambient-sdk Go: `go build ./...` passes
- [ ] ambient-sdk TypeScript: `npm run build` passes
- [ ] ambient-cli: `go build ./...` passes

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
…olution] Automated merge of upstream/main into upstream/alpha.

Conflict markers are present in files below. A human must resolve these before merging this PR.

Commits from main not yet in alpha: 10
merge-base: f1f5bb1

Generated by scripts/rebase-main-to-alpha.sh on 20260330-123045.
…drift

- align frontend status-colors test with current error behavior
- align runner initial-prompt and grpc_transport tests with current implementation
- fix control-plane InboxMessages().List() call to new SDK signature
- fix mpp-openshift overlay: remove unsupported api-server flags, fix secret name refs
- fix install.sh: remove ambient-control-plane-token from copied secrets, handle SA token re-typing
- fix runner grpc_transport: handle None content in assistant MESSAGES_SNAPSHOT
- update ambient-pr-test skill to always build images (not rely on CI)
- add .idea/ to .gitignore

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
## Summary
Automated merge of `main` into `alpha`.

- Commits from `main` not yet in `alpha`: **10**
- merge-base: `f1f5bb1669ad1e1b2cc3cfd90fd439f7ee8609dc`
- Generated: 20260330-123045

## Review Instructions
1. Check for conflict markers (`<<<<<<<`) in changed files and resolve them.
2. Cherry-pick any alpha-specific fix commits onto this branch.
3. Verify CI passes.
4. Merge into `alpha` using **Create a merge commit** (not rebase).

Generated by `scripts/rebase-main-to-alpha.sh`.
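Step 1 of the review instructions (scanning changed files for leftover conflict markers) can be sketched as a small check. This is a hypothetical helper for illustration, not part of the repo's scripts:

```python
def find_conflict_markers(text: str) -> list[int]:
    """Return 1-based line numbers where a Git conflict block starts.

    Git writes conflict hunks as '<<<<<<< ours' ... '=======' ... '>>>>>>> theirs';
    matching the opening marker at column 0 is enough to flag a file for review.
    """
    return [
        i
        for i, line in enumerate(text.splitlines(), start=1)
        if line.startswith("<<<<<<<")
    ]
```

Running this over each file touched by the merge commit gives a quick pass/fail before CI.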
…55812) (#1107)

## Summary
- Adds `acpctl login --use-auth-code` for browser-based OAuth2 authorization code + PKCE login against Red Hat SSO (`redhat-external` realm)
- Replaces the requirement to manually paste a pre-obtained bearer token
- 18 unit tests covering PKCE generation, callback handler, token parsing, and error extraction

## What changed
**`components/ambient-cli/cmd/acpctl/login/authcode.go`** (new)
- Ephemeral loopback listener on a random port for the OAuth callback
- RFC 7636 PKCE S256 (crypto/rand, SHA-256, base64url)
- State parameter for CSRF protection
- `srv.Shutdown` deferred — runs on all exit paths including timeout
- RH SSO `error_description` extracted from non-200 token responses
- Token response parsed with `encoding/json` (not hand-rolled)
- No new dependencies — pure stdlib

**`components/ambient-cli/cmd/acpctl/login/cmd.go`** (modified)
- `--use-auth-code` flag (mutually exclusive with `--token`)
- `--issuer-url` (default: `https://sso.redhat.com/auth/realms/redhat-external`)
- `--client-id` (default: `ocm-cli` — TODO RHOAIENG-55817)
- `--client-secret` (never persisted to config)

**`components/ambient-cli/cmd/acpctl/login/authcode_test.go`** (new, 18 tests)

## Usage
```bash
# Browser-based login (new)
acpctl login --use-auth-code --url https://api.example.com

# Static token login (unchanged)
acpctl login --token <token> --url https://api.example.com
```

## Test plan
- [x] `go build ./...` passes
- [x] `go vet ./...` passes
- [x] `gofmt -l` clean
- [x] `golangci-lint run` — 0 issues
- [x] `go test ./cmd/acpctl/login/...` — 18/18 pass

## Related
- Jira: RHOAIENG-55812
- Depends on: RHOAIENG-55817 (register `acpctl` public OIDC client in redhat-external realm)

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
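The RFC 7636 S256 scheme mentioned above boils down to two derivations: a random verifier and a hashed challenge. The actual implementation is Go in `authcode.go`; this is a minimal Python sketch of the same math for illustration:

```python
import base64
import hashlib
import secrets

def generate_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) per RFC 7636 S256.

    - verifier: high-entropy random string of 43-128 unreserved chars (sec. 4.1);
      token_urlsafe(32) yields a 43-char base64url string.
    - challenge: base64url(SHA-256(verifier)) with padding stripped (sec. 4.2).
    """
    verifier = secrets.token_urlsafe(32)
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge
```

The authorization server recomputes the challenge from the verifier presented at the token endpoint, so an intercepted authorization code is useless without the verifier.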
## Summary
- Introduces `Credential` as a first-class Ambient Kind in the platform data model spec
- Adds `docs/internal/design/credentials-session.md` — full design spec with ERD, ownership model, API/CLI reference, usage examples, and open questions
- Updates `docs/internal/design/ambient-model.spec.md` to reflect the desired state: new Credential entity, `RoleBinding.scope` extended with `credential`, new roles (`credential:owner`, `credential:reader`), Credentials API and CLI sections

## What this is
This is a **spec-only PR** — no code changes. The goal is design review before implementation begins. The reconciler will use this spec as the desired state to surface implementation gaps.

Key design decisions captured:
- `Credential` is platform-scoped (not project/agent-scoped) to support shared Robot Accounts
- Ownership via `RoleBinding(scope=credential, role=credential:owner)` — consistent with Agent ownership pattern
- Token is write-only; never returned via standard REST API
- Scope hierarchy (agent → project → global) for credential resolution at session ignition
- Runner token endpoint shape is marked TBD (open question in the design doc)

## Test plan
- [ ] Design review — read `docs/internal/design/credentials-session.md`
- [ ] Verify ERD changes in `ambient-model.spec.md` are consistent with the design doc
- [ ] Answer open questions in `credentials-session.md` before implementation begins

Closes RHOAIENG-55817 (design phase)

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
…NTIAL_IDS into runner pods (RHOAIENG-55817)
Control plane (kube_reconciler.go):
- resolveCredentialIDs(): pages SDK credentials, builds provider→id map
- ensureCredentialRoleBindings(): grants credential:token-reader RoleBinding per credential to session SA
- buildEnv(): marshals credentialIDs map to JSON and injects as CREDENTIAL_IDS env var
- provisionSession(): wires credential resolution between ensureServiceAccount and ensurePod
Runner (platform/auth.py):
- _fetch_credential(): reads CREDENTIAL_IDS env var, calls new /credentials/{id}/token endpoint
- populate_runtime_credentials(): Jira apiToken→token, Google accessToken→token (SA JSON written to GOOGLE_APPLICATION_CREDENTIALS)
- Removed duplicate clear_runtime_credentials stub
Tests (test_shared_session_credentials.py):
- Updated 8 existing tests for new CREDENTIAL_IDS contract (removed PROJECT_NAME, added CREDENTIAL_IDS)
- Added test_returns_empty_when_no_credential_id_for_provider
Docs:
- control-plane.spec.md: removed 'other' from token response, updated Wave 5 status to implemented
- control-plane.guide.md: restructured with dev context reference, updated gap table
- control-plane-development.md: fixed operator→CP references, added CREDENTIAL_IDS section
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
"git add -A" stages all untracked content, including stray cloned repositories that happen to exist in the worktree. These get recorded as orphaned gitlinks, which break ArgoCD sync because the referenced submodule commits do not exist in the remote. "git add -u" limits staging to already-tracked files, which is the correct intent during conflict resolution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Remove orphaned `160000` gitlink entries for both `platform-api-server` and `platform-control-plane` that were accidentally introduced by the automated merge script in commit `adc3b9c2`
- There are no corresponding `.gitmodules` configurations, so these entries are dangling and cause kustomize/ArgoCD to fail when processing the repository tree

## Details
Both gitlinks pointed to commits in submodules that do not exist in this repository:

| Gitlink | Commit |
|---------|--------|
| `platform-api-server` | `936ea12b22ab15a12657f6aa89eeb0c19f41c191` |
| `platform-control-plane` | `4dae05c6ef8e4e4ef74bfe3c4b1c86467fb1f516` |

These broke deployments on the `alpha` branch because kustomize and ArgoCD cannot resolve the references.

Fixes #1130

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Replace `git add -A` with `git add -u` in `scripts/rebase-main-to-alpha.sh` (both occurrences, on lines 67 and 87 of the script)
- `git add -A` stages all untracked content, including stray cloned repositories in the worktree, which get recorded as orphaned gitlinks that break ArgoCD sync
- `git add -u` only stages changes to already-tracked files, which is the correct behavior during conflict resolution

Fixes #1131
- RoleBinding name now includes the session SA name to prevent subject collision when concurrent sessions in the same project share a provider (was credential-token-reader-{provider}, now {session-sa}-credential-{provider})
- clear_runtime_credentials() now removes the GOOGLE_APPLICATION_CREDENTIALS path in addition to the hardcoded workspace credentials path, preventing SA key files from leaking across turns when GOOGLE_APPLICATION_CREDENTIALS is set to a non-default path
- Simplified test_returns_empty_when_no_credential_id_for_provider to use monkeypatch instead of the redundant nested patch.dict + pop pattern
- Updated control-plane-development.md to clarify that credential scope filtering is server-side (RBAC), not implemented in resolveCredentialIDs
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
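The renamed scheme in the first bullet can be expressed as a one-liner. The real code is Go in `kube_reconciler.go`; this hypothetical helper just makes the collision fix concrete:

```python
def credential_rolebinding_name(session_sa: str, provider: str) -> str:
    """Per-session RoleBinding name: {session-sa}-credential-{provider}.

    Including the session SA keeps names unique when two concurrent sessions
    in the same project bind the same provider; the old scheme
    (credential-token-reader-{provider}) produced the same name for both
    sessions, so the second binding clobbered the first.
    """
    return f"{session_sa}-credential-{provider}"
```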
…NTIAL_IDS into runner pods (RHOAIENG-55817) (#1128)

## Summary
- **CP**: `resolveCredentialIDs()` pages SDK credentials and builds a `provider→id` map; `ensureCredentialRoleBindings()` grants `credential:token-reader` RoleBinding per credential to the session SA; `buildEnv()` injects `CREDENTIAL_IDS` JSON env var into runner pods
- **Runner**: `_fetch_credential()` migrated from legacy endpoint to `GET /api/ambient/v1/credentials/{id}/token` using `CREDENTIAL_IDS` map; Jira `apiToken`→`token`, Google `accessToken`→`token` (full SA JSON written to `GOOGLE_APPLICATION_CREDENTIALS`)
- **Tests**: 8 existing credential tests updated for new `CREDENTIAL_IDS` contract; 1 new test added; 29/29 credential tests passing, 622 total pass

## Credential Flow
```
Session start
  → CP calls sdk.Credentials().ListAll()
  → builds {"github": "id1", "jira": "id2"} map
  → grants credential:token-reader RoleBinding per provider to session SA
  → injects CREDENTIAL_IDS={"github": "id1", ...} into pod env

Runner start
  → reads CREDENTIAL_IDS
  → calls GET /api/ambient/v1/credentials/{id}/token per provider
  → sets GITHUB_TOKEN, GITLAB_TOKEN, JIRA_API_TOKEN, GOOGLE_APPLICATION_CREDENTIALS
```

## Files Changed

| Component | File | Change |
|---|---|---|
| CP | `kube_reconciler.go` | `resolveCredentialIDs`, `ensureCredentialRoleBindings`, `buildEnv` updated, `provisionSession` wired |
| Runner | `platform/auth.py` | `_fetch_credential` URL + field mapping; Google/Jira updated; dead stub removed |
| Tests | `test_shared_session_credentials.py` | Updated for `CREDENTIAL_IDS` contract; 1 new test |
| Docs | `control-plane.spec.md` | Removed `other` from token response; updated status to implemented |
| Docs | `control-plane.guide.md` | Restructured with dev context reference; gap table updated |
| Docs | `.claude/context/control-plane-development.md` | Fixed operator→CP refs; added `CREDENTIAL_IDS` section |

## Test plan
- [x] `cd components/runners/ambient-runner && python -m pytest tests/ -k credential` — 29/29 pass
- [x] `cd components/ambient-control-plane && go build ./... && go vet ./...` — clean
- [x] `gofmt -l internal/reconciler/kube_reconciler.go` — no output (clean)
- [x] `uv run ruff check .` — no errors
- [ ] Deploy to MPP cluster and verify `CREDENTIAL_IDS` injected into pod env
- [ ] Verify `credential-token-reader-{provider}` RoleBindings created per session
- [ ] Verify `GITLAB_TOKEN`/`GITHUB_TOKEN` set in running pod

🤖 Generated with [Claude Code](https://claude.ai/code)
…olution] Automated merge of upstream/main into upstream/alpha.

Conflict markers are present in files below. A human must resolve these before merging this PR.

Commits from main not yet in alpha: 12
merge-base: 595d790

Generated by scripts/rebase-main-to-alpha.sh on 20260402-100225.
## Summary
Automated merge of `main` into `alpha`.

- Commits from `main` not yet in `alpha`: **12**
- merge-base: `595d79011a81e828335e7ba8cd51d17a520c5f8b`
- Generated: 20260402-100225

## Review Instructions
1. Check for conflict markers (`<<<<<<<`) in changed files and resolve them.
2. Cherry-pick any alpha-specific fix commits onto this branch.
3. Verify CI passes.
4. Merge into `alpha` using **Create a merge commit** (not rebase).

Generated by `scripts/rebase-main-to-alpha.sh`.
…route

The ambient-api-server Route in the mpp-openshift overlay had a hardcoded spec.host pointing to the preprod cluster ingress. Removing spec.host lets OpenShift's router auto-assign the correct hostname for the target environment.

Fixes: RHOAIENG-56570

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…route (#1157)

## Summary
- Removes the hardcoded `spec.host` field from `components/manifests/overlays/mpp-openshift/ambient-api-server-route.yaml`
- The hardcoded value (`ambient-api-server-ambient-code--runtime-int.internal-router-shard.mpp-w2-preprod.cfln.p1.openshiftapps.com`) pointed at the preprod cluster ingress, causing the route to advertise the wrong hostname when deployed to other environments (e.g., dev)
- OpenShift's router will now auto-assign the correct hostname based on the cluster's router configuration

## Test plan
- [ ] Verify route hostname auto-assigns correctly after applying to dev cluster
- [ ] Confirm `oc get route ambient-api-server -n ambient-code--ambient-s0` shows the correct dev hostname

Jira: [RHOAIENG-56570](https://redhat.atlassian.net/browse/RHOAIENG-56570)

🤖 Generated with [Claude Code](https://claude.ai/code)
…port
installServiceCAIntoDefaultTransport replaced http.DefaultTransport with a
bare &http.Transport{TLSClientConfig: ...} that had no Proxy field set.
Go's net/http silently ignores HTTPS_PROXY/HTTP_PROXY env vars when the
transport's Proxy field is nil, causing all outbound connections to go
direct instead of through the cluster egress proxy.
This manifested as the OIDC token fetch to sso.redhat.com timing out after
~9 minutes (raw TCP connect timeout) despite the proxy env vars being
present on the pod.
Fix: set Proxy: http.ProxyFromEnvironment and restore the standard
DefaultTransport dialer/timeout fields that the bare struct initializer
was silently zeroing out.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
…port (#1162)

## Summary
- `installServiceCAIntoDefaultTransport` replaced `http.DefaultTransport` with a bare `&http.Transport{TLSClientConfig: ...}` — no `Proxy` field set
- Go's `net/http` silently ignores `HTTPS_PROXY`/`HTTP_PROXY` env vars when the transport's `Proxy` field is `nil`, causing all outbound connections to go direct
- This caused the OIDC token fetch to `sso.redhat.com` to time out after ~9 minutes (raw TCP connect timeout) despite proxy env vars being present on the pod

## Root cause
```go
// Before — Proxy field absent → env vars silently ignored
http.DefaultTransport = &http.Transport{
	TLSClientConfig: &tls.Config{...},
}

// After — proxy wired + standard DefaultTransport fields restored
http.DefaultTransport = &http.Transport{
	Proxy:                 http.ProxyFromEnvironment,
	DialContext:           (&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext,
	ForceAttemptHTTP2:     true,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
	TLSClientConfig:       &tls.Config{MinVersion: tls.VersionTLS12, RootCAs: pool},
}
```

The dialer/timeout fields were also zeroed out by the bare struct initializer, which degraded connection pooling and timeout behavior for all HTTP calls.

## Test plan
- [ ] Deploy updated control-plane image to MPP dev cluster
- [ ] Confirm OIDC token fetch succeeds (no 9-minute timeout in logs)
- [ ] Verify `component: oidc-token-provider` log shows "OIDC token acquired" within seconds

🤖 Generated with [Claude Code](https://claude.ai/code)
…ft kustomization

The RBAC Role/RoleBinding granting the control-plane SA get/list/watch/create/delete on tenantnamespaces.tenant.paas.redhat.com in ambient-code--config already existed but was never referenced in kustomization.yaml, causing Forbidden errors when the MPP provisioner tried to manage TenantNamespace CRs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ft kustomization (#1167)

## Summary
- `ambient-control-plane-rbac.yaml` already existed in the mpp-openshift overlay with the correct `Role`/`RoleBinding` granting the CP service account `get/list/watch/create/delete` on `tenantnamespaces.tenant.paas.redhat.com` in `ambient-code--config`
- The file was never listed in `kustomization.yaml`, so it was never applied — causing `Forbidden` errors when the `MPPNamespaceProvisioner` tried to manage `TenantNamespace` CRs
- Fix: add `- ambient-control-plane-rbac.yaml` to the `resources:` list

## Root Cause
Error observed after PR #1162 merged:
```
tenantnamespaces.tenant.paas.redhat.com "test" is forbidden: User "system:serviceaccount:ambient-code--ambient-s0:ambient-control-plane" cannot get resource "tenantnamespaces" in API group "tenant.paas.redhat.com" in the namespace "ambient-code--config"
```

## Test plan
- [ ] Apply kustomize overlay to MPP cluster and confirm no Forbidden errors on `tenantnamespaces` operations
- [ ] CP pod logs should show successful project namespace provisioning

🤖 Generated with [Claude Code](https://claude.ai/code)
…cement

- Remove duplicated ClusterRole/ClusterRoleBinding from mpp-openshift overlay (already covered by base/rbac/control-plane-clusterrole.yaml)
- Keep only the MPP-specific Role/RoleBinding for tenantnamespaces.tenant.paas.redhat.com
- Add Kustomize replacement to inject the overlay namespace into subjects[0].namespace so any overlay deploying to a different namespace automatically binds the correct SA

Previously the subject was hardcoded to ambient-code--runtime-int, causing Forbidden errors when the CP runs in a different namespace (e.g. ambient-code--ambient-s0).

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…cement (#1168)

## Summary
Follow-up to #1167. The wired-in RBAC had two issues:

1. **Wrong subject namespace**: `subjects[0].namespace` was hardcoded to `ambient-code--runtime-int`, but the CP runs in whatever namespace the overlay deploys to. When deployed to `ambient-code--ambient-s0`, the binding was silently wrong.
2. **Duplicate ClusterRole/ClusterRoleBinding**: The overlay had its own `ClusterRole`/`ClusterRoleBinding` duplicating what `base/rbac/control-plane-clusterrole.yaml` already provides.

## Fix
- Remove the duplicated `ClusterRole`/`ClusterRoleBinding` from `ambient-control-plane-rbac.yaml`
- Keep only the MPP-specific `Role`/`RoleBinding` for `tenantnamespaces.tenant.paas.redhat.com`
- Add a Kustomize `replacement` that sources `subjects[0].namespace` from the `ambient-control-plane` ServiceAccount's `metadata.namespace` — which Kustomize automatically rewrites to match the overlay's `namespace:` field. Any future overlay deploying to a different namespace gets the correct binding automatically, with zero duplication.

## Verification
```
kustomize build components/manifests/overlays/mpp-openshift/
# RoleBinding subjects[0].namespace == ambient-code--runtime-int ✓
```

## Test plan
- [ ] Apply to MPP cluster and confirm no Forbidden errors on `tenantnamespaces` operations
- [ ] CP pod logs show successful project namespace provisioning

🤖 Generated with [Claude Code](https://claude.ai/code)
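A Kustomize `replacements` stanza of the kind described above typically looks like the following. This is a sketch, not the overlay's actual contents; the target RoleBinding name is a hypothetical placeholder:

```yaml
replacements:
  - source:
      kind: ServiceAccount
      name: ambient-control-plane
      fieldPath: metadata.namespace   # rewritten by Kustomize to the overlay's namespace:
    targets:
      - select:
          kind: RoleBinding
          name: ambient-control-plane-tenant-rbac   # hypothetical name
        fieldPaths:
          - subjects.0.namespace
```

Sourcing from the ServiceAccount's own `metadata.namespace` is what makes the binding track the overlay namespace without hardcoding.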
…ace transformer

Remove namespace: ambient-code--runtime-int from kustomization.yaml and set it explicitly on each resource file instead. This allows cross-namespace resources (Role/RoleBindings in ambient-code--config) to coexist in the same overlay without being overwritten by the Kustomize namespace transformer.

Per-sector RoleBindings for tenantnamespaces.tenant.paas.redhat.com:
- ambient-control-plane-rbac-runtime-int.yaml: binds ambient-code--runtime-int SA
- ambient-control-plane-rbac-s0.yaml: binds ambient-code--ambient-s0 SA

Both grant get/list/watch/create/delete on tenantnamespaces in ambient-code--config. Adding ambient-code--s1 in future requires only a new ambient-control-plane-rbac-s1.yaml.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ace transformer (#1171)

## Summary
- Removes `namespace: ambient-code--runtime-int` from `kustomization.yaml` — the overlay namespace transformer was overwriting `ambient-code--config` on the tenant RBAC resources
- Adds `namespace: ambient-code--runtime-int` explicitly to each resource file that needs it
- Moves tenant RBAC into `tenant-rbac/` sub-kustomization so it can live in `ambient-code--config` cleanly
- Adds per-sector RoleBindings for `tenantnamespaces.tenant.paas.redhat.com`:
  - `ambient-control-plane-rbac-runtime-int.yaml` — binds `ambient-code--runtime-int:ambient-control-plane`
  - `ambient-control-plane-rbac-s0.yaml` — binds `ambient-code--ambient-s0:ambient-control-plane`
- Adding s1 (or any future sector) requires only a new `ambient-control-plane-rbac-s1.yaml`

## Verification
```
kustomize build components/manifests/overlays/mpp-openshift/
# Role + RoleBindings → namespace: ambient-code--config ✓
# s0 subject → namespace: ambient-code--ambient-s0 ✓
# runtime-int subject → namespace: ambient-code--runtime-int ✓
# All other resources → namespace: ambient-code--runtime-int ✓
```

## Test plan
- [ ] Apply to MPP cluster — CP pod in `ambient-code--ambient-s0` can get/create/delete `tenantnamespaces` in `ambient-code--config`

🤖 Generated with [Claude Code](https://claude.ai/code)
When PROJECT_KUBE_TOKEN_FILE is set, the project kube client carries the TSA identity (tenantaccess-ambient-control-plane), which already has namespace-admin RoleBindings in ambient-code--config. The main in-cluster SA does not have access to tenantnamespaces.tenant.paas.redhat.com.

Pass projectKube to buildNamespaceProvisioner when available so the MPP TenantNamespace provisioner uses the TSA token rather than the pod SA token.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…#1175)

## Summary
When `PROJECT_KUBE_TOKEN_FILE` is set, the project kube client carries the TSA identity (`tenantaccess-ambient-control-plane`), which already has `namespace-admin` RoleBindings in `ambient-code--config` — including access to `tenantnamespaces.tenant.paas.redhat.com`. The main in-cluster SA does not.

`buildNamespaceProvisioner` was receiving `kube` (the pod SA identity) unconditionally. The MPP `TenantNamespace` provisioner then failed with:
```
tenantnamespaces.tenant.paas.redhat.com "test" is forbidden: User "system:serviceaccount:ambient-code--ambient-s0:ambient-control-plane" cannot get resource "tenantnamespaces" in API group "tenant.paas.redhat.com" in the namespace "ambient-code--config"
```

## Fix
Three lines: prefer `projectKube` over `kube` when building the provisioner, since that *is* the identity with the right permissions already in place.

## Test plan
- [ ] Deploy to MPP cluster with `PROJECT_KUBE_TOKEN_FILE` set — CP logs show successful `TenantNamespace` get/create/delete without Forbidden errors

🤖 Generated with [Claude Code](https://claude.ai/code)
The tenant-operator materializes TenantNamespace CRs as ambient-code--<id>, not ambient-code--z-<id>. The hardcoded z- prefix caused ProvisionNamespace to wait 60s for a namespace that never appeared.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
## Summary
- Fixes hardcoded `ambient-code--z-` prefix in `MPPNamespaceProvisioner` — the tenant-operator materializes `TenantNamespace` CRs as `ambient-code--<id>`, not `ambient-code--z-<id>`
- With the wrong prefix, `ProvisionNamespace` would create the `TenantNamespace` CR correctly but then poll for `ambient-code--z-<id>`, which never appeared, timing out after 60s every time

## Test plan
- [ ] Create a session on MPP cluster — namespace provisioned as `ambient-code--<project-id>` and becomes active without timeout

🤖 Generated with [Claude Code](https://claude.ai/code)
…eanup + CLI credential verbs (#1181)

## Summary
- **fix(control-plane):** Remove `ensureCredentialRoleBindings` from `kube_reconciler.go` — this was creating K8s `RoleBinding` objects referencing a non-existent `credential:token-reader` ClusterRole, blocking session provisioning. The runner authenticates via `BOT_TOKEN` (a control-plane JWT injected as a secret), not a K8s SA token, so the binding was vestigial and served no purpose.
- **fix(control-plane):** `project_reconciler.go` `EventDeleted` now calls `DeprovisionNamespace` instead of logging "namespace retained for safety" (a deliberate no-op that was never wired up).
- **feat(cli):** Wire `credentials` into generic `acpctl get/delete/describe` verbs (was returning "unknown resource type").
- **feat(cli):** Add `kind: Credential` support to `acpctl apply`.
- **feat(cli):** Add `-o json` to `acpctl agent start`.
- **feat(cli):** Add `demo-github.sh` — end-to-end GitHub credential demo script alongside `demo-kind.sh`.

## Test plan
- [ ] Start a session with a credential bound to an agent — should no longer fail with `clusterroles.rbac.authorization.k8s.io "credential:token-reader" not found`
- [ ] Delete a project — namespace should be deprovisioned (previously retained indefinitely)
- [ ] `acpctl get credentials` / `acpctl describe credential <id>` / `acpctl delete credential <id>` work as generic verbs
- [ ] `acpctl apply` with a `kind: Credential` YAML creates/patches the credential
- [ ] `acpctl agent start <agent> -o json` returns JSON session object
- [ ] `./components/ambient-cli/demo-github.sh` runs end-to-end with a GitHub PAT

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
…informer (#1182)

## Summary
- Transient errors (e.g. a TSA RoleBinding not yet propagated when the project reconciler runs `ensureRunnerSecrets`) caused events to be permanently dropped — the informer logged the error and moved on with no retry.
- Add a `retryLoop` goroutine alongside `dispatchLoop`. Failed handlers are requeued onto a buffered `retryCh` with exponential backoff: 2s → 4s → 8s → 16s → 30s (cap).
- After `retryMaxAttempts` (5) the error is logged as permanent.
- Fixes the race where a newly-created project namespace's TSA `RoleBinding` isn't propagated by the time `ensureRunnerSecrets` runs.

## Test plan
- [ ] Create a new project — CP logs should show `namespace provisioned` and `ambient-runner-secrets created` without permanent failure
- [ ] If a transient forbidden error occurs, CP logs should show `handler failed, will retry` with `attempt` and `retry_in` fields, followed by eventual success
- [ ] After 5 failed attempts, CP logs `handler failed after max retries`

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
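The quoted backoff schedule (2s → 4s → 8s → 16s → 30s cap) is a plain exponential series with a ceiling. A sketch of the delay computation, since the Go `retryLoop` internals are not shown in the PR description:

```python
def retry_delay(attempt: int, base: float = 2.0, cap: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped.

    Attempts 0..4 yield 2s, 4s, 8s, 16s, 30s (the uncapped 32s hits the 30s cap),
    matching the schedule in the PR summary.
    """
    return min(base * (2 ** attempt), cap)
```

After `retryMaxAttempts` (5) the handler stops requeuing and logs the error as permanent.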
…nMessages (#1183) ## Summary - Runner's BOT_TOKEN is an OIDC JWT with `preferred_username=service-account-ocm-ams-service` - `AMBIENT_API_TOKEN` is not set on the api-server deployment → `IsServiceCaller` is always false - Runner's JWT was parsed as a regular user → ownership check fired → `PERMISSION_DENIED: not authorized to watch this session` - Fix: read `GRPC_SERVICE_ACCOUNT` env var at startup into a package-level var; bypass ownership enforcement when the authenticated username matches that value ## Test plan - [ ] Deploy updated api-server image - [ ] Run `demo-github.sh` — runner pod should connect to gRPC stream and execute the task without PERMISSION_DENIED - [ ] Verify session reaches `Completed` phase 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added service account authentication support for gRPC operations * Introduced automatic token refresh for long-running sessions * **Bug Fixes** * Fixed unnecessary credential patch emissions when values haven't changed * Improved event handler retry mechanisms for better reliability <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Ambient Code Bot <bot@ambient-code.local> Co-authored-by: Claude <noreply@anthropic.com>
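The ownership bypass described above — skip enforcement when the authenticated username matches the configured `GRPC_SERVICE_ACCOUNT` — can be sketched like this. It is a hedged Python illustration of a check that actually lives in the Go api-server; the function and parameter names are hypothetical:

```python
# Hypothetical sketch of the WatchSessionMessages ownership check with the
# service-account bypass described above. The real code is Go in the api-server.
def may_watch_session(username: str, owner: str, service_account: str) -> bool:
    """Allow the session owner, or the configured gRPC service account
    (read from GRPC_SERVICE_ACCOUNT at startup) which bypasses ownership."""
    if service_account and username == service_account:
        return True  # runner's OIDC service account: ownership check skipped
    return username == owner
```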
## Summary
- Seeds `credential:token-reader` and `credential:reader` roles via
migration `202603311216`
- Mounts runner BOT_TOKEN as file with 10-min background refresh loop in
control-plane
- Authorizes runner OIDC service account in `WatchSessionMessages`
- Adds exponential backoff retry in informer error handler
- Adds `acpctl apply -f` credential manifest support and role-binding
commands to CLI
## Test plan
- [ ] Deploy to OSD `ambient-s0` via ArgoCD (gitops MR !94 already merged)
- [ ] Verify `credential:token-reader` and `credential:reader` appear in
`acpctl get roles`
- [ ] Create role binding for `github-agent` in `credential-test`
project
- [ ] Start agent session and confirm runner pod retrieves token via
`GET /credentials/{id}/token`
🤖 Generated with [Claude Code](https://claude.ai/code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **New Features**
* Added role seeding and management capabilities to credentials system.
* Enhanced CLI with updated login, project context, and resource
management commands.
* Introduced declarative manifest application via `acpctl apply`.
* Added agent creation and session messaging commands.
* **Bug Fixes**
* Strengthened authorization validation for session message access.
* **Documentation**
* Expanded CLI reference with comprehensive command examples and usage
patterns.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
## Summary The `BackendURL` config field defaulted to `http://backend-service.ambient-code.svc:8080/api` — a legacy service that no longer exists. Runner pods need `BACKEND_API_URL` set to call `GET /credentials/{id}/token`. Since `AMBIENT_API_SERVER_URL` is already set in all deployments, default `BackendURL` to it. Discovered during E2E testing of the credential flow on OSD `ambient-s0`: runner logs showed DNS failures fetching credentials from the old backend URL. ## Test plan - [ ] Runner pod logs show `Successfully fetched github credentials from backend` instead of DNS failure on `backend-service.ambient-code.svc` - [ ] Agent can retrieve GitHub token via `/credentials/{id}/token` and use it 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Backend URL configuration now uses environment variable fallback logic (`BACKEND_API_URL` → `AMBIENT_API_SERVER_URL` → default to `http://localhost:8000`), enabling more flexible configuration across different deployment environments. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Ambient Code Bot <bot@ambient-code.local> Co-authored-by: Claude <noreply@anthropic.com>
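The fallback chain described above (`BACKEND_API_URL` → `AMBIENT_API_SERVER_URL` → localhost default) can be sketched as a small resolver. It takes an env mapping rather than reading `os.environ` directly so it is easy to test; the function name is hypothetical, not the actual config code:

```python
# Sketch of the BackendURL default logic described above (name is illustrative).
def resolve_backend_url(env: dict) -> str:
    """Fallback chain: BACKEND_API_URL → AMBIENT_API_SERVER_URL → localhost.
    Pass os.environ (or any mapping) as `env`."""
    return (env.get("BACKEND_API_URL")
            or env.get("AMBIENT_API_SERVER_URL")
            or "http://localhost:8000")
```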
…1205) # Human Edit The MPP service is `ambient-api-server.ambient-code--ambient-s0.svc`, so the change to the runner makes sense. ## Summary - The runner's cluster-local security check for `BACKEND_API_URL` only allowed `.svc.cluster.local` hostnames - OSD deployments set `AMBIENT_API_SERVER_URL` (and thus `BACKEND_API_URL`) using short-form DNS: `ambient-api-server.ambient-code--<ns>.svc:8000` - Short-form `.svc` DNS resolves only within the cluster — equivalent to `.svc.cluster.local` for security purposes - All credential fetches were silently rejected with `Refusing to send credentials to external host` ## Test plan - [ ] Deploy new runner image to OSD `ambient-s0` - [ ] Start agent session in `credential-test` project - [ ] Verify runner logs show `Fetching fresh github credentials from: http://ambient-api-server.ambient-code--ambient-s0.svc:8000/api/ambient/v1/credentials/{id}/token` - [ ] Verify `Successfully fetched github credentials from backend` 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Fixed an issue where Kubernetes service DNS names ending in `.svc` were incorrectly treated as external hosts, preventing credential transmission. These hostnames are now properly recognized as internal. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Ambient Code Bot <bot@ambient-code.local> Co-authored-by: Claude <noreply@anthropic.com>
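The relaxed cluster-local check described above — accepting short-form `.svc` hostnames alongside `.svc.cluster.local` — can be sketched like this. A minimal sketch, assuming the runner's actual check also parses the host out of the URL; the function name is hypothetical:

```python
from urllib.parse import urlparse

# Sketch of the cluster-local hostname check after the fix described above.
def is_cluster_local(url: str) -> bool:
    """True for hostnames that only resolve inside the cluster: short-form
    '<svc>.<ns>.svc' as well as fully-qualified '<svc>.<ns>.svc.cluster.local'.
    Both forms are equivalent for the credential-transmission security check."""
    host = urlparse(url).hostname or ""
    return host.endswith(".svc") or host.endswith(".svc.cluster.local")
```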
#1206) ## Summary - `refreshAllRunningTokens` called `factory.ForProject(ctx, "")` which the SDK rejects with `"project is required"` - The refresh loop was silently failing every 10 minutes, leaving runner pod BOT_TOKEN files stale - BOT_TOKEN expiry caused `UNAUTHENTICATED` errors on the gRPC stream — runner unable to push session messages - Fix: list all projects first (using a sentinel project client for the `/projects` endpoint which ignores the project header), then list running sessions per project and refresh each token ## Test plan - [ ] Deploy updated control-plane - [ ] Start a session, wait 10+ minutes - [ ] Verify control-plane logs show token refresh success (no "project is required" warn) - [ ] Verify runner pod continues streaming without UNAUTHENTICATED errors after 15 minutes 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Improvements** * Token refresh operations now reliably support multi-project environments with enhanced error handling that prevents single-project failures from disrupting the entire refresh process. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Ambient Code Bot <bot@ambient-code.local> Co-authored-by: Claude <noreply@anthropic.com>
## Summary - The token refresh loop fires every 10 minutes via a ticker - If the control-plane restarts while sessions are running, the ticker resets and won't fire for another 10 minutes - OIDC BOT_TOKENs have a ~15 min TTL — a runner pod started near end-of-token-life can expire before the first post-restart tick - Fix: call `refreshAllRunningTokens` once immediately on goroutine start before entering the ticker loop ## Test plan - [ ] Restart control-plane while a session is running - [ ] Verify "runner token refreshed" log appears within seconds of pod start - [ ] Verify runner does not get UNAUTHENTICATED on gRPC after control-plane restart 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Token refresh now triggers immediately upon startup before entering periodic refresh cycles, improving responsiveness. * Token refresh scope expanded to include all projects, ensuring tokens are refreshed across the entire system. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Ambient Code Bot <bot@ambient-code.local> Co-authored-by: Claude <noreply@anthropic.com>
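The immediate-then-periodic pattern described above — refresh once on goroutine start, then on every tick — can be sketched in Python with a thread and an event in place of the Go ticker. Names are illustrative; the real loop is a Go goroutine in the control-plane:

```python
import threading

# Sketch of the "refresh immediately, then tick" fix described above.
def start_refresh_loop(refresh, interval_s: float, stop: threading.Event) -> threading.Thread:
    """Run refresh() once immediately on start, then on a fixed interval
    until `stop` is set — so a restart never leaves a full interval-sized
    gap before the first refresh."""
    def loop():
        refresh()                         # immediate refresh on start (the fix)
        while not stop.wait(interval_s):  # then periodic ticks
            refresh()
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```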
## Summary
Six commits fixing the full credential flow from control-plane
provisioning through runner token auth and session event streaming.
| Commit | Component | Fix |
|--------|-----------|-----|
| `085028b7` | control-plane | Credential rolebinding and project delete |
| `a4cf9427` | control-plane | Default `BackendURL` to `AMBIENT_API_SERVER_URL` so runner pods reach the correct API server |
| `f82422a2` | control-plane | Token refresh loop: iterate all projects (was calling `ForProject("")` → SDK rejected empty project) |
| `49dcf935` | control-plane | Refresh runner tokens immediately on startup (eliminates expiry gap after CP restart) |
| `2faf4424` | api-server | Lowercase session ID in runner service DNS hostname (`session-{ID}.svc.cluster.local` must be lowercase) |
| `c16abb1d` | control-plane | Reduce token refresh interval 10m → 4m (OIDC TTL is 15m; 10m left too small a margin for runner reconnects) |
## Root Causes Fixed
- **Runner couldn't reach API server**: `BACKEND_API_URL` defaulted to
wrong value; now falls back to `AMBIENT_API_SERVER_URL`
- **Token refresh loop silently failed**: `ForProject("")` was rejected
by SDK; fixed by listing projects first
- **Token expired after CP restart**: Ticker reset on restart created a
gap; fixed by refreshing immediately on goroutine start
- **`session events` always 502**: API server built DNS name with raw
mixed-case session ID; K8s service names are lowercased by the
control-plane
- **Token expiry under load**: 10-minute refresh interval left too
little margin before 15-minute OIDC TTL; reduced to 4 minutes
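The session-ID lowercasing fix can be sketched as a tiny normalization helper. Kubernetes Service names are RFC 1123 labels and must be lowercase, so the api-server must lowercase the session ID before building the DNS name. The helper name is hypothetical and the host pattern follows the `session-{ID}.svc.cluster.local` form quoted in the table above:

```python
# Sketch of the DNS-name normalization described above (helper name is
# hypothetical; the real api-server code is Go).
def runner_service_host(session_id: str) -> str:
    """K8s service names are lowercased by the control plane, so the DNS
    name must be built from a lowercased session ID or lookups 502."""
    return f"session-{session_id.lower()}.svc.cluster.local"
```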
## Test plan
- [ ] Runner pod fetches credentials successfully (`Successfully fetched
github credentials from backend`)
- [ ] `acpctl session events <id>` streams without 502
- [ ] Sessions running >15 min do not hit `UNAUTHENTICATED: Token is
expired` on gRPC
🤖 Generated with [Claude Code](https://claude.ai/code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added built-in roles seeded at startup.
* Added JWT-based username extraction fallback for bearer tokens.
* **Documentation**
* Major CLI README updates: login flow, project context commands,
resource examples, declarative `apply`, and credentials/role-binding
guides.
* **Bug Fixes**
* Fail-fast session watch authorization for unauthenticated callers.
* More frequent and multi-project token refresh with immediate refresh
on start.
* Normalize runner names for streaming URLs.
* **Chores**
* Expanded test initialization imports to include additional plugins.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
…-56711 (#1213) ## Summary - **Eliminates async BOT_TOKEN Secret push/refresh loop** — replaces with synchronous pull from a new CP HTTP endpoint - **New `internal/tokenserver` package** — `GET /token` validates the caller's K8s SA token via TokenReview, checks `system:serviceaccount:*:session-*-sa` pattern, returns a fresh API token via `OIDCTokenProvider` - **Runner updated** — `_fetch_token_from_cp()` reads pod SA token from standard K8s mount, calls CP `/token` on startup and every gRPC reconnect; falls back to `BOT_TOKEN` env var for local dev when `AMBIENT_CP_TOKEN_URL` unset - **`kube_reconciler` cleaned up** — removes `ensureSecret`, `StartTokenRefreshLoop`, `refreshRunnerToken`, `refreshAllRunningTokens`; sets `automountServiceAccountToken: true`; injects `AMBIENT_CP_TOKEN_URL` instead of `BOT_TOKEN` secret ref ## Design rationale CP is the thing rotating the secret — if it's down for `/token` it's down for cycling too. Synchronous pull eliminates the 3-way race (CP ticker → kubelet propagation → runner read) and removes a class of stale-token failures. Spec: `docs/internal/design/control-plane.spec.md` ## Test plan - [ ] Deploy CP with `CP_TOKEN_LISTEN_ADDR=:8080` and `CP_TOKEN_URL=http://<cp-svc>:8080/token` - [ ] Start a session; confirm runner pod has `automountServiceAccountToken: true` and `AMBIENT_CP_TOKEN_URL` env set - [ ] Confirm runner fetches token on startup (log: `[GRPC CLIENT] Fetched fresh API token from CP token endpoint`) - [ ] Force gRPC reconnect; confirm fresh token fetched via CP endpoint - [ ] Verify `GET /healthz` on token server returns 200 - [ ] Confirm non-runner SA token is rejected (403) - [ ] Local dev: unset `AMBIENT_CP_TOKEN_URL`, set `BOT_TOKEN`; confirm fallback path works 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Control plane exposes a new /token endpoint that issues API tokens on demand. 
* Runners now obtain tokens from the control plane (AMBIENT_CP_TOKEN_URL) using their ServiceAccount tokens; BOT_TOKEN secret is no longer used. * Control plane can optionally ensure a NetworkPolicy to allow API-server access between namespaces. * **Configuration** * Added CP token settings (CP_TOKEN_LISTEN_ADDR, CP_TOKEN_URL) to control plane config. * **Documentation** * Design doc updated to describe the CP token endpoint, runner auth flow, and security implications. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Ambient Code Bot <bot@ambient-code.local> Co-authored-by: Claude <noreply@anthropic.com>
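The runner-side token acquisition described above — call the CP `/token` endpoint with the pod's mounted SA token when `AMBIENT_CP_TOKEN_URL` is set, otherwise fall back to `BOT_TOKEN` for local dev — can be sketched as a small decision function. This is a hedged approximation, not the actual `_fetch_token_from_cp()` code; the function name and return shape are hypothetical:

```python
from pathlib import Path

# Standard K8s projected SA token mount (per the PR description above).
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

def token_fetch_plan(env: dict, sa_token_path: str = SA_TOKEN_PATH) -> dict:
    """Decide how the runner obtains its API token:
    - CP mode: AMBIENT_CP_TOKEN_URL set → GET that URL, authenticating with
      the pod's mounted ServiceAccount token as the bearer credential
    - local dev: no CP URL → use the BOT_TOKEN env var directly."""
    cp_url = env.get("AMBIENT_CP_TOKEN_URL")
    if cp_url:
        sa_token = Path(sa_token_path).read_text().strip()
        return {"mode": "cp", "url": cp_url,
                "headers": {"Authorization": f"Bearer {sa_token}"}}
    return {"mode": "local", "token": env.get("BOT_TOKEN", "")}
```

The actual runner would then issue the HTTP GET (on startup and on every gRPC reconnect) using the returned URL and headers.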
…CP_TOKEN_URL (#1214) ## Summary Follow-up to #1213. The CP token endpoint was running but unreachable because: 1. **No Service** — port 8080 (token server) had no ClusterIP Service, so runner pods had no DNS target and the NetworkPolicy peer had no stable reference 2. **Wrong `CP_RUNTIME_NAMESPACE`** — defaulted to `ambient-code--runtime-int` but the actual deployed namespace is `ambient-code--ambient-s0`, so `ensureAPIServerNetworkPolicy()` was creating a NetworkPolicy that matched the wrong namespace selector — causing `acpctl session events` to still 502 ## Changes - `ambient-control-plane-svc.yaml` — new ClusterIP Service exposing port 8080 on the CP pod - `ambient-control-plane.yaml` — inject `CP_RUNTIME_NAMESPACE` via downward API (`metadata.namespace`) so the NetworkPolicy peer label matches the actual runtime namespace; set `CP_TOKEN_URL` to the FQDN of the new Service ## Test plan - [ ] Deploy to int spoke - [ ] Verify `oc get svc ambient-control-plane -n ambient-code--ambient-s0` exists with port 8080 - [ ] Start new session; verify `allow-ambient-api-server` NetworkPolicy in session namespace uses correct namespace selector (`ambient-code--ambient-s0`) - [ ] `acpctl session events <id>` streams without 502 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added ambient control plane service with token authentication capability * Configured control plane to integrate with token service endpoint for runtime authentication <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Ambient Code Bot <bot@ambient-code.local> Co-authored-by: Claude <noreply@anthropic.com>
…ken server (#1215) ## Summary - Adds `ambient-cp-token-netpol.yaml`: NetworkPolicy in the CP namespace allowing runner pods (any namespace with `tenant.paas.redhat.com/tenant: ambient-code`) to call the CP token server on port 8080 - Namespace placeholder is `ambient-code--runtime-int`; actual spoke namespace is patched by the GitOps config repo ## Context Runner pods were crashing with `CP token endpoint unreachable after 3 attempts` because the `internal-1` NetworkPolicy in the CP namespace blocks cross-namespace ingress by default. This NetworkPolicy was applied manually as a hotfix and this PR adds it to the manifests. ## Test plan - [ ] Deploy to spoke and confirm runner pods can reach `CP_TOKEN_URL` on startup - [ ] Confirm `acpctl session events $id` streams without 502 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
…nner token endpoint (#1216) ## Summary - CP bootstraps an RSA-4096 keypair Secret (`ambient-cp-token-keypair`) in its namespace on startup via the project kube client; generates if missing - Private key loaded into token server for decryption; public key injected as `AMBIENT_CP_TOKEN_PUBLIC_KEY` into all runner Job pods - Runners RSA-OAEP/SHA-256 encrypt their `SESSION_ID` with the public key, send base64 ciphertext as `Authorization: Bearer` - CP decrypts to verify the caller — no `TokenReview` cluster permission required - Keypair persists across CP restarts in the K8s Secret; future path is Vault-backed ExternalSecret with no code change ## Motivation The CP SA does not have (and cannot be granted via tenant operator) cluster-scoped `create tokenreviews` permission. The previous `TokenReview`-based validation returned 401 for all runners. ## Test plan - [ ] Deploy CP — confirm `ambient-cp-token-keypair` Secret created in CP namespace on first boot - [ ] Create a session — confirm runner pod starts without `CP token endpoint unreachable` error - [ ] Confirm `acpctl session events $id` streams without 502 - [ ] Restart CP — confirm runner pods created after restart can still fetch tokens 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Automatic control-plane token keypair bootstrap on startup. * Added token server health check endpoint. * **Refactor** * Token authentication now uses RSA-encrypted session IDs with local cryptographic validation. * Runner now encrypts session IDs with the control-plane public key; public key is injected into runtime containers. * NetworkPolicy added to restrict access to the token endpoint. * **Chores** * Added cryptography dependency. * **Tests** * New unit tests for keypair bootstrapping, token handling, and runner token fetch behavior. 
<!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Ambient Code Bot <bot@ambient-code.local> Co-authored-by: Claude <noreply@anthropic.com>
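The RSA-OAEP/SHA-256 scheme from #1216 can be shown as a round trip using the `cryptography` package the PR adds. A sketch under stated assumptions: the real CP bootstraps RSA-4096 (2048 here only for speed), and the real key handling goes through the `ambient-cp-token-keypair` Secret rather than in-process generation:

```python
# Round-trip sketch of the #1216 scheme: runner encrypts SESSION_ID with the
# CP public key (RSA-OAEP/SHA-256), sends base64 ciphertext as the bearer;
# CP decrypts with the private key to verify the caller. Illustration only.
import base64
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# CP side: bootstrap a keypair (real CP: RSA-4096, persisted in a K8s Secret)
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Runner side: encrypt the session ID with the injected public key
session_id = "sess-abc123"
bearer = base64.b64encode(public_key.encrypt(session_id.encode(), oaep)).decode()

# CP side: decrypt the Authorization bearer to recover the session ID
recovered = private_key.decrypt(base64.b64decode(bearer), oaep).decode()
```

Because only the CP holds the private key, a valid decryption proves the caller was given the legitimate public key by the CP-provisioned Job spec — no cluster-scoped `TokenReview` permission needed.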
…end API calls (#1217) ## Summary - After fetching from the CP token endpoint, `_fetch_token_from_cp()` now calls `set_bot_token(token)` to store the OIDC token in a module-level cache in `utils.py` - `get_bot_token()` checks that cache first, so `auth.py` credential fetches to the backend API use the OIDC token instead of an empty string - Adds a regression test that verifies `get_bot_token()` is empty before a CP fetch and returns the token after — confirmed to fail without the fix ## Root cause The CP-fetched OIDC token was stored only in `AmbientGRPCClient._token` (gRPC channel auth). `auth.py`'s `get_bot_token()` had no access to it, so credential token fetches went out unauthenticated → HTTP 401 on every session run. 🤖 Generated with [Claude Code](https://claude.ai/code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Added caching for bot tokens fetched from the control plane for improved performance. * Updated token sourcing priority to prefer control-plane tokens over other sources. * **Tests** * Added integration tests for token fetching and caching mechanisms. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Ambient Code Bot <bot@ambient-code.local> Co-authored-by: Claude <noreply@anthropic.com>
The runner was using the CP-fetched OIDC token (get_bot_token()) when
calling GET /credentials/{id}/token on the backend. The backend's
enforceCredentialRBAC only classifies a caller as isBotToken=true when
SelfSubjectReview resolves to system:serviceaccount:* — the CP OIDC
token is not a K8s SA token and fails this check, resulting in HTTP 401.
Fix: use the K8s SA token mounted at
/var/run/secrets/kubernetes.io/serviceaccount/token as the primary
credential for backend calls when no caller token is present. The SA
token authenticates as system:serviceaccount:<ns>:<sa> which the backend
trusts as isBotToken=true and grants access to the session owner's
credentials.
Adds get_sa_token() to platform/utils.py and two regression tests.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
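The `get_sa_token()` helper this PR says it adds to `platform/utils.py` can be sketched as below. This is a hedged approximation of the described behavior, not the actual code: read the mounted SA token, and return an empty string outside a pod so callers can fall back to `get_bot_token()`:

```python
# Sketch of the get_sa_token() helper described above (approximation, not
# the actual platform/utils.py code).
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

def get_sa_token(path: str = SA_TOKEN_PATH) -> str:
    """Read the pod's mounted K8s ServiceAccount token — the credential the
    backend classifies as isBotToken=true via SelfSubjectReview. Returns ""
    when no token is mounted (e.g. local dev), letting callers fall back."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""
```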
CodeRabbit review skipped: this PR contains 255 files, 105 over the limit of 150. Run configuration: `.coderabbit.yaml`, review profile CHILL, plan Pro. 45 files ignored due to path filters; 255 files selected for processing.
} else if (navState.type === 'pa') {
var proj = allProjects.find(function(x) { return x.id === navState.projectId; }) || { name: navState.projectId, id: navState.projectId };
var pa = (allPas[navState.projectId] || []).find(function(x) { return x.id === navState.paId; });
bar.append(sep + '<a class="bc-item" href="#" onclick="navProject(\'' + proj.id + '\');return false;"><i class="fas fa-folder-open me-1"></i>' + escHtml(proj.name) + '</a>');
Check failure — Code scanning / CodeQL: DOM text reinterpreted as HTML (High)
Copilot Autofix (AI, about 5 hours ago):
In general, fix this by avoiding reinterpreting unescaped values from the DOM as HTML—either by using DOM APIs (creating elements and setting their properties/attributes) instead of HTML string concatenation, or by properly encoding/escaping any untrusted values for the specific context (HTML attribute, JavaScript string, etc.) before insertion.
The best fix here without changing functionality is to stop interpolating proj.id (and similar IDs) into raw HTML strings and instead create the breadcrumb links using jQuery’s element creation APIs, setting text, attr, and on('click', ...) directly. This way, proj.id is never parsed as HTML or as part of an inline JavaScript attribute, eliminating the XSS vector while preserving the visual layout and navigation behavior. Specifically, in renderBreadcrumb() we should:
- Keep `sep` as a static HTML string (it contains no untrusted data).
- Replace the `bar.append(...)` calls that concatenate `proj.id`, `paForS.id`, and `projForS.id` into `<a ... onclick="...">...</a>` strings with code that:
  - Creates an `<a>` element via `$('<a></a>')`.
  - Assigns `class="bc-item"` and `href="#"`.
  - Sets the inner HTML of the icon and label, using `escHtml` for any dynamic label text.
  - Attaches a click handler with `on('click', ...)` that calls the appropriate navigation function and `return false;`.

This requires only edits within `renderBreadcrumb()` in `components/ambient-sdk/ts-sdk/example/index.html`; no new imports or helpers are necessary.
@@ -437,21 +437,43 @@
 bar.append('<span class="bc-current"><i class="fas fa-layer-group me-1"></i>All Projects</span>');
 return;
 }
-bar.append('<a class="bc-item" href="#" onclick="navProjects();return false;"><i class="fas fa-layer-group me-1"></i>All Projects</a>');
+var allProjectsLink = $('<a class="bc-item" href="#"></a>');
+allProjectsLink.html('<i class="fas fa-layer-group me-1"></i>All Projects');
+allProjectsLink.on('click', function(e) { e.preventDefault(); navProjects(); });
+bar.append(allProjectsLink);
 if (navState.type === 'project') {
 var p = allProjects.find(function(x) { return x.id === navState.id; }) || { name: navState.id };
 bar.append(sep + '<span class="bc-current"><i class="fas fa-folder-open me-1"></i>' + escHtml(p.name) + '</span>');
 } else if (navState.type === 'pa') {
 var proj = allProjects.find(function(x) { return x.id === navState.projectId; }) || { name: navState.projectId, id: navState.projectId };
 var pa = (allPas[navState.projectId] || []).find(function(x) { return x.id === navState.paId; });
-bar.append(sep + '<a class="bc-item" href="#" onclick="navProject(\'' + proj.id + '\');return false;"><i class="fas fa-folder-open me-1"></i>' + escHtml(proj.name) + '</a>');
+var projLink = $('<a class="bc-item" href="#"></a>');
+projLink.html('<i class="fas fa-folder-open me-1"></i>' + escHtml(proj.name));
+projLink.on('click', function(e) { e.preventDefault(); navProject(proj.id); });
+bar.append(sep);
+bar.append(projLink);
 bar.append(sep + '<span class="bc-current"><i class="fas fa-robot me-1"></i>' + escHtml(pa ? pa.name : navState.paId) + '</span>');
 } else if (navState.type === 'session') {
 var s = findSession(navState.id);
 var paForS = findPaForSession(navState.id);
 var projForS = paForS ? findProjectForPa(paForS.id) : null;
-if (projForS) bar.append(sep + '<a class="bc-item" href="#" onclick="navProject(\'' + projForS.id + '\');return false;"><i class="fas fa-folder-open me-1"></i>' + escHtml(projForS.name) + '</a>');
-if (paForS) bar.append(sep + '<a class="bc-item" href="#" onclick="navPa(\'' + (projForS ? projForS.id : '') + '\',\'' + paForS.id + '\');return false;"><i class="fas fa-robot me-1"></i>' + escHtml(paForS.name) + '</a>');
+if (projForS) {
+var projForSLink = $('<a class="bc-item" href="#"></a>');
+projForSLink.html('<i class="fas fa-folder-open me-1"></i>' + escHtml(projForS.name));
+projForSLink.on('click', function(e) { e.preventDefault(); navProject(projForS.id); });
+bar.append(sep);
+bar.append(projForSLink);
+}
+if (paForS) {
+var paForSLink = $('<a class="bc-item" href="#"></a>');
+paForSLink.html('<i class="fas fa-robot me-1"></i>' + escHtml(paForS.name));
+paForSLink.on('click', function(e) {
+e.preventDefault();
+navPa(projForS ? projForS.id : '', paForS.id);
+});
+bar.append(sep);
+bar.append(paForSLink);
+}
 bar.append(sep + '<span class="bc-current"><i class="fas fa-terminal me-1"></i>' + escHtml(s ? s.name : navState.id) + '</span>');
 }
 }
});
html += '</div>';
}
$('#mainPanel').html(html);
Check failure — Code scanning / CodeQL: DOM text reinterpreted as HTML (High)
Copilot Autofix (AI, about 5 hours ago):
To fix this class of issue, untrusted data must never be injected into the DOM as HTML. Either (a) escape meta-characters before concatenating into HTML strings, or (b) avoid .html() and instead build DOM nodes using safe APIs like .text() or document.createElement, setting user-controlled parts via textContent. Given the surrounding code already uses an escHtml helper and relies heavily on string-built HTML, the least intrusive fix is to systematically escape any untrusted values before concatenation.
For the specific path CodeQL highlights, the vulnerable sink is $('#mainPanel').html(html); in renderPaPanel, with html tainted via renderAnnotationsPanel. The safest, minimal-change fix is to ensure that any data passed into renderAnnotationsPanel that can carry untrusted content (particularly projectId and agentId) is escaped before being embedded into the onclick attribute at line 884. There is already an escHtml function being used nearby (e.g., escHtml(pa.prompt), escHtml(pa.id)), so we should apply the same escaping when constructing the onclick handler string. Since renderAnnotationsPanel uses JSON.stringify to serialize projectId and agentId into JavaScript string literals inside an HTML attribute, we need to ensure those JSON strings are properly escaped for HTML context; we can do this by running them through escHtml when building html.
Concretely:
- In `renderAnnotationsPanel`, at the section around lines 882–885, wrap the `JSON.stringify(projectId||'')` and `JSON.stringify(agentId||'')` calls in `escHtml(...)` before concatenating them into the `onclick` attribute string.
- This keeps the logic intact (still passes the same JS values) but ensures any `<`, `>`, `&`, or quote characters originating from `projectId`/`agentId` cannot break out into HTML/JS.
- No new imports or libraries are required; `escHtml` already exists in this file and is used elsewhere.
@@ -881,7 +881,7 @@
 }
 html += '<div class="p-2" style="border-top:1px solid rgba(255,255,255,.06)">' +
 '<button class="btn btn-outline-secondary btn-sm" style="font-size:.72rem" ' +
-'onclick="openEditAnnotation(' + JSON.stringify(resourceType) + ',' + JSON.stringify(projectId||'') + ',' + JSON.stringify(agentId||'') + ',\'\',\'\')">' +
+'onclick="openEditAnnotation(' + JSON.stringify(resourceType) + ',' + escHtml(JSON.stringify(projectId||'')) + ',' + escHtml(JSON.stringify(agentId||'')) + ',\'\',\'\')">' +
 '<i class="fas fa-plus me-1"></i>Add annotation</button></div>';
 html += '</div></details>';
 return html;
## Summary
- The runner was sending the CP-fetched OIDC token when calling `GET /credentials/{id}/token` — the backend's `enforceCredentialRBAC` only classifies a caller as `isBotToken=true` when `SelfSubjectReview` resolves to `system:serviceaccount:*`; the OIDC token fails this check and returns HTTP 401
- `_fetch_credential` now uses the K8s SA token (mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token`) as the primary auth for backend credential calls when no caller token is present
- The SA token is preferred over `get_bot_token()` (CP OIDC); `get_bot_token()` remains as fallback for local dev without a mounted SA token
## Test plan
- [ ] `TestFetchCredentialSAToken::test_uses_sa_token_when_no_caller_token` — verifies SA token is sent to backend
- [ ] `TestFetchCredentialSAToken::test_sa_token_preferred_over_bot_token_when_no_caller_token` — verifies SA token wins over CP OIDC token
- [ ] `test_shared_session_credentials.py` pass
🤖 Generated with Claude Code