ci-rootcause

Deterministic multi-agent CI root-cause analysis. Not a log summarizer.

Why This Exists

Most AI CI tools summarize logs. That is not enough.

CI failures are execution-system failures: one early failure can create dozens of downstream symptoms, and the useful answer is the first evidence-backed root cause, not a plausible paragraph about the log.

ci-rootcause treats CI as a deterministic debugging problem:

reconstruct the failure structure from logs and CI metadata,
compare the failing run against the code diff,
identify the first failure point,
rank root causes with deterministic confidence,
produce reproducible artifacts,
optionally create a guarded fix PR that must pass validation.
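
The "identify the first failure point" step above can be illustrated with a minimal sketch. It is not the project's implementation: the `ERROR_PATTERN` signature, the `FirstFailure` type, and the `find_first_failure` helper are hypothetical names chosen for illustration, and real logs need tool-specific patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical error signature; a real analyzer would use per-tool patterns.
ERROR_PATTERN = re.compile(r"\b(error|failed|traceback)\b", re.IGNORECASE)


@dataclass(frozen=True)
class FirstFailure:
    line_number: int  # 1-based index into the log
    text: str


def find_first_failure(log_text: str) -> Optional[FirstFailure]:
    """Return the earliest log line that matches the error signature."""
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            return FirstFailure(number, line.strip())
    return None


log = "\n".join(
    [
        "Run mypy .",
        "app.py:5: error: Argument 1 has incompatible type",
        "Error: Process completed with exit code 1.",
    ]
)
first = find_first_failure(log)
print(first)
```

Scanning top to bottom and stopping at the first match means the downstream "Process completed with exit code 1" symptom is never reported as the cause.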

What It Does

Primary path: install the GitHub App, receive root-cause analysis (RCA) comments on failed workflow_run events, and optionally allow guarded fix PRs.

Core outputs:

Structured failure graph
Deterministic failure classification
Diff-aware root-cause ranking
Deterministic confidence score
Evidence-backed suggested fix
Optional agentic fix proposal with deterministic validation
Optional guarded fix PR, never auto-merged
ci-rca.json and ci-rca.md artifacts
ci-rca-observability.json telemetry with trace, timing, and reason-code data

Supported failure classes in the current benchmark: TYPECHECK, LINT, TEST, DEPENDENCY, and INFRA.
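
As a rough illustration of deterministic, pattern-based classification into these five classes, the sketch below checks a fixed, ordered rule list and returns the first match. The regular expressions, the `RULES` table, the `classify` helper, and the `UNKNOWN` fallback are all assumptions made for this example, not the project's actual signatures.

```python
import re

# Hypothetical signatures, checked in a fixed order so results are repeatable.
RULES = [
    ("TYPECHECK", re.compile(r"error TS\d+|incompatible type")),
    ("LINT", re.compile(r"ruff|eslint|flake8", re.IGNORECASE)),
    ("TEST", re.compile(r"AssertionError|\d+ failed")),
    ("DEPENDENCY", re.compile(r"ModuleNotFoundError|lockfile|Could not resolve")),
    ("INFRA", re.compile(r"timed out|runner lost|No space left on device", re.IGNORECASE)),
]


def classify(log_text: str) -> str:
    """Return the first matching class label, or UNKNOWN (an assumed fallback)."""
    for label, pattern in RULES:
        if pattern.search(log_text):
            return label
    return "UNKNOWN"


print(classify("app.py:5: error: Argument 1 to needs_int has incompatible type"))
print(classify("The job timed out after 60 minutes"))
```

Because the rule order never changes, the same log always produces the same label, which is the property the benchmark's reproducibility checks rely on.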

Proven Results

Curated MVP benchmark (13 cases):

Metric	Result
Classification accuracy	`100%` (`13/13`)
Baseline classification accuracy	`69.23%` (`9/13`)
Classification lift	`+30.77` percentage points, about `44.4%` relative lift
Top-1 root-cause accuracy	`100%` (`12/12` applicable cases)
Agentic proposal validity	`100%` (`6/6` exercised cases)
Guarded validation gate pass rate	`50%` (`3/6`: three valid fixes passed and three bad fixes were blocked, as intended)
Artifact hash reproducibility	`100%`
Confidence reproducibility	`100%`
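
The classification lift figures in the table follow directly from the two counts. A short worked check, using only the numbers reported above:

```python
# Counts taken from the benchmark table: 13/13 versus a 9/13 baseline.
correct, baseline_correct, total = 13, 9, 13

accuracy = correct / total
baseline = baseline_correct / total

# Absolute lift in percentage points and lift relative to the baseline.
lift_points = (accuracy - baseline) * 100
relative_lift = (accuracy - baseline) / baseline * 100

print(f"{lift_points:.2f} points, {relative_lift:.1f}% relative")
```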

Validated coverage so far:

Automated test suite: 299 tests passing, 1 opt-in live GitHub test skipped by default.
Benchmark failure classes: TYPECHECK, LINT, TEST, DEPENDENCY, and INFRA.
Agentic benchmark coverage: 6 proposal cases across lint, test, and typecheck fixes.
Guardrails: safe default comment-only mode, PR opt-in gate, confidence threshold, scoped file changes, validation pass/fail, missing hosted API key, malformed proposal retry, low-signal comment suppression, and formatted fix commits.
Live GitHub App smoke coverage: real workflow_run webhooks, PR comment create/update, typecheck/dependency/infra failures, local/Ollama suggestions, validation-failed PR gate, async webhook acknowledgement for slow local models, and app-created fix PRs that pass repository CI.

Example Output

ci-rca.json is designed for automation and reproducible comparison:

{
  "classification": "TYPECHECK",
  "confidence": 0.9,
  "primary_root_cause": {
    "title": "Argument 1 to needs_int has incompatible type str; expected int",
    "file": "app_failure_typecheck.py",
    "line": 5,
    "evidence": [
      "mypy reported an incompatible argument type",
      "the failing file changed in the PR diff",
      "the failure is the first reported error in the job"
    ]
  },
  "suggested_fix": "Change the string argument to an integer and run the typecheck validation command."
}
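
Since the artifact is meant for automation, a consumer might read it along these lines. This is a sketch that assumes only the fields shown in the example above; the `summarize` helper and the one-line output format are hypothetical.

```python
import json

# Trimmed copy of the example artifact shown above.
SAMPLE = """
{
  "classification": "TYPECHECK",
  "confidence": 0.9,
  "primary_root_cause": {
    "title": "Argument 1 to needs_int has incompatible type str; expected int",
    "file": "app_failure_typecheck.py",
    "line": 5
  }
}
"""


def summarize(report_text: str) -> str:
    """Build a one-line summary from a ci-rca.json payload."""
    report = json.loads(report_text)
    cause = report["primary_root_cause"]
    location = f'{cause["file"]}:{cause["line"]}'
    return f'{report["classification"]} ({report["confidence"]:.2f}) at {location}'


print(summarize(SAMPLE))
```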

ci-rca.md is designed for humans:

First failure: app_failure_typecheck.py:5
Root cause: string passed where an integer is required
Suggested fix: replace needs_int("7") with needs_int(7)
Guardrail: create a fix PR only when confidence and validation pass

What Makes This Different

Deterministic confidence scoring, not LLM-generated confidence.
Diff-aware analysis, so changed files influence ranking.
Reproducible output artifacts for regression testing and audits.
Structured failure graph reconstruction instead of raw log summarization.
Machine-readable reason codes for skipped, partial, and failed app outcomes.
Guarded PR creation with confidence, file-scope, and validation gates.
Provider-optional agentic mode: deterministic by default, local/Ollama or hosted LLMs when explicitly enabled.

App-First Quickstart

Recommended path for new users: install the GitHub App. No workflow YAML is required in target repositories.

1. Install the GitHub App on the target repository.
2. Start with safe defaults:
   - CI_ROOTCAUSE_APP_ENABLED=true
   - CI_ROOTCAUSE_APP_POST_COMMENT=true
   - CI_ROOTCAUSE_APP_ENABLE_PR_MODE=false
   - CI_ROOTCAUSE_APP_CREATE_FIX_PR=false
3. Trigger a failed workflow_run.
4. Verify the PR or commit receives an RCA comment with:
   - classification,
   - confidence,
   - first-failure evidence,
   - suggested fix,
   - artifact paths,
   - outcome reason codes.
5. Enable fix PR creation only after comment-only behavior is trusted:
   - CI_ROOTCAUSE_APP_ENABLE_PR_MODE=true
   - CI_ROOTCAUSE_APP_CREATE_FIX_PR=true
   - CI_ROOTCAUSE_APP_MIN_PR_CONFIDENCE=0.75
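
One way to picture how these settings could combine into a single gate is sketched below. The variable names come from the steps above, but the `should_create_fix_pr` helper, the way the flags are combined, and the `0.75` fallback threshold are assumptions for illustration rather than the app's documented logic.

```python
from typing import Mapping


def should_create_fix_pr(
    env: Mapping[str, str], confidence: float, validation_passed: bool
) -> bool:
    """Allow a fix PR only when every opt-in flag and gate agrees (assumed logic)."""

    def flag(name: str) -> bool:
        # Unset flags are treated as false, matching the safe defaults.
        return env.get(name, "false").strip().lower() == "true"

    # Assumed fallback threshold, mirroring the example value above.
    threshold = float(env.get("CI_ROOTCAUSE_APP_MIN_PR_CONFIDENCE", "0.75"))
    return (
        flag("CI_ROOTCAUSE_APP_ENABLE_PR_MODE")
        and flag("CI_ROOTCAUSE_APP_CREATE_FIX_PR")
        and confidence >= threshold
        and validation_passed
    )


SAFE_DEFAULTS = {
    "CI_ROOTCAUSE_APP_ENABLE_PR_MODE": "false",
    "CI_ROOTCAUSE_APP_CREATE_FIX_PR": "false",
}
PR_ENABLED = {
    "CI_ROOTCAUSE_APP_ENABLE_PR_MODE": "true",
    "CI_ROOTCAUSE_APP_CREATE_FIX_PR": "true",
    "CI_ROOTCAUSE_APP_MIN_PR_CONFIDENCE": "0.75",
}

print(should_create_fix_pr(SAFE_DEFAULTS, 0.9, True))
print(should_create_fix_pr(PR_ENABLED, 0.9, True))
```

Requiring every condition at once keeps comment-only mode the default outcome whenever any single setting is missing.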

Optional GitHub Action Usage

The GitHub App is the primary product path. The composite action remains available for workflow-based users and migration support.

- uses: ibrahim1023/ci-rootcause-action@v0
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    post_pr_comment: true
    create_fix_pr: false
    max_fix_files: 5

Useful optional inputs:

mode: deterministic, agentic_assist, or agentic_full
provider: local, openai, gemini, or anthropic
provider_api_key: required for hosted providers in agentic modes
validation_commands: semicolon or newline separated validation commands
min_pr_confidence: minimum confidence before guarded PR creation
offline_only: skip remote PR creation/network calls
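
The validation_commands input accepts semicolon or newline separated values. A minimal sketch of how such a string could be split is shown below; `parse_validation_commands` is a hypothetical helper, and this naive split would not handle a semicolon inside a quoted argument.

```python
import re
from typing import List


def parse_validation_commands(raw: str) -> List[str]:
    """Split on semicolons or newlines and drop empty entries."""
    return [part.strip() for part in re.split(r"[;\n]", raw) if part.strip()]


print(parse_validation_commands("ruff check .; pytest\nmypy ."))
```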

Agentic Modes

Recommended default for new users: deterministic.

Mode	Autonomy	Key requirement	Cost profile	Risk profile
`deterministic`	Rule-based diagnosis and fix planning	None	Lowest	Lowest
`agentic_assist`	LLM proposes candidate fix steps; deterministic pipeline validates or falls back	Hosted providers require API key; `local` does not	Medium	Low-medium
`agentic_full`	Highest autonomy path with explicit opt-in gate	Hosted providers require API key; `local` does not	Highest	Highest

Provider support:

Hosted: openai, gemini, anthropic.
Local: local for Ollama-compatible endpoints.
Recommended local default: qwen2.5-coder:3b.
Use qwen2.5-coder:7b when local hardware can tolerate slower responses.

GitHub App examples:

export CI_ROOTCAUSE_APP_MODE=agentic_assist
export CI_ROOTCAUSE_APP_LLM_PROVIDER=local
export CI_ROOTCAUSE_APP_LLM_MODEL=qwen2.5-coder:3b
export CI_ROOTCAUSE_APP_LLM_BASE_URL=http://localhost:11434

export CI_ROOTCAUSE_APP_MODE=agentic_assist
export CI_ROOTCAUSE_APP_LLM_PROVIDER=openai
export CI_ROOTCAUSE_APP_LLM_API_KEY=<secret>

Evaluation

We evaluate correctness and safety, not just plausibility.

Current gates include:

classification match rate,
primary root-cause accuracy,
top-1 root-cause accuracy,
confidence reproducibility,
artifact hash reproducibility,
agentic proposal schema validity,
validation gate behavior,
app reason-code stability.
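
The artifact hash reproducibility gate can be pictured as hashing a canonical serialization of each artifact and comparing digests across runs. The sketch below assumes SHA-256 over key-sorted JSON; the `artifact_hash` helper and the choice of canonical form are illustrative assumptions, not a description of the benchmark script.

```python
import hashlib
import json


def artifact_hash(artifact: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the digest."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = {"classification": "TYPECHECK", "confidence": 0.9}
run_b = {"confidence": 0.9, "classification": "TYPECHECK"}  # same data, new order
run_c = {"classification": "TYPECHECK", "confidence": 0.8}

print(artifact_hash(run_a) == artifact_hash(run_b))
print(artifact_hash(run_a) == artifact_hash(run_c))
```

Pinned inputs such as a fixed timestamp matter here, because any run-specific value in the artifact would change the digest.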

Run the benchmark locally:

python scripts/run_benchmark.py \
  --suite fixtures/benchmarks/mvp-suite.json \
  --output-root artifacts/benchmark-mvp \
  --report-json docs/reports/mvp-benchmark-report.json \
  --report-md docs/reports/mvp-benchmark-report.md

Compare local Ollama models:

python scripts/run_ollama_comparison.py \
  --suite fixtures/benchmarks/mvp-suite.json \
  --llm-model qwen2.5-coder:3b \
  --report-json artifacts/ollama-comparison/qwen2.5-coder-3b.json

Who This Is For

Developers debugging failed CI runs.
Teams dealing with noisy or flaky pipelines.
AI infra teams building reliable automation.
Engineering teams that need explainable, repeatable CI debugging.
Maintainers who want suggested fixes without unsafe auto-merges.

Architecture Overview

flowchart LR
  A[CI logs + diff] --> B[Log ingest agent]
  A --> C[Diff analysis agent]
  B --> D[Failure classification agent]
  C --> E[Root-cause ranker agent]
  D --> E
  E --> F[Fix planner agent]
  E --> G[Reporter agent]
  F --> H[PR creation agent]
  G --> I[Artifacts: ci-rca.json + ci-rca.md]
  H --> J[Guarded fix PR]

Execution order is deterministic and fixed:

log_ingest
diff_analysis
failure_classification
root_cause_ranker
fix_planner
reporter
pr_creation
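
A fixed execution order like the one above can be enforced by iterating over a constant tuple rather than over whatever order the stages were registered in. In this sketch the stage names come from the list above, while `run_pipeline`, `make_stage`, and the `trace` key are hypothetical stand-ins for the real agents.

```python
from typing import Callable, Dict

STAGE_ORDER = (
    "log_ingest",
    "diff_analysis",
    "failure_classification",
    "root_cause_ranker",
    "fix_planner",
    "reporter",
    "pr_creation",
)

Stage = Callable[[dict], dict]


def run_pipeline(stages: Dict[str, Stage], state: dict) -> dict:
    """Run every stage in STAGE_ORDER, regardless of registration order."""
    for name in STAGE_ORDER:
        state = stages[name](state)
    return state


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""

    def stage(state: dict) -> dict:
        return {**state, "trace": state.get("trace", []) + [name]}

    return stage


# Register the stages in reverse to show that registration order is ignored.
stages = {name: make_stage(name) for name in reversed(STAGE_ORDER)}
result = run_pipeline(stages, {})
print(result["trace"])
```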

Runtime behavior:

Uses Google ADK runtime orchestration when google-adk is installed.
Falls back to deterministic local orchestration if ADK runtime initialization fails.
Uses deterministic local orchestration when --fail-fast is enabled.

Design Constraints

No LLM-based confidence scoring.
No hallucinated root causes without log or diff evidence.
Fixes must be evidence-backed and scoped.
PR creation is opt-in and guarded.
No auto-merge.
No branch-protection bypass.
No repo-wide autonomous refactors.
Deterministic output is preferred over clever but unstable behavior.

Vision

CI debugging should be deterministic, explainable, and reproducible.

The goal is not to replace CI, linters, typecheckers, or test suites. The goal is to connect failed execution evidence to the smallest useful diagnosis and, when explicitly enabled, a reviewable fix PR.

Local Setup

Requirements:

Python 3.11+

Install tools:

python -m pip install --upgrade pip
pip install -r requirements.txt
pre-commit install

Run checks:

ruff check .
ruff format --check .
pytest

CLI Quickstart

Run the local deterministic pipeline once:

ci-rootcause \
  --log-path fixtures/ci-logs/github-actions-python-failure.log \
  --diff-path fixtures/diffs/refactor-only.diff \
  --output-dir artifacts \
  --timestamp 2026-02-21T00:00:00Z \
  --commit abc123 \
  --run-id gha_quickstart_1 \
  --base-commit abc122 \
  --head-commit abc123 \
  --repository owner/repo

Inspect generated artifacts:

artifacts/ci-rca.json
artifacts/ci-rca.md

CLI behavior:

Writes ci-rca.json and ci-rca.md into --output-dir.
Prints a machine-readable JSON summary to stdout.
Exits 0 for completed/partial analysis runs, 2 for runtime/input errors.
Supports optional deterministic flaky-test detection via --historical-runs-path.
Supports a local config file via --config-path and single-stream stdin input via -.
Supports --offline-only to disable remote PR creation and network calls.
Supports a rollout profile via --profile safe-github-rollout.
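
A wrapper script could act on the documented exit codes and stdout summary roughly as follows. The `interpret_cli_result` helper and the returned dictionary shape are hypothetical, and the summary keys used in the example are placeholders, since the stdout schema is not specified here.

```python
import json


def interpret_cli_result(returncode: int, stdout: str) -> dict:
    """Map the documented exit codes to a simple result record."""
    if returncode == 0:
        # Completed or partial analysis: stdout carries the JSON summary.
        return {"ok": True, "summary": json.loads(stdout)}
    if returncode == 2:
        return {"ok": False, "reason": "runtime or input error"}
    return {"ok": False, "reason": f"unexpected exit code {returncode}"}


# In practice returncode and stdout would come from subprocess.run(...).
print(interpret_cli_result(0, '{"classification": "LINT"}'))
print(interpret_cli_result(2, ""))
```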

Demo Fixtures

Run three reproducible demo scenarios:

for case in \
  fixtures/demos/01-dependency-lockfile-drift \
  fixtures/demos/02-typecheck-ts2345 \
  fixtures/demos/03-infra-timeout
do
  name="$(basename "$case")"
  ci-rootcause \
    --log-path "$case/ci.log" \
    --diff-path "$case/change.diff" \
    --output-dir "artifacts/demo/$name" \
    --timestamp 2026-02-21T00:00:00Z \
    --commit abc123 \
    --run-id "demo_${name}" \
    --base-commit abc122 \
    --head-commit abc123 \
    --repository owner/repo
done

Demo fixture pack:

fixtures/demos/README.md
fixtures/demos/01-dependency-lockfile-drift
fixtures/demos/02-typecheck-ts2345
fixtures/demos/03-infra-timeout

Live GitHub Integration Test

Live PR creation/idempotency validation is available as an opt-in integration test:

scripts/run_live_github_test.sh \
  --repo-path /path/to/disposable/repo \
  --repository owner/repo \
  --token ghp_xxx \
  --target-branch main

Notes:

Test is skipped unless CI_ROOTCAUSE_LIVE_GITHUB=1.
Use a disposable repository with push and PR permissions.
Script prints a cleanup checklist after the test run.

Current Limits

Current benchmark coverage is curated and intentionally small.
Classification is based on deterministic patterns and may miss unseen signatures.
Fix generation is intentionally conservative and validation-gated.
App mode currently targets GitHub Actions workflow_run events.
CI rerun orchestration is not included.
Automatic merge and branch-protection bypass are not supported.

Roadmap

Expand the curated CI failure dataset.
Add deeper language-specific analyzers.
Improve dependency drift and lockfile diagnostics.
Add stronger diff-to-failure linking for multi-file changes.
Add staging/deployment packaging for the GitHub App service.
Expand CI provider support beyond GitHub Actions.

Contributing

Contribution standards are documented in CONTRIBUTING.md.
