Skip to content

[ab-advisor] Experiment campaign for go-logger: A/B test max_turns #36847

@github-actions

Description

@github-actions

🧪 Experiment Campaign: go-logger

Workflow file: .github/workflows/go-logger.md
Selected dimension: max_turns
Triggered by: ab-testing-advisor on 2026-06-04


Background

The go-logger workflow analyzes Go source files in pkg/ and adds meaningful debug logging statements, capping each run at 5 files per PR. Its dense prompt involves multi-step discovery, per-file analysis, editing, build validation, and PR creation — a workflow with substantial sequential depth. Testing max_turns is warranted because the current timeout-minutes: 15 cap may be masking turn exhaustion on complex files, or conversely, generous turn budgets may allow the agent to over-deliberate on simple logging decisions, wasting tokens without quality gains.

Hypothesis

H0 (null): Changing the maximum turn budget does not affect the number of files successfully logged per run, output quality (checklist compliance), or token cost.

H1 (alternative): A tighter turn budget (conservative) reduces token cost by 20–30% with minimal quality loss on straightforward files, while a generous budget (generous) improves checklist compliance on complex files but at higher cost.

Experiment Configuration

Add the following experiments: block to the workflow frontmatter:

experiments:
  max_turns_budget:
    variants: [conservative, standard, generous]
    description: "Tests whether the agent turn budget affects logging quality and token cost. Conservative caps deliberation early; generous allows deeper analysis of complex files."
    hypothesis: "H0: no change in files-logged-per-run or checklist compliance across turn budgets. H1: conservative reduces cost 20-30% with <5% quality loss; generous improves complex-file compliance by >10%."
    metric: files_successfully_logged_per_run
    secondary_metrics: [token_cost_per_file, checklist_compliance_rate, run_duration_ms]
    guardrail_metrics:
      - name: empty_pr_rate
        direction: min
        threshold: 0.10
      - name: build_failure_rate
        direction: min
        threshold: 0.05
    min_samples: 15
    weight: [33, 34, 33]
    start_date: "2026-06-04"
    analysis_type: mann_whitney
    tags: [cost-efficiency, agent-turns, go-logging]
    notify:
      discussion: 0
    issue: 0

Variant descriptions:

  • conservative: Sets max_turns: 20 — tight budget forcing the agent to commit to logging decisions quickly without excessive deliberation. Expected to reduce token usage by 20–30% on simple files.
  • standard: Sets max_turns: 35 — baseline matching current observed behavior given the 15-minute timeout and typical file complexity.
  • generous: Sets max_turns: 55 — expanded budget allowing deeper analysis of complex multi-function files, expected to improve checklist compliance but at higher token cost.

Workflow Changes Required

The max_turns field must be added to the agent: configuration block in the compiled lock file. The experiment controls this via conditional blocks in the workflow markdown.

Before (no max_turns configured, runtime default applies):

engine: claude
timeout-minutes: 15

After (add to frontmatter, inside the agent prompt region):

engine: claude
timeout-minutes: 15
{{#if experiments.max_turns_budget == "conservative" }}
max_turns: 20
{{else if experiments.max_turns_budget == "generous" }}
max_turns: 55
{{else}}
max_turns: 35
{{/if}}
View full frontmatter diff
  engine: claude
  name: Go Logger Enhancement
  timeout-minutes: 15
+ {{#if experiments.max_turns_budget == "conservative" }}
+ max_turns: 20
+ {{else if experiments.max_turns_budget == "generous" }}
+ max_turns: 55
+ {{else}}
+ max_turns: 35
+ {{/if}}
  tools:

Success Metrics

Metric Type Target
files_successfully_logged_per_run Primary ≥ 4 of 5 files across all variants
token_cost_per_file Secondary conservative ≤ 80% of standard cost
checklist_compliance_rate Secondary generous ≥ 105% of standard
run_duration_ms Secondary Signal only — no target
empty_pr_rate Guardrail Must stay < 10% per variant
build_failure_rate Guardrail Must stay < 5% per variant

Statistical Design

  • Variants: conservative, standard, generous
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Minimum runs per variant: 15 (45 total runs across all variants)
  • Expected experiment duration: ~45 days (daily schedule, 1 run/day, 3-way split)
  • Analysis approach: Mann-Whitney U test (non-parametric; run counts and durations are not normally distributed)

Implementation Steps

  • Add experiments: section to frontmatter (YAML above)
  • Add {{#if experiments.max_turns_budget == "<variant>" }} conditional blocks around max_turns: in the workflow
  • Run gh aw compile go-logger to regenerate the lock file
  • Monitor experiment artifact uploaded per run to /tmp/gh-aw/agent/experiments/state.json
  • After 45+ runs (15 per variant), analyze variant distribution via workflow run artifacts
  • Document findings and promote winning variant; update max_turns to winning value and remove experiment block

Infrastructure Note

✅ The analysis_type, tags, and notify experiment schema fields are fully implemented end-to-end in both pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs. No infrastructure sub-issue is required.

References

Generated by 🧪 Daily A/B Testing Advisor · sonnet46 1.5M ·

  • expires on Jun 18, 2026, 5:45 AM UTC

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions