feat(benchmark): add three-file shard variant to Django benchmark #124
…gle-graph
Adds a third Docker container (bench-threefile) that runs supermodel analyze --three-file to generate .calls/.deps/.impact shards instead of single .graph files.
run.sh now builds and runs all three containers sequentially:
1. bench-naked (baseline, no graph data)
2. bench-supermodel (single .graph files)
3. bench-threefile (three separate .calls/.deps/.impact files)
Benchmark data shows three-file format is 68% faster than baseline and significantly outperforms single .graph on Medusa (68K nodes).
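The sequential build-and-run flow described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the contents of run.sh: the helper name `run_variant`, the `DOCKER` override, the build context `.`, and the `bench-<name>` image tag are assumptions made for the example. Only the container names, the two API key variables, and the per-variant result files come from the PR.

```shell
#!/usr/bin/env bash
# Sketch only: run_variant is a hypothetical helper, not taken from run.sh.
# DOCKER can be overridden (for example with a stub) to dry-run the flow.
run_variant() {
  local name="$1" dockerfile="$2" results_dir="$3"

  # Build the image for this variant (build context "." is an assumption).
  "${DOCKER:-docker}" build -f "$dockerfile" -t "bench-${name}" .

  # Run it, forwarding the API keys from the host environment, and keep a
  # copy of the output as <results_dir>/<name>.txt for later comparison.
  "${DOCKER:-docker}" run --rm \
    -e ANTHROPIC_API_KEY -e SUPERMODEL_API_KEY \
    "bench-${name}" | tee "${results_dir}/${name}.txt"
}
```

A caller would invoke it once per variant, in order, for example `run_variant threefile benchmark/Dockerfile.threefile "$RESULTS_DIR"`.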
Walkthrough
A new benchmarking pipeline is added for testing Claude Code analysis against Django using supermodel's three-file shard feature. The changes include documentation, Docker containerization, an orchestration script, and updates to the main benchmark runner to execute this new variant alongside existing benchmarks.
Sequence Diagram
sequenceDiagram
participant User
participant run.sh as Benchmark Runner
participant Docker as Docker Container
participant Tests as Django Tests
participant supermodel as supermodel CLI
participant Claude as Claude API
participant Log as Output Files
User->>run.sh: Execute benchmark
run.sh->>Docker: Build Dockerfile.threefile
run.sh->>Docker: Run container (pass API keys)
Docker->>Tests: Run initial test suite
Tests-->>Log: Record failures
Docker->>supermodel: Run analyze --three-file /app
supermodel-->>Docker: Generate shards (calls, deps, impact)
Docker-->>Log: Save supermodel output
Docker->>Claude: Execute claude with task.md
Claude->>Claude: Use tool hooks (supermodel hook)
Claude-->>Log: Stream results + save raw logs
Docker->>Tests: Re-run test suite
Tests-->>Log: Record final results
Docker->>Log: Extract cost metrics
Docker-->>Log: Print COST SUMMARY
run.sh->>Log: Compare all benchmark results
run.sh-->>User: Display comparison
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
benchmark/Dockerfile.threefile (1)
18-21: Add --no-install-recommends to apt-get install calls for a leaner image.
Lines 18 and 20 install recommended packages by default, which bloats the Docker image with dependencies the benchmark does not need. Adding --no-install-recommends tells apt: "give me just what I asked for, nothing extra."
Think of it like ordering a burger: you only want the burger, not the upsell of fries, a drink, and a toy. Same idea here.
♻️ Suggested fix
-RUN apt-get update && apt-get install -y curl ca-certificates git && \
+RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates git && \
     curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
-    apt-get install -y nodejs && \
+    apt-get install -y --no-install-recommends nodejs && \
     rm -rf /var/lib/apt/lists/*
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmark/Dockerfile.threefile` around lines 18 - 21, Update the RUN instruction that performs apt-get install to use --no-install-recommends: modify the apt-get install invocations in the RUN line (the segments that install "curl ca-certificates git" and "nodejs") to include --no-install-recommends so apt-get install -y --no-install-recommends curl ca-certificates git and apt-get install -y --no-install-recommends nodejs, keeping the existing apt-get update, node setup step, and cleanup (rm -rf /var/lib/apt/lists/*) intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 76587135-b5ee-4204-842c-df8c2042404b
📒 Files selected for processing (4)
- benchmark/CLAUDE.threefile.md
- benchmark/Dockerfile.threefile
- benchmark/entrypoint.threefile.sh
- benchmark/run.sh
echo "============================================================"
echo "COST SUMMARY"
echo "============================================================"
grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
Line [54] cost extraction is narrower than the existing parser logic.
This can miss valid outputs (for example cost_usd), which makes the summary look empty even when cost exists.
🛠️ Suggested robustness update
-grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
+grep '"costUSD"\|"cost_usd"\|"total_cost_usd"\|Total cost' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
+grep '"costUSD"\|"cost_usd"\|"total_cost_usd"\|Total cost' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmark/entrypoint.threefile.sh` at line 54, The grep used to extract cost
("grep '\"costUSD\"\\|\"total_cost_usd\"' /tmp/claude_raw.txt") is too narrow
and misses variants like cost_usd or different casing; update the command in the
entrypoint.threefile.sh pipeline to use an extended, case-insensitive regex that
matches common variants (e.g. costUSD, cost_usd, total_cost_usd, totalCostUsd) —
for example replace the grep with grep -E -i and a pattern like
"cost(_|)usd|total(_|)cost(_|)usd" so the cost extraction in that line reliably
finds all expected key names.
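The broader, case-insensitive match proposed above could look roughly like the sketch below. It assumes the raw log is line-oriented text containing JSON-style keys; the helper name `extract_cost` is hypothetical and not part of the PR. The sketch also moves the fallback out of the pipeline: in `grep … | tail -3 || echo …`, the `||` tests the exit status of `tail` (unless `pipefail` is set), so the "(check log)" message would not be printed when grep finds nothing.

```shell
#!/usr/bin/env bash
# Sketch only: extract_cost is a hypothetical helper, not part of the PR.
extract_cost() {
  local log="$1" matches

  # -E: extended regex; -i: case-insensitive. The pattern covers costUSD,
  # cost_usd, total_cost_usd and totalCostUsd, plus a plain "Total cost" line.
  matches=$(grep -E -i '"(total_?)?cost_?usd"|total cost' "$log" 2>/dev/null \
    | tail -n 3 || true)

  # Decide on the fallback from the captured text, not from the pipeline's
  # exit status, so an empty result is reported reliably.
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
  else
    echo "(check log)"
  fi
}
```

In the entrypoint this would be called as `extract_cost /tmp/claude_raw.txt` after the COST SUMMARY banner.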
 echo
 echo "==> Comparing results..."
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"
Line [76] passes a third log, but compare.sh currently only compares two inputs.
Right now benchmark/compare.sh reads only $1 and $2, so threefile.txt won’t actually be included in the report. That makes the final comparison incomplete.
💡 Quick, low-risk fix in benchmark/run.sh (until compare.sh is extended)
echo
echo "==> Comparing results..."
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"
+echo "==> Comparing naked vs supermodel..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+echo
+echo "==> Comparing naked vs threefile..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/threefile.txt"
📝 Committable suggestion
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"
+echo
+echo "==> Comparing results..."
+echo "==> Comparing naked vs supermodel..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+echo
+echo "==> Comparing naked vs threefile..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/threefile.txt"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmark/run.sh` at line 76, The call to "$SCRIPT_DIR/compare.sh" on line
with "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
"$RESULTS_DIR/threefile.txt" passes three files but compare.sh only reads $1 and
$2, so include the third file by invoking compare.sh for each pair instead of a
single three-arg call; update run.sh to call "$SCRIPT_DIR/compare.sh" with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/supermodel.txt"), then again with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/threefile.txt") and/or
("$RESULTS_DIR/supermodel.txt","$RESULTS_DIR/threefile.txt") as needed so all
comparisons are produced without changing compare.sh.
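The pairwise approach described above can also be written as a loop, so that adding a fourth variant later needs no new call sites. This is a sketch under assumptions: the wrapper name `compare_all` and the `COMPARE` override are hypothetical, and compare.sh is assumed to keep its current two-argument interface.

```shell
#!/usr/bin/env bash
# Sketch only: compare_all is a hypothetical wrapper around the existing
# two-argument compare.sh. COMPARE can point at a different comparer.
compare_all() {
  local results_dir="$1" baseline="$2"
  shift 2

  local variant
  for variant in "$@"; do
    echo "==> Comparing ${baseline} vs ${variant}..."
    # One baseline-vs-variant comparison per remaining argument.
    "${COMPARE:-./benchmark/compare.sh}" \
      "${results_dir}/${baseline}.txt" "${results_dir}/${variant}.txt"
    echo
  done
}
```

For this PR the call would be `compare_all "$RESULTS_DIR" naked supermodel threefile`, which produces the naked vs supermodel and naked vs threefile reports.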
Adds a third Docker container to Grey's benchmark harness:
run.sh now builds and runs all three sequentially.
New files:
- Dockerfile.threefile: builds from source with --three-file flag
- entrypoint.threefile.sh: runs supermodel analyze --three-file before claude
- CLAUDE.threefile.md: agent prompt explaining the three-file format
Run:
ANTHROPIC_API_KEY=... SUPERMODEL_API_KEY=... ./benchmark/run.sh