feat(benchmark): add three-file shard variant to Django benchmark by jonathanpopham · Pull Request #124 · supermodeltools/cli

jonathanpopham · 2026-04-13T19:41:06Z

Adds a third Docker container to Grey's benchmark harness:

Container	Format	Files per source
bench-naked	none	0
bench-supermodel	.graph	1
bench-threefile	.calls/.deps/.impact	3

run.sh now builds and runs all three sequentially.

New files:

Dockerfile.threefile — builds from source with --three-file flag
entrypoint.threefile.sh — runs supermodel analyze --three-file before claude
CLAUDE.threefile.md — agent prompt explaining the three-file format

Run: ANTHROPIC_API_KEY=... SUPERMODEL_API_KEY=... ./benchmark/run.sh
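
A minimal sketch of how a runner could build and run the three variants in sequence. It is not the actual run.sh. The helper names `run`, `bench_variant` and `bench_all`, the `DRY_RUN` switch, the `Dockerfile.naked` and `Dockerfile.supermodel` file names, and the build context `.` are assumptions for illustration. Only the container names, `Dockerfile.threefile`, and the two API key variables come from this PR.

```shell
set -eu

# Print the command instead of executing it when DRY_RUN=1 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build one variant's image, then run it and save its output.
# Dockerfile.<variant> and the build context "." are assumed names.
bench_variant() {
  local variant="$1" results_dir="$2"
  run docker build -f "benchmark/Dockerfile.$variant" -t "bench-$variant" .
  # "-e NAME" with no value forwards the host's value into the container.
  run docker run --rm -e ANTHROPIC_API_KEY -e SUPERMODEL_API_KEY \
    "bench-$variant" >"$results_dir/$variant.txt"
}

# Run the baseline first, then the two graph variants, one after another.
bench_all() {
  local results_dir="$1" variant
  mkdir -p "$results_dir"
  for variant in naked supermodel threefile; do
    bench_variant "$variant" "$results_dir"
  done
}
```

Calling `DRY_RUN=1 bench_all ./results` prints the build commands and writes the run commands into the result files without touching Docker, which is a cheap way to check the ordering before spending API credits.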

Summary by CodeRabbit

Release Notes

New Features
- Added comprehensive benchmarking framework for analyzing source code repositories using Claude Code integration.
- New documentation guide explaining three-file shard workflow methodology for efficient code analysis.
Chores
- Extended benchmark infrastructure with Docker containerization and orchestration scripts for automated benchmark execution.

…gle-graph

Adds a third Docker container (bench-threefile) that runs supermodel analyze --three-file to generate .calls/.deps/.impact shards instead of single .graph files.

run.sh now builds and runs all three containers sequentially:

1. bench-naked (baseline, no graph data)
2. bench-supermodel (single .graph files)
3. bench-threefile (three separate .calls/.deps/.impact files)

Benchmark data shows the three-file format is 68% faster than baseline and significantly outperforms single .graph on Medusa (68K nodes).

coderabbitai · 2026-04-13T19:41:20Z

Walkthrough

A new benchmarking pipeline is added for testing Claude Code analysis against Django using supermodel's three-file shard feature. The changes include documentation, Docker containerization, an orchestration script, and updates to the main benchmark runner to execute this new variant alongside existing benchmarks.

Changes

Cohort / File(s)	Summary
Three-file benchmark documentation and container setup `benchmark/CLAUDE.threefile.md`, `benchmark/Dockerfile.threefile`, `benchmark/entrypoint.threefile.sh`	Adds documentation explaining the three-file shard workflow (`.calls.py`, `.deps.py`, `.impact.py` files), a two-stage Dockerfile that compiles supermodel and installs Claude Code in a Python 3.12 environment, and a bash entrypoint script that orchestrates the benchmark: runs initial Django tests, executes supermodel analysis, invokes Claude with tool hooks for external commands, re-runs tests, and extracts cost metrics from Claude logs.
Benchmark orchestration updates `benchmark/run.sh`	Extends the main benchmark runner to build and execute the new `bench-threefile` Docker container, passing API credentials, capturing results to a separate output file, and including it in the final benchmark comparison.

Sequence Diagram

sequenceDiagram
    participant User
    participant run.sh as Benchmark Runner
    participant Docker as Docker Container
    participant Tests as Django Tests
    participant supermodel as supermodel CLI
    participant Claude as Claude API
    participant Log as Output Files

    User->>run.sh: Execute benchmark
    run.sh->>Docker: Build Dockerfile.threefile
    run.sh->>Docker: Run container (pass API keys)
    
    Docker->>Tests: Run initial test suite
    Tests-->>Log: Record failures
    
    Docker->>supermodel: Run analyze --three-file /app
    supermodel-->>Docker: Generate shards (calls, deps, impact)
    Docker-->>Log: Save supermodel output
    
    Docker->>Claude: Execute claude with task.md
    Claude->>Claude: Use tool hooks (supermodel hook)
    Claude-->>Log: Stream results + save raw logs
    
    Docker->>Tests: Re-run test suite
    Tests-->>Log: Record final results
    
    Docker->>Log: Extract cost metrics
    Docker-->>Log: Print COST SUMMARY
    
    run.sh->>Log: Compare all benchmark results
    run.sh-->>User: Display comparison

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

feat: add --three-file flag to generate .calls/.deps/.impact shards #120: Implements the --three-file shard analysis feature that this PR's benchmark infrastructure relies on and exercises.

Suggested reviewers

greynewell

Poem

🔧 Three files shard the Django tome,
Claude now works from shards at home,
Hooks and harnesses align,
Benchmarks test the three-file line,
Cost extracted, wisdom mined! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The description covers the key changes with a helpful table and lists new files, but lacks structured sections and testing guidance from the template.	Consider adding explicit sections (What/Why/Test plan) from the template, including whether make test and make lint were run, and what manual testing was performed.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly describes the main change: adding a three-file shard variant to the Django benchmark harness, which matches the PR's primary objective.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

benchmark/Dockerfile.threefile (1)
18-21: Add --no-install-recommends to apt-get install calls for a leaner image.

Lines 18 and 20 install extra recommended packages by default. This bloats the Docker image with unnecessary dependencies that aren't actually needed for the benchmark to run. By adding --no-install-recommends, you're telling apt: "give me just what I asked for, nothing extra."

Think of it like ordering a burger—you only want the burger, not the upselling of fries, drink, and a toy. Same idea here.
♻️ Suggested fix
-RUN apt-get update && apt-get install -y curl ca-certificates git && \
+RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates git && \
     curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
-    apt-get install -y nodejs && \
+    apt-get install -y --no-install-recommends nodejs && \
     rm -rf /var/lib/apt/lists/*
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/Dockerfile.threefile` around lines 18 - 21, Update the RUN
instruction that performs apt-get install to use --no-install-recommends: modify
the apt-get install invocations in the RUN line (the segments that install "curl
ca-certificates git" and "nodejs") to include --no-install-recommends so apt-get
install -y --no-install-recommends curl ca-certificates git and apt-get install
-y --no-install-recommends nodejs, keeping the existing apt-get update, node
setup step, and cleanup (rm -rf /var/lib/apt/lists/*) intact.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark/entrypoint.threefile.sh`:
- Line 54: The grep used to extract cost ("grep
'\"costUSD\"\\|\"total_cost_usd\"' /tmp/claude_raw.txt") is too narrow and
misses variants like cost_usd or different casing; update the command in the
entrypoint.threefile.sh pipeline to use an extended, case-insensitive regex that
matches common variants (e.g. costUSD, cost_usd, total_cost_usd, totalCostUsd) —
for example replace the grep with grep -E -i and a pattern like
"cost(_|)usd|total(_|)cost(_|)usd" so the cost extraction in that line reliably
finds all expected key names.

In `@benchmark/run.sh`:
- Line 76: The call to "$SCRIPT_DIR/compare.sh" on line with
"$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
"$RESULTS_DIR/threefile.txt" passes three files but compare.sh only reads $1 and
$2, so include the third file by invoking compare.sh for each pair instead of a
single three-arg call; update run.sh to call "$SCRIPT_DIR/compare.sh" with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/supermodel.txt"), then again with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/threefile.txt") and/or
("$RESULTS_DIR/supermodel.txt","$RESULTS_DIR/threefile.txt") as needed so all
comparisons are produced without changing compare.sh.

---

Nitpick comments:
In `@benchmark/Dockerfile.threefile`:
- Around line 18-21: Update the RUN instruction that performs apt-get install to
use --no-install-recommends: modify the apt-get install invocations in the RUN
line (the segments that install "curl ca-certificates git" and "nodejs") to
include --no-install-recommends so apt-get install -y --no-install-recommends
curl ca-certificates git and apt-get install -y --no-install-recommends nodejs,
keeping the existing apt-get update, node setup step, and cleanup (rm -rf
/var/lib/apt/lists/*) intact.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 76587135-b5ee-4204-842c-df8c2042404b

📥 Commits

Reviewing files that changed from the base of the PR and between ba66dcd and 7e3a43c.

📒 Files selected for processing (4)

benchmark/CLAUDE.threefile.md
benchmark/Dockerfile.threefile
benchmark/entrypoint.threefile.sh
benchmark/run.sh

coderabbitai · 2026-04-13T19:45:28Z

benchmark/entrypoint.threefile.sh

+echo "============================================================"
+echo "COST SUMMARY"
+echo "============================================================"
+grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"


⚠️ Potential issue | 🟡 Minor

Line [54] cost extraction is narrower than the existing parser logic.

This can miss valid outputs (for example cost_usd), which makes the summary look empty even when cost exists.

🛠️ Suggested robustness update

-grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
+grep '"costUSD"\|"cost_usd"\|"total_cost_usd"\|Total cost' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
+grep '"costUSD"\|"cost_usd"\|"total_cost_usd"\|Total cost' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmark/entrypoint.threefile.sh` at line 54, The grep used to extract cost ("grep '\"costUSD\"\\|\"total_cost_usd\"' /tmp/claude_raw.txt") is too narrow and misses variants like cost_usd or different casing; update the command in the entrypoint.threefile.sh pipeline to use an extended, case-insensitive regex that matches common variants (e.g. costUSD, cost_usd, total_cost_usd, totalCostUsd) — for example replace the grep with grep -E -i and a pattern like "cost(_|)usd|total(_|)cost(_|)usd" so the cost extraction in that line reliably finds all expected key names.
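
A minimal sketch of the cost extraction as a small function, offered as one possible shape and not as the code in entrypoint.threefile.sh. The function name `extract_cost` is hypothetical, the key names are the ones listed in this comment, and the log format is assumed. One detail it works around: in `grep ... | tail -3 || echo "(check log)"`, the `||` tests the exit status of `tail`, the last command in the pipeline, so without `pipefail` the fallback does not fire when `grep` finds nothing.

```shell
# Print up to the last three cost lines from a log file, or a fallback note.
# Key names are taken from the review comment; the log layout is an assumption.
extract_cost() {
  local log="$1" matches
  # -E enables alternation; the quoted keys match JSON fields, "Total cost" matches plain text.
  matches="$(grep -E '"(costUSD|cost_usd|total_cost_usd)"|Total cost' "$log" 2>/dev/null | tail -n 3 || true)"
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
  else
    # Reached when the file is missing, empty, or has no cost lines.
    echo "(check log)"
  fi
}
```

Usage would be `extract_cost /tmp/claude_raw.txt`. Testing the captured text instead of the pipeline status keeps the fallback independent of shell options.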

coderabbitai · 2026-04-13T19:45:29Z

benchmark/run.sh

 echo
 echo "==> Comparing results..."
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"


⚠️ Potential issue | 🟠 Major

Line [76] passes a third log, but compare.sh currently only compares two inputs.

Right now benchmark/compare.sh reads only $1 and $2, so threefile.txt won’t actually be included in the report. That makes the final comparison incomplete.

💡 Quick, low-risk fix in benchmark/run.sh (until compare.sh is extended)

 echo
 echo "==> Comparing results..."
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"
+echo "==> Comparing naked vs supermodel..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+echo
+echo "==> Comparing naked vs threefile..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/threefile.txt"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 echo
 echo "==> Comparing results..."
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"
+echo "==> Comparing naked vs supermodel..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+echo
+echo "==> Comparing naked vs threefile..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/threefile.txt"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmark/run.sh` at line 76, The call to "$SCRIPT_DIR/compare.sh" on line with "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt" passes three files but compare.sh only reads $1 and $2, so include the third file by invoking compare.sh for each pair instead of a single three-arg call; update run.sh to call "$SCRIPT_DIR/compare.sh" with ("$RESULTS_DIR/naked.txt","$RESULTS_DIR/supermodel.txt"), then again with ("$RESULTS_DIR/naked.txt","$RESULTS_DIR/threefile.txt") and/or ("$RESULTS_DIR/supermodel.txt","$RESULTS_DIR/threefile.txt") as needed so all comparisons are produced without changing compare.sh.
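
The pairwise approach described in this comment can also be written as a loop, sketched here under assumptions. The function name `compare_against_baseline` and its argument order are hypothetical; the comparison program is passed in as the first argument so that a two-argument script such as benchmark/compare.sh can be used unchanged.

```shell
# Compare each listed variant against a baseline, one pair per call.
# $1: two-argument comparison command, $2: results directory,
# $3: baseline name, remaining arguments: variant names.
compare_against_baseline() {
  local compare_cmd="$1" results_dir="$2" baseline="$3" variant
  shift 3
  for variant in "$@"; do
    echo
    echo "==> Comparing $baseline vs $variant..."
    "$compare_cmd" "$results_dir/$baseline.txt" "$results_dir/$variant.txt"
  done
}
```

A call such as `compare_against_baseline "$SCRIPT_DIR/compare.sh" "$RESULTS_DIR" naked supermodel threefile` would produce the two comparisons from the suggested change, and adding a fourth variant later only means appending one more name.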

jonathanpopham requested a review from greynewell as a code owner April 13, 2026 19:41

coderabbitai bot reviewed Apr 13, 2026

greynewell merged commit 5552bb6 into supermodeltools:main Apr 13, 2026
3 of 7 checks passed

greynewell merged 1 commit into supermodeltools:main from jonathanpopham:feat/benchmark-threefile
