feat(benchmark): add three-file shard variant to Django benchmark #124

Merged
greynewell merged 1 commit into supermodeltools:main from jonathanpopham:feat/benchmark-threefile on Apr 13, 2026

jonathanpopham (Contributor) commented Apr 13, 2026

Adds a third Docker container to Grey's benchmark harness:

| Container | Format | Files per source |
|---|---|---|
| bench-naked | none | 0 |
| bench-supermodel | .graph | 1 |
| bench-threefile | .calls/.deps/.impact | 3 |

run.sh now builds and runs all three sequentially.

New files:

  • Dockerfile.threefile — builds from source with --three-file flag
  • entrypoint.threefile.sh — runs supermodel analyze --three-file before claude
  • CLAUDE.threefile.md — agent prompt explaining the three-file format

Run: ANTHROPIC_API_KEY=... SUPERMODEL_API_KEY=... ./benchmark/run.sh
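
A minimal sketch of the sequential build-and-run loop described above; it is not the actual run.sh. The container names and `Dockerfile.threefile` come from this PR. The other Dockerfile names, the build context (`.`), and the `DRY_RUN` switch are assumptions made for illustration, and the real script may pass different keys to each container.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Variants named in the PR description; the per-variant Dockerfile naming
# (benchmark/Dockerfile.<variant>) is assumed, only Dockerfile.threefile is confirmed.
VARIANTS=(naked supermodel threefile)
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print commands only

run_cmd() {
  # Print the command; execute it only when DRY_RUN=0.
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

run_variant() {
  local name="$1"
  run_cmd docker build -f "benchmark/Dockerfile.$name" -t "bench-$name" .
  # "-e NAME" with no value forwards the variable from the calling environment,
  # so the key values never appear in the printed command line.
  run_cmd docker run --rm -e ANTHROPIC_API_KEY -e SUPERMODEL_API_KEY "bench-$name"
}

# Build and run each container in order, one at a time.
for v in "${VARIANTS[@]}"; do
  run_variant "$v"
done
```

With the default `DRY_RUN=1` the sketch only prints the six docker commands, which makes it easy to check the order before spending API credits.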

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive benchmarking framework for analyzing source code repositories using Claude Code integration.
    • New documentation guide explaining three-file shard workflow methodology for efficient code analysis.
  • Chores

    • Extended benchmark infrastructure with Docker containerization and orchestration scripts for automated benchmark execution.

…gle-graph

Adds a third Docker container (bench-threefile) that runs supermodel
analyze --three-file to generate .calls/.deps/.impact shards instead
of single .graph files.

run.sh now builds and runs all three containers sequentially:
1. bench-naked (baseline, no graph data)
2. bench-supermodel (single .graph files)
3. bench-threefile (three separate .calls/.deps/.impact files)

Benchmark data shows three-file format is 68% faster than baseline
and significantly outperforms single .graph on Medusa (68K nodes).

coderabbitai bot commented Apr 13, 2026

Walkthrough

A new benchmarking pipeline is added for testing Claude Code analysis against Django using supermodel's three-file shard feature. The changes include documentation, Docker containerization, an orchestration script, and updates to the main benchmark runner to execute this new variant alongside existing benchmarks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Three-file benchmark documentation and container setup: benchmark/CLAUDE.threefile.md, benchmark/Dockerfile.threefile, benchmark/entrypoint.threefile.sh | Adds documentation explaining the three-file shard workflow (.calls.py, .deps.py, .impact.py files), a two-stage Dockerfile that compiles supermodel and installs Claude Code in a Python 3.12 environment, and a bash entrypoint script that orchestrates the benchmark: runs initial Django tests, executes supermodel analysis, invokes Claude with tool hooks for external commands, re-runs tests, and extracts cost metrics from Claude logs. |
| Benchmark orchestration updates: benchmark/run.sh | Extends the main benchmark runner to build and execute the new bench-threefile Docker container, passing API credentials, capturing results to a separate output file, and including it in the final benchmark comparison. |

Sequence Diagram

sequenceDiagram
    participant User
    participant run.sh as Benchmark Runner
    participant Docker as Docker Container
    participant Tests as Django Tests
    participant supermodel as supermodel CLI
    participant Claude as Claude API
    participant Log as Output Files

    User->>run.sh: Execute benchmark
    run.sh->>Docker: Build Dockerfile.threefile
    run.sh->>Docker: Run container (pass API keys)
    
    Docker->>Tests: Run initial test suite
    Tests-->>Log: Record failures
    
    Docker->>supermodel: Run analyze --three-file /app
    supermodel-->>Docker: Generate shards (calls, deps, impact)
    Docker-->>Log: Save supermodel output
    
    Docker->>Claude: Execute claude with task.md
    Claude->>Claude: Use tool hooks (supermodel hook)
    Claude-->>Log: Stream results + save raw logs
    
    Docker->>Tests: Re-run test suite
    Tests-->>Log: Record final results
    
    Docker->>Log: Extract cost metrics
    Docker-->>Log: Print COST SUMMARY
    
    run.sh->>Log: Compare all benchmark results
    run.sh-->>User: Display comparison

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • greynewell

Poem

🔧 Three files shard the Django tome,
Claude now works from shards at home,
Hooks and harnesses align,
Benchmarks test the three-file line,
Cost extracted, wisdom mined! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | The description covers the key changes with a helpful table and lists new files, but lacks structured sections and testing guidance from the template. | Consider adding explicit sections (What/Why/Test plan) from the template, including whether make test and make lint were run, and what manual testing was performed. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly describes the main change: adding a three-file shard variant to the Django benchmark harness, which matches the PR's primary objective. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
benchmark/Dockerfile.threefile (1)

18-21: Add --no-install-recommends to apt-get install calls for a leaner image.

Lines 18 and 20 install extra recommended packages by default. This bloats the Docker image with unnecessary dependencies that aren't actually needed for the benchmark to run. By adding --no-install-recommends, you're telling apt: "give me just what I asked for, nothing extra."

Think of it like ordering a burger—you only want the burger, not the upselling of fries, drink, and a toy. Same idea here.

♻️ Suggested fix
-RUN apt-get update && apt-get install -y curl ca-certificates git && \
+RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates git && \
     curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
-    apt-get install -y nodejs && \
+    apt-get install -y --no-install-recommends nodejs && \
     rm -rf /var/lib/apt/lists/*
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/Dockerfile.threefile` around lines 18 - 21, Update the RUN
instruction that performs apt-get install to use --no-install-recommends: modify
the apt-get install invocations in the RUN line (the segments that install "curl
ca-certificates git" and "nodejs") to include --no-install-recommends so apt-get
install -y --no-install-recommends curl ca-certificates git and apt-get install
-y --no-install-recommends nodejs, keeping the existing apt-get update, node
setup step, and cleanup (rm -rf /var/lib/apt/lists/*) intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark/entrypoint.threefile.sh`:
- Line 54: The grep used to extract cost ("grep
'\"costUSD\"\\|\"total_cost_usd\"' /tmp/claude_raw.txt") is too narrow and
misses variants like cost_usd or different casing; update the command in the
entrypoint.threefile.sh pipeline to use an extended, case-insensitive regex that
matches common variants (e.g. costUSD, cost_usd, total_cost_usd, totalCostUsd) —
for example replace the grep with grep -E -i and a pattern like
"cost(_|)usd|total(_|)cost(_|)usd" so the cost extraction in that line reliably
finds all expected key names.

In `@benchmark/run.sh`:
- Line 76: The call to "$SCRIPT_DIR/compare.sh" on line with
"$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
"$RESULTS_DIR/threefile.txt" passes three files but compare.sh only reads $1 and
$2, so include the third file by invoking compare.sh for each pair instead of a
single three-arg call; update run.sh to call "$SCRIPT_DIR/compare.sh" with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/supermodel.txt"), then again with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/threefile.txt") and/or
("$RESULTS_DIR/supermodel.txt","$RESULTS_DIR/threefile.txt") as needed so all
comparisons are produced without changing compare.sh.

---

Nitpick comments:
In `@benchmark/Dockerfile.threefile`:
- Around line 18-21: Update the RUN instruction that performs apt-get install to
use --no-install-recommends: modify the apt-get install invocations in the RUN
line (the segments that install "curl ca-certificates git" and "nodejs") to
include --no-install-recommends so apt-get install -y --no-install-recommends
curl ca-certificates git and apt-get install -y --no-install-recommends nodejs,
keeping the existing apt-get update, node setup step, and cleanup (rm -rf
/var/lib/apt/lists/*) intact.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 76587135-b5ee-4204-842c-df8c2042404b

📥 Commits

Reviewing files that changed from the base of the PR and between ba66dcd and 7e3a43c.

📒 Files selected for processing (4)
  • benchmark/CLAUDE.threefile.md
  • benchmark/Dockerfile.threefile
  • benchmark/entrypoint.threefile.sh
  • benchmark/run.sh

echo "============================================================"
echo "COST SUMMARY"
echo "============================================================"
grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"

⚠️ Potential issue | 🟡 Minor

Line [54] cost extraction is narrower than the existing parser logic.

This can miss valid outputs (for example cost_usd), which makes the summary look empty even when cost exists.

🛠️ Suggested robustness update
-grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
+grep '"costUSD"\|"cost_usd"\|"total_cost_usd"\|Total cost' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
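
One related detail may be worth checking. In `grep ... | tail -3 || echo "(check log)"`, the `||` tests the exit status of the last command in the pipeline (`tail`) unless `pipefail` is set, so the fallback message may never print when `grep` matches nothing. Below is a sketch of a variant that makes the fallback explicit. The `cost_summary` function name and the sample log line are invented for illustration, and the pattern is the suggested one rewritten with `-E`.

```shell
#!/usr/bin/env bash

# Print the last three cost lines from a log, or a fallback if none match.
cost_summary() {
  local log="$1" matches
  # An unreadable file or zero matches both leave $matches empty.
  matches="$(grep -E '"(costUSD|cost_usd|total_cost_usd)"|Total cost' "$log" 2>/dev/null | tail -3)"
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
  else
    echo "(check log)"
  fi
}

# Demo with a throwaway log file (contents are made-up sample data).
tmp="$(mktemp)"
printf '%s\n' '{"type":"result","total_cost_usd":0.42}' > "$tmp"
cost_summary "$tmp"   # prints the matching line
: > "$tmp"            # truncate to an empty log
cost_summary "$tmp"   # prints the fallback
rm -f "$tmp"
```

Testing the captured output directly avoids depending on which command's exit status the shell reports for the pipeline.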
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/entrypoint.threefile.sh` at line 54, The grep used to extract cost
("grep '\"costUSD\"\\|\"total_cost_usd\"' /tmp/claude_raw.txt") is too narrow
and misses variants like cost_usd or different casing; update the command in the
entrypoint.threefile.sh pipeline to use an extended, case-insensitive regex that
matches common variants (e.g. costUSD, cost_usd, total_cost_usd, totalCostUsd) —
for example replace the grep with grep -E -i and a pattern like
"cost(_|)usd|total(_|)cost(_|)usd" so the cost extraction in that line reliably
finds all expected key names.
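
The example pattern in the prompt uses empty alternatives (`(_|)`), which POSIX leaves undefined for extended regular expressions; `_?` expresses the same optional underscore portably. A minimal sketch, assuming the log contains JSON-style quoted keys. The `COST_KEY_RE` and `extract_cost_lines` names and the sample lines are invented for illustration.

```shell
#!/usr/bin/env bash

# Case-insensitive match for quoted cost keys such as costUSD, cost_usd,
# total_cost_usd, and totalCostUsd. The surrounding quotes keep unrelated
# words that merely contain "cost" from matching.
COST_KEY_RE='"(total_?)?cost_?usd"'

extract_cost_lines() {
  # Keep only the last three matching lines, as the original command does.
  grep -E -i "$COST_KEY_RE" "$1" 2>/dev/null | tail -3
}

# Demo on made-up sample data.
log="$(mktemp)"
cat > "$log" <<'EOF'
{"costUSD": 0.10}
{"cost_usd": 0.20}
{"total_cost_usd": 0.30}
{"totalCostUsd": 0.40}
{"costume": "none"}
EOF
extract_cost_lines "$log"
rm -f "$log"
```

The first four sample keys match and the `costume` line does not; `tail -3` then keeps the last three of those matches.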

 echo
 echo "==> Comparing results..."
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"

⚠️ Potential issue | 🟠 Major

Line [76] passes a third log, but compare.sh currently only compares two inputs.

Right now benchmark/compare.sh reads only $1 and $2, so threefile.txt won’t actually be included in the report. That makes the final comparison incomplete.

💡 Quick, low-risk fix in benchmark/run.sh (until compare.sh is extended)
 echo
 echo "==> Comparing results..."
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"
+echo "==> Comparing naked vs supermodel..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+echo
+echo "==> Comparing naked vs threefile..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/threefile.txt"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/run.sh` at line 76, The call to "$SCRIPT_DIR/compare.sh" on line
with "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
"$RESULTS_DIR/threefile.txt" passes three files but compare.sh only reads $1 and
$2, so include the third file by invoking compare.sh for each pair instead of a
single three-arg call; update run.sh to call "$SCRIPT_DIR/compare.sh" with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/supermodel.txt"), then again with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/threefile.txt") and/or
("$RESULTS_DIR/supermodel.txt","$RESULTS_DIR/threefile.txt") as needed so all
comparisons are produced without changing compare.sh.
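
The pairwise approach can also be sketched as a loop that compares each variant against the baseline. This assumes `compare.sh` takes exactly two result files, as the comment states. The `COMPARE` override, the default directory values, and the `compare_against_baseline` name are hypothetical, added so the loop can be exercised without the real script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Defaults are placeholders; run.sh defines its own SCRIPT_DIR and RESULTS_DIR.
SCRIPT_DIR="${SCRIPT_DIR:-./benchmark}"
RESULTS_DIR="${RESULTS_DIR:-./results}"
# Hypothetical override so the loop can be tried without compare.sh present.
COMPARE="${COMPARE:-$SCRIPT_DIR/compare.sh}"

# Compare the first named result file against each of the remaining ones.
compare_against_baseline() {
  local baseline="$1"; shift
  local variant
  for variant in "$@"; do
    echo
    echo "==> Comparing $baseline vs $variant..."
    "$COMPARE" "$RESULTS_DIR/$baseline.txt" "$RESULTS_DIR/$variant.txt"
  done
}

# Usage (not executed here, since it needs real result files):
#   compare_against_baseline naked supermodel threefile
```

Adding a fourth variant later would then only mean appending one more name to the call, without touching `compare.sh`.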

@greynewell greynewell merged commit 5552bb6 into supermodeltools:main Apr 13, 2026
3 of 7 checks passed