25 changes: 25 additions & 0 deletions benchmark/CLAUDE.threefile.md
@@ -0,0 +1,25 @@
# Django Source — supermodel three-file shards enabled

This is the Django framework source. The auth package is at `django/contrib/auth/`.

## Graph shard files

`supermodel analyze --three-file` has run on this repo. Every source file has
three shard files with pre-computed context:

- `.calls.py` — function call relationships (who calls what, with file and line number)
- `.deps.py` — import dependencies (what this file imports and what imports it)
- `.impact.py` — blast radius (risk level, affected domains, direct/transitive dependents)

**Read the shard files before the source file.** They show you the full
picture in far fewer tokens. For example:

- Wondering what `django/contrib/auth/__init__.py` calls and who calls it?
→ read `django/contrib/auth/__init__.calls.py`
- Need to know what this module depends on?
→ read `django/contrib/auth/__init__.deps.py`
- Want to assess blast radius before changing something?
→ read `django/contrib/auth/__init__.impact.py`

Use the shard files to navigate efficiently. Only drop into the source when you
need implementation details the shards don't cover.
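
As a rough illustration of the naming rule above, this shell sketch derives the three shard paths for a given source file. The helper name `shard_paths` is hypothetical, and the suffix rule (replace the trailing `.py` with `.calls.py`, `.deps.py`, and `.impact.py`) is inferred from the `__init__.py` examples rather than taken from supermodel's documentation.

```shell
# Hypothetical helper: print the shard paths that sit next to a Python source file.
# Assumes the naming pattern shown in the examples above.
shard_paths() {
  src="$1"
  base="${src%.py}"   # strip the trailing .py
  for kind in calls deps impact; do
    printf '%s.%s.py\n' "$base" "$kind"
  done
}

shard_paths django/contrib/auth/__init__.py
```

Reading the three printed paths first, then the source file only if needed, follows the order recommended above.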
46 changes: 46 additions & 0 deletions benchmark/Dockerfile.threefile
@@ -0,0 +1,46 @@
# Benchmark container: Claude Code + supermodel --three-file on django/django
# Build from repo root: docker build -f benchmark/Dockerfile.threefile -t bench-threefile .

# Stage 1: Build supermodel binary
FROM golang:alpine AS supermodel-builder
ENV GOTOOLCHAIN=auto
WORKDIR /build
COPY . .
RUN go build \
    -ldflags="-s -w -X github.com/supermodeltools/cli/internal/build.Version=benchmark" \
    -o /build/supermodel \
    .

# Stage 2: Runtime
FROM python:3.12-slim

# System deps + Node.js 20
RUN apt-get update && apt-get install -y curl ca-certificates git && \
    curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
    apt-get install -y nodejs && \
    rm -rf /var/lib/apt/lists/*

# Install Claude Code + supermodel
RUN npm install -g @anthropic-ai/claude-code
COPY --from=supermodel-builder /build/supermodel /usr/local/bin/supermodel

# Clone Django source at a fixed tag
RUN git clone --depth=1 --branch 5.0.6 \
    https://github.com/django/django.git /app

# Install Django in editable mode
RUN pip install --no-cache-dir -e /app

# Drop in the change_tracking test app
COPY benchmark/change_tracking/ /app/tests/change_tracking/

# Copy task + CLAUDE.md
COPY benchmark/task.md /benchmark/task.md
COPY benchmark/CLAUDE.threefile.md /app/CLAUDE.md

# Non-root user
RUN useradd -m -s /bin/bash bench && chown -R bench:bench /app /benchmark
USER bench

COPY benchmark/entrypoint.threefile.sh /entrypoint.sh
ENTRYPOINT ["/bin/bash", "/entrypoint.sh"]
54 changes: 54 additions & 0 deletions benchmark/entrypoint.threefile.sh
@@ -0,0 +1,54 @@
#!/bin/bash
set -euo pipefail

RUN_TESTS="python tests/runtests.py --settings=test_sqlite change_tracking"

echo "============================================================"
echo "BENCHMARK: Claude Code + supermodel --three-file — django/django"
echo "============================================================"
echo

echo "--- Initial test run (all 8 should FAIL/ERROR) ---"
cd /app
PYTHONPATH=tests $RUN_TESTS -v 0 2>&1 | tail -3 || true
echo

echo "--- Running supermodel analyze --three-file ---"
supermodel analyze --three-file /app 2>&1 | tee /tmp/supermodel_analyze.txt
echo

echo "--- Wiring supermodel hook ---"
mkdir -p ~/.claude
cat > ~/.claude/settings.json <<'JSON'
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [{ "type": "command", "command": "supermodel hook" }]
      }
    ]
  }
}
JSON

echo "--- Running Claude Code on task ---"
cd /app
claude \
  --print "$(cat /benchmark/task.md)" \
  --dangerously-skip-permissions \
  --output-format stream-json \
  --verbose \
  2>&1 | tee /tmp/claude_raw.txt

echo
echo "============================================================"
echo "TEST RESULTS"
echo "============================================================"
PYTHONPATH=tests $RUN_TESTS -v 2 2>&1 | tee /tmp/test_results.txt || true

echo
echo "============================================================"
echo "COST SUMMARY"
echo "============================================================"
grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"

⚠️ Potential issue | 🟡 Minor

Line [54] cost extraction is narrower than the existing parser logic.

This can miss valid outputs (for example cost_usd), which makes the summary look empty even when cost exists.

🛠️ Suggested robustness update
-grep '"costUSD"\|"total_cost_usd"' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
+grep '"costUSD"\|"cost_usd"\|"total_cost_usd"\|Total cost' /tmp/claude_raw.txt 2>/dev/null | tail -3 || echo "(check log)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/entrypoint.threefile.sh` at line 54, The grep used to extract cost
("grep '\"costUSD\"\\|\"total_cost_usd\"' /tmp/claude_raw.txt") is too narrow
and misses variants like cost_usd or different casing; update the command in the
entrypoint.threefile.sh pipeline to use an extended, case-insensitive regex that
matches common variants (e.g. costUSD, cost_usd, total_cost_usd, totalCostUsd) —
for example replace the grep with grep -E -i and a pattern like
"cost(_|)usd|total(_|)cost(_|)usd" so the cost extraction in that line reliably
finds all expected key names.
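
A minimal sketch of the broader extraction described above, run against a made-up sample rather than real Claude Code output. The function name `extract_cost` and the sample lines are hypothetical. The pattern `cost_?usd` is a slightly more portable spelling of `cost(_|)usd`, and it also covers `total_cost_usd` because that key contains `cost_usd`.

```shell
# Hypothetical sample of stream-json lines; the key names are illustrative only.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
{"type":"result","total_cost_usd":0.42}
{"type":"result","costUSD":0.10}
{"type":"result","cost_usd":0.05}
{"type":"assistant","text":"no cost here"}
EOF

# Print up to the last three cost lines, or a fallback message when none match.
extract_cost() {
  matches="$(grep -E -i 'cost_?usd' "$1" 2>/dev/null | tail -3 || true)"
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
  else
    echo "(check log)"
  fi
}

extract_cost "$sample"
```

The original one-liner reaches its `|| echo "(check log)"` fallback only because the script sets `pipefail`. Testing for empty output, as sketched here, does not depend on that option.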

17 changes: 16 additions & 1 deletion benchmark/run.sh
@@ -38,6 +38,13 @@ docker build \
  "$REPO_ROOT" \
  2>&1 | tail -3

echo "==> Building bench-threefile (three-file shard format)..."
docker build \
  -f "$SCRIPT_DIR/Dockerfile.threefile" \
  -t bench-threefile \
  "$REPO_ROOT" \
  2>&1 | tail -3

echo

# ── Run containers ────────────────────────────────────────────────────────────
@@ -56,6 +63,14 @@ docker run --rm \
  bench-supermodel \
  2>&1 | tee "$RESULTS_DIR/supermodel.txt"

echo
echo "==> Running three-file container..."
docker run --rm \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  -e SUPERMODEL_API_KEY="$SUPERMODEL_API_KEY" \
  bench-threefile \
  2>&1 | tee "$RESULTS_DIR/threefile.txt"

echo
echo "==> Comparing results..."
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"

⚠️ Potential issue | 🟠 Major

Line [76] passes a third log, but compare.sh currently only compares two inputs.

Right now benchmark/compare.sh reads only $1 and $2, so threefile.txt won’t actually be included in the report. That makes the final comparison incomplete.

💡 Quick, low-risk fix in benchmark/run.sh (until compare.sh is extended)
 echo
 echo "==> Comparing results..."
-"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt" "$RESULTS_DIR/threefile.txt"
+echo "==> Comparing naked vs supermodel..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/supermodel.txt"
+echo
+echo "==> Comparing naked vs threefile..."
+"$SCRIPT_DIR/compare.sh" "$RESULTS_DIR/naked.txt" "$RESULTS_DIR/threefile.txt"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/run.sh` at line 76, the call to "$SCRIPT_DIR/compare.sh" passes
three files ("$RESULTS_DIR/naked.txt", "$RESULTS_DIR/supermodel.txt",
"$RESULTS_DIR/threefile.txt"), but compare.sh only reads $1 and $2, so the third
file is ignored. Invoke compare.sh once per pair instead of making a single
three-argument call: update run.sh to call "$SCRIPT_DIR/compare.sh" with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/supermodel.txt"), then again with
("$RESULTS_DIR/naked.txt","$RESULTS_DIR/threefile.txt") and/or
("$RESULTS_DIR/supermodel.txt","$RESULTS_DIR/threefile.txt") as needed, so all
comparisons are produced without changing compare.sh.
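
The pairwise approach could be sketched as a small loop. This is only an illustration: `compare_pair` is a stand-in for `"$SCRIPT_DIR/compare.sh"` (its real output is not shown in this PR), and `compare_all` is a hypothetical helper name.

```shell
# Stand-in for "$SCRIPT_DIR/compare.sh"; like the real script, it reads only $1 and $2.
compare_pair() {
  echo "compare: $1 vs $2"
}

# Compare the baseline log against each candidate log, one pair at a time.
compare_all() {
  results_dir="$1"
  shift
  for candidate in "$@"; do
    echo "==> Comparing naked vs $candidate..."
    compare_pair "$results_dir/naked.txt" "$results_dir/$candidate.txt"
  done
}

compare_all /tmp/results supermodel threefile
```

With a loop like this, adding another benchmark variant only means appending its name to the argument list, and compare.sh keeps its two-argument interface.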
