This repo tests one hypothesis: good skills and context engineering can help weaker models perform closer to stronger ones.
- Read `.codex/skills/tdd/SKILL.md` to understand the TDD behavior being tested.
- Read `.codex/skills/eval-result/SKILL.md` to understand how coder evidence is packaged.
- Read `.codex/agents/*.yaml` to see which model each coder agent uses.
- Read `.codex/skills/skill-eval/references/rubric.md` so you know how results are judged (a quick way to skim all four is sketched below).
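If you want to skim all four from a shell, something like this works (paths are the ones listed above):

```sh
# Orient yourself before running anything:
cat .codex/skills/tdd/SKILL.md                      # the TDD behavior under test
cat .codex/skills/eval-result/SKILL.md              # how coder evidence is packaged
cat .codex/agents/*.yaml                            # model assignment per coder agent
cat .codex/skills/skill-eval/references/rubric.md   # how results are judged
```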
- Codex with agent/subagent support.
- Git with a clean baseline commit (see the sketch after this list).
- Basic tooling for your chosen stack, for example JS/npm, Python, or Go.
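A clean baseline just means the eval starts from a committed, unmodified tree. A minimal sketch, assuming you are at the repo root:

```sh
# Make sure nothing is uncommitted before the eval runs:
git status --short        # should print nothing
# If you have pending changes, snapshot them as the baseline:
git add -A
git commit -m "baseline before skill-eval run"
```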
- Ask Codex: `Use $env-setup to prepare a TDD benchmark for <stack/problem>.`
- Answer the setup questions for stack, problem, commands, constraints, and acceptance criteria.
- Confirm `prompt` and `.codex/tdd-setup.json` were created.
- Commit or snapshot the clean baseline (see the sketch after this list).
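One way to double-check the setup step before committing; `prompt` and `.codex/tdd-setup.json` are the files named above, and the commit message is only a suggestion:

```sh
# Verify $env-setup produced its two artifacts, then snapshot:
test -f prompt && test -f .codex/tdd-setup.json \
  && echo "setup artifacts present" \
  || echo "setup incomplete: rerun \$env-setup"
git add -A
git commit -m "tdd benchmark setup"
```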
- Ask Codex: `Use $skill-eval to start eval.`
- Ask Codex: `Use $skill-eval to show report.`
`$skill-eval` runs junior, mid, and senior coder agents in isolated worktrees. Each coder uses `$tdd` and `$eval-result`; the evaluator scores outputs against the senior baseline.
To iterate, tune `$tdd` based on report feedback, commit the skill change, discard the generated run state, and rerun the flow, as sketched below.
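A sketch of one loop iteration; the run-state location is hypothetical, so check where `$skill-eval` writes its worktrees and reports in your checkout:

```sh
# 1. Tune the skill and commit the change:
$EDITOR .codex/skills/tdd/SKILL.md
git add .codex/skills/tdd/SKILL.md
git commit -m "tdd: tighten instructions based on report feedback"
# 2. Discard generated run state (runs/ is a hypothetical path):
git worktree prune
git clean -fd runs/
# 3. Rerun: ask Codex "Use $skill-eval to start eval." again.
```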
Note: this is experimental and built for Codex. It has not been verified on other agent systems.