docs(benchmarks): TB 2.1 MiniMax-M3 v0.33.0 re-run (46/89 · 51.7%)#113

Merged: Fullstop000 merged 1 commit into master from docs/bench-tb21-mm3-v033 on Jun 3, 2026


@Fullstop000
Owner

Second TB 2.1 MiniMax-M3 baseline, this time on the binary that includes PR #97's stream-drop auto-retry. Compared against PR #98's 06-02 baseline:

| | v0.32.0 (06-02) | v0.33.0 (06-03) | Δ |
|---|---|---|---|
| Score | 42/89 · 47.2% | 46/89 · 51.7% | +4.5 pp |
| Resolved% | 57.5% | 53.5% | -4.0 pp |
| Errored | 16 (all ConnectionDropped) | 3 (2 ConnectionDropped + 1 VerifierTimeoutError) | -13 |
| Failed | 31 (16 reject + 15 TimedOut) | 40 (28 reject + 12 TimedOut) | +9 |
| Input tok | 47.9M | 91.9M | +92% |
| Cache hit | 86.1% | 93.8% | +7.7 pp |

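PR #97 cut ConnectionDropped from 16 to 2 (-87.5%), which was the goal of that change. Resolved% dipped because the recovered trials weren't all easy ones: trials that reached a verdict (passed + failed) grew from 73 to 86 while passes grew only from 42 to 46, so the denominator grew faster than the numerator.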
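The two percentages in the table can be reproduced from the pass/fail/error counts. The sketch below assumes Resolved% means passed / (passed + failed), i.e. errored trials are excluded from the denominator. The PR does not state that definition, but it matches both reported values. The `Run` class and its field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    passed: int
    failed: int
    errored: int

    @property
    def total(self) -> int:
        return self.passed + self.failed + self.errored

    @property
    def score(self) -> float:
        # Passed over all trials, errored ones included.
        return self.passed / self.total

    @property
    def resolved(self) -> float:
        # Assumed definition: passed over trials that reached a verdict.
        return self.passed / (self.passed + self.failed)


# Counts taken from the comparison table above.
v032 = Run(passed=42, failed=31, errored=16)
v033 = Run(passed=46, failed=40, errored=3)

for name, run in (("v0.32.0", v032), ("v0.33.0", v033)):
    print(f"{name}: score {run.score:.1%}, resolved {run.resolved:.1%}")
```

Moving 13 trials out of the errored bucket adds 13 to the Resolved% denominator but only 4 to its numerator, which is why Score rises while Resolved% falls.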

Token cost is the bill for auto-retry: replays send context again on each attempt. Cache-hit rose to 93.8%, so most of the replayed tokens were cache reads.
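A rough way to see how much of the extra input volume was actually uncached, as a sketch only. It assumes "Cache hit" is the token-weighted fraction of input tokens served from cache, which the PR does not state; `uncached_input_tokens` is a hypothetical helper, not part of the repo.

```python
def uncached_input_tokens(input_tokens_m: float, cache_hit: float) -> float:
    """Input tokens (in millions) not served from cache.

    Assumes cache_hit is a token-weighted fraction in [0, 1].
    """
    return input_tokens_m * (1.0 - cache_hit)


# (input tokens in millions, cache-hit rate) from the comparison table.
runs = {"v0.32.0": (47.9, 0.861), "v0.33.0": (91.9, 0.938)}

for name, (tokens_m, hit) in runs.items():
    uncached = uncached_input_tokens(tokens_m, hit)
    print(f"{name}: ~{uncached:.1f}M uncached of {tokens_m}M input")
```

Under that assumption the uncached volume comes out at roughly 6.7M for v0.32.0 and 5.7M for v0.33.0, so the near-doubling of input tokens would be almost entirely cache reads.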

Files

The Vercel HTML report is already live at https://ignis-bench-reports.vercel.app/reports/tb21-minimax-m3-20260603 (separate repo).

Test plan

  • aggregate_results.py → 46p/40f/3e, sub-types TimedOut:12 / ConnectionDropped:2 / VerifierTimeoutError:1
  • HTML report generated + secret-scanned clean
  • CI green
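For readers unfamiliar with the `46p/40f/3e` shorthand, here is a minimal sketch of how such a summary could be tallied from per-trial outcomes. This is not the repo's `aggregate_results.py`; the `(status, subtype)` record shape and the `summarize` function are assumptions made for illustration.

```python
from collections import Counter


def summarize(outcomes):
    """Tally (status, subtype) records into a 'Np/Nf/Ne' line plus sub-type counts.

    status is one of "passed", "failed", "errored"; subtype may be None.
    """
    outcomes = list(outcomes)  # allow any iterable, since we scan it twice
    status = Counter(s for s, _ in outcomes)
    subtypes = Counter(sub for _, sub in outcomes if sub is not None)
    line = f'{status["passed"]}p/{status["failed"]}f/{status["errored"]}e'
    return line, subtypes


# Synthetic records built from the counts reported in this PR.
outcomes = (
    [("passed", None)] * 46
    + [("failed", None)] * 28          # plain verifier rejects
    + [("failed", "TimedOut")] * 12
    + [("errored", "ConnectionDropped")] * 2
    + [("errored", "VerifierTimeoutError")]
)

line, subtypes = summarize(outcomes)
print(line, dict(subtypes))
```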

Second TB 2.1 MiniMax-M3 run, this time on a binary that includes PR #97's
stream-drop auto-retry. Headline: ConnectionDropped count fell from 16 to 2
(-87.5%) and score moved +4.5 pp (47.2% → 51.7%). Resolved% slipped 4.0 pp
because the recovered trials were mostly non-trivial, so they enlarged the
denominator faster than the numerator.
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.17%. Comparing base (df8c591) to head (a62ecaa).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #113   +/-   ##
=======================================
  Coverage   74.17%   74.17%           
=======================================
  Files          65       65           
  Lines       16798    16798           
=======================================
  Hits        12460    12460           
  Misses       4338     4338           


@Fullstop000 Fullstop000 merged commit 4602d93 into master Jun 3, 2026
5 checks passed