Disable prefix caching + chunked prefill in spec_compression_stress.sh #35

Merged
YWHyuk merged 1 commit into main from claude/stress-disable-prefix-chunked
May 18, 2026

@YWHyuk YWHyuk commented May 18, 2026

Summary

Pins --no-enable-prefix-caching and --no-enable-chunked-prefill for every mode in serving/spec_compression_stress.sh.

Why

The simulator's CLI defines both as BooleanOptionalAction with default=True (serving/__main__.py:112,115). The stress script never opted out, so all three modes (baseline, self_verify, cpu_verify) were implicitly running with radix-attention prefix reuse and chunked prefill enabled. That is not what the comparison is designed to isolate: the stress test is meant to measure how spec-decode and periodic KV compression interact on a lean prefill/decode pipeline, and leaving prefix caching and chunked prefill on confounds the result.
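For reviewers less familiar with argparse, here is a minimal sketch of how a BooleanOptionalAction flag with default=True behaves. It is not the simulator's actual parser: only the two flag names come from this PR, while the `prog` name and the `build_parser` helper are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the simulator CLI; only the flag names
    # and their default=True are taken from the PR description.
    parser = argparse.ArgumentParser(prog="serving")
    parser.add_argument(
        "--enable-prefix-caching",
        action=argparse.BooleanOptionalAction,  # also registers --no-enable-prefix-caching
        default=True,
    )
    parser.add_argument(
        "--enable-chunked-prefill",
        action=argparse.BooleanOptionalAction,  # also registers --no-enable-chunked-prefill
        default=True,
    )
    return parser


parser = build_parser()

# What the stress script did before this PR: no flags, so both defaults apply.
implicit = parser.parse_args([])

# What the stress script does after this PR: both features pinned off.
pinned = parser.parse_args(
    ["--no-enable-prefix-caching", "--no-enable-chunked-prefill"]
)

print(implicit.enable_prefix_caching, implicit.enable_chunked_prefill)
print(pinned.enable_prefix_caching, pinned.enable_chunked_prefill)
```

With no arguments both attributes stay True, which is why omitting the flags in the script silently enabled both features. Passing the `--no-` variants is the only way to turn them off.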

Change

Two lines added to the run_mode cmd block so every dispatched mode inherits the flags:

--no-enable-prefix-caching
--no-enable-chunked-prefill

Plus one banner line so the choice shows up in the run log alongside the workload / spec / compression summary.

Test plan

  • ./serving/spec_compression_stress.sh baseline (or ./... no-arg) prints Scheduler: prefix-caching=off chunked-prefill=off in the banner.
  • Each per-mode log shows the simulator booting with prefix caching disabled and chunked prefill disabled (visible at INFO log level).
  • Per-request CSVs differ from the previous (caching-on) baseline for workloads where requests share prefixes — confirms the flag actually took effect.
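The first and third checks above could be automated along the following lines. This is a rough sketch under stated assumptions, not part of the PR: the banner string is the one quoted in the test plan, but the helper names, the CSV column names, and the sample contents are invented for illustration and do not reflect the script's real output format.

```python
import csv
import io

# Banner text quoted in the test plan.
EXPECTED_BANNER = "Scheduler: prefix-caching=off chunked-prefill=off"


def banner_present(log_text: str) -> bool:
    """Return True if any line of the run log contains the scheduler banner."""
    return any(EXPECTED_BANNER in line for line in log_text.splitlines())


def csv_rows(text: str) -> list[list[str]]:
    """Parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


def csvs_differ(old_text: str, new_text: str) -> bool:
    """Return True if the two per-request CSVs differ in any row."""
    return csv_rows(old_text) != csv_rows(new_text)


# Made-up sample inputs; real logs and CSVs will look different.
sample_log = "Workload: example\n" + EXPECTED_BANNER + "\n"
old_csv = "request_id,latency_ms\n0,10\n1,4\n"   # hypothetical caching-on run
new_csv = "request_id,latency_ms\n0,10\n1,10\n"  # hypothetical caching-off run

print(banner_present(sample_log))
print(csvs_differ(old_csv, new_csv))
```

A row-level comparison only shows that something changed. It does not prove that the change came from the two flags, so it works best on a workload where requests are known to share prefixes.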

Generated by Claude Code

The simulator defaults --enable-prefix-caching and
--enable-chunked-prefill to true. The stress script never opted out,
so all three modes (baseline / self_verify / cpu_verify) were
implicitly comparing radix-attention + chunked-prefill paths instead
of the lean prefill/decode pipeline the comparison is designed to
isolate.

Pin both off so the run results reflect only the spec-decode and
KV-compression deltas the stress is meant to study. Print the choice
in the run banner so it's visible in the log.
@YWHyuk YWHyuk merged commit c35e905 into main May 18, 2026
2 checks passed