Cache get_config by model_name #38

Merged: YWHyuk merged 1 commit into main from claude/cache-get-config on May 19, 2026
@YWHyuk YWHyuk commented May 19, 2026

Summary

get_config() (serving/core/utils.py:58) opens and parses configs/model/<model_name>.json on every call, with no caching. calculate_sizes() (serving/core/memory_model.py:797) calls get_config() unconditionally as its first line and is itself called ~20-30 times per scheduler iteration (per layer: once directly from _emit_layer and once indirectly via XPURooflineModel._traffic_bytes). As a result, each iteration re-read the same Llama-3.1-8B.json (or whichever model is configured) from disk many times.

This patch adds a small module-level _config_cache keyed by model_name, mirroring the existing _arch_cache / _perf_db_cache pattern in serving/core/trace_generator.py (PRs #36 / #37 fixed the same shape of bug for the architecture yaml and chakra subprocess).

Measurement

Setup: NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128 ./serving/spec_compression_stress.sh baseline (129 scheduler iterations, Llama-3.1-8B, --analytical-modeling).

Wall-clock (5 runs each, after warmup):

| | Mean | Range |
|---|---|---|
| Before (main) | 1.57 s | 1.51 - 1.65 s |
| After | 1.36 s | 1.31 - 1.44 s |

-> -210 ms (-13%) on this workload.
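The headline figure follows directly from the two reported means. The sketch below reproduces that arithmetic and shows one way a "5 runs after warmup" measurement could be collected. The time_runs helper is hypothetical and is not the harness used for this PR, which timed the shell script end to end.

```python
import statistics
import time
from typing import Callable


def time_runs(fn: Callable[[], object], runs: int = 5, warmup: int = 1) -> dict:
    """Time fn over several runs, discarding warmup runs first (hypothetical helper)."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {"mean": statistics.mean(samples), "min": min(samples), "max": max(samples)}


def relative_change(before: float, after: float) -> float:
    """Signed fractional change from before to after."""
    return (after - before) / before


# Means reported in this PR, in seconds.
before_mean, after_mean = 1.57, 1.36
delta_s = after_mean - before_mean                      # about -0.21 s
pct = relative_change(before_mean, after_mean) * 100    # about -13.4 %
```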

pyinstrument breakdown of generate_trace (the affected path):

| Call path | Before | After |
|---|---|---|
| _build_transformer_block | 158 ms | 93 ms |
| _emit_pre_attn_layers._emit_layer | 78 ms (~half in calculate_sizes / get_config) | 43 ms |
| _resolve_layer_latency -> _traffic_bytes -> calculate_sizes | 33 ms | 20 ms |
| _emit_final_layers | ~50 ms | ~10 ms |

Total clocks (ns) are bit-identical before and after (144012083 on the 33-iter run, 579297238 on the 129-iter run), as are Mean TTFT / TPOT / ITL.

The cache key is model_name only and the JSON on disk is read-only during a simulation run, so there is no invalidation concern. Multi-instance runs that share a model share the parsed dict.
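One consequence of sharing the parsed dict is that every caller holds the same object, so the "read-only" assumption also has to hold in memory, not just on disk. The sketch below illustrates this with a stand-in loader. The names get_config_cached and fake_loader are hypothetical, and the MappingProxyType wrapper is shown only as an optional guard, not as something this PR adds.

```python
from types import MappingProxyType

_config_cache = {}


def get_config_cached(model_name, loader):
    """Hypothetical stand-in for a cached get_config; loader replaces the file read."""
    if model_name not in _config_cache:
        _config_cache[model_name] = loader(model_name)
    return _config_cache[model_name]


def fake_loader(model_name):
    # Illustrative values only, not taken from any real config file.
    return {"model": model_name, "num_hidden_layers": 32}


# Two "instances" asking for the same model receive the same object.
cfg_a = get_config_cached("toy-model", fake_loader)
cfg_b = get_config_cached("toy-model", fake_loader)
shared = cfg_a is cfg_b

# Optional guard: a read-only view rejects top-level writes with TypeError.
# Nested dicts or lists inside the config would still be mutable.
read_only = MappingProxyType(cfg_a)
try:
    read_only["num_hidden_layers"] = 1
    rejected = False
except TypeError:
    rejected = True
```

If any call site mutates the returned config, the change would now be visible to all later callers, so such call sites would need to copy the dict first.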

Test plan

  • NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32 ./serving/spec_compression_stress.sh baseline -- Total clocks = 144012083 unchanged across 3 runs
  • NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128 ./serving/spec_compression_stress.sh baseline -- Total clocks = 579297238 unchanged across 2 runs; -13% wall-clock
  • Spot-check self_verify / cpu_verify modes
  • Spot-check a multi-instance / DP run -- two instances loading the same model now share one parsed config dict

Generated by Claude Code

get_config() opens and parses configs/model/<model_name>.json on every
call with no caching. calculate_sizes() (memory_model.py:797) calls
get_config() unconditionally as its first line, and is itself called
~20-30 times per scheduler iteration (per layer, once directly from
_emit_layer and once indirectly via XPURooflineModel._traffic_bytes).

For a 129-iter --analytical-modeling baseline on Llama-3.1-8B, this
showed up in the pyinstrument profile as ~100 ms across the
calculate_sizes / _resolve_layer_latency call sites — same pattern as
the _load_architecture (PR #36) and inline-chakra (PR #37) fixes.

Adds a small module-level _config_cache mirroring _arch_cache /
_perf_db_cache in trace_generator.py.

Wall-clock on 129-iter baseline (5 runs, mean ± stdev):
  before: 1.57 s ± 0.06
  after:  1.36 s ± 0.05    (-210 ms, -13%)

Sim-time output (Total clocks, Mean TTFT/TPOT/ITL) unchanged.
@YWHyuk YWHyuk merged commit 83b8857 into main May 19, 2026
2 checks passed