Cache get_config by model_name by YWHyuk · Pull Request #38 · PSAL-POSTECH/LLMServingSimSpec

YWHyuk · 2026-05-19T05:20:53Z

Summary

get_config() (serving/core/utils.py:58) opens and parses configs/model/<model_name>.json on every call with no caching. calculate_sizes() (serving/core/memory_model.py:797) calls get_config() unconditionally as its first line, and is itself called ~20-30 times per scheduler iteration: for each layer, once directly from _emit_layer and once indirectly via XPURooflineModel._traffic_bytes. As a result, every iteration re-read and re-parsed the same Llama-3.1-8B.json (or whichever model is configured) from disk many times.

This patch adds a small module-level _config_cache keyed by model_name, mirroring the existing _arch_cache / _perf_db_cache pattern in serving/core/trace_generator.py (PRs #36 / #37 fixed the same shape of bug for the architecture yaml and chakra subprocess).

Measurement

Setup: NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128 ./serving/spec_compression_stress.sh baseline (129 scheduler iterations, Llama-3.1-8B, --analytical-modeling).

Wall-clock (5 runs each, after warmup):

| Build | Mean | Range |
|---|---|---|
| Before (main) | 1.57 s | 1.51 - 1.65 s |
| After | 1.36 s | 1.31 - 1.44 s |

Result: -210 ms (-13%) wall-clock on this workload.
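For anyone reproducing the numbers, a harness along these lines would do. It is only a sketch: the helper names `time_command`, `summarize`, and `relative_change` are made up for illustration, and the PR does not say which tool produced the timings above.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=5, warmup=1, env=None):
    """Run cmd (warmup + runs) times; return wall-clock seconds for the measured runs only."""
    samples = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        elapsed = time.perf_counter() - start
        if i >= warmup:  # discard warmup iterations
            samples.append(elapsed)
    return samples


def summarize(samples):
    """Mean, range, and sample standard deviation (needs at least two samples)."""
    return {
        "mean": statistics.mean(samples),
        "min": min(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples),
    }


def relative_change(before, after):
    """Fractional change from before to after; negative means faster."""
    return (after - before) / before
```

With the means reported here, `relative_change(1.57, 1.36)` is about -0.134, which rounds to the -13% quoted. To time the actual workload, `cmd` would be `["./serving/spec_compression_stress.sh", "baseline"]` with NUM_REQ, PROMPT_LEN, and OUTPUT_LEN set in `env`.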

pyinstrument breakdown of generate_trace (the affected path):

| Call site | Before | After |
|---|---|---|
| `_build_transformer_block` | 158 ms | 93 ms |
| ↳ `_emit_pre_attn_layers._emit_layer` | 78 ms (~half in `calculate_sizes` / `get_config`) | 43 ms |
| ↳ `_resolve_layer_latency` -> `_traffic_bytes` -> `calculate_sizes` | 33 ms | 20 ms |
| `_emit_final_layers` | ~50 ms | ~10 ms |

Total clocks (ns) (144012083 on the 33-iter run, 579297238 on the 129-iter run) and Mean TTFT / TPOT / ITL are bit-identical before and after.

The cache key is model_name only and the JSON on disk is read-only during a simulation run, so there is no invalidation concern. Multi-instance runs that share a model share the parsed dict.

Test plan

NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32 ./serving/spec_compression_stress.sh baseline -- Total clocks = 144012083 unchanged across 3 runs
NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128 ./serving/spec_compression_stress.sh baseline -- Total clocks = 579297238 unchanged across 2 runs; -13% wall-clock
Spot-check self_verify / cpu_verify modes
Spot-check a multi-instance / DP run -- two instances loading the same model now share one parsed config dict

Generated by Claude Code

get_config() opens and parses configs/model/<model_name>.json on every call with no caching. calculate_sizes() (memory_model.py:797) calls get_config() unconditionally as its first line, and is itself called ~20-30 times per scheduler iteration (per layer, once directly from _emit_layer and once indirectly via XPURooflineModel._traffic_bytes).

For a 129-iter --analytical-modeling baseline on Llama-3.1-8B, this showed up in the pyinstrument profile as ~100 ms across the calculate_sizes / _resolve_layer_latency call sites -- same pattern as the _load_architecture (PR #36) and inline-chakra (PR #37) fixes.

Adds a small module-level _config_cache mirroring _arch_cache / _perf_db_cache in trace_generator.py.

Wall-clock on 129-iter baseline (5 runs, mean ± stdev):
before: 1.57 s ± 0.06
after: 1.36 s ± 0.05 (-210 ms, -13%)

Sim-time output (Total clocks, Mean TTFT/TPOT/ITL) unchanged.

YWHyuk merged commit 83b8857 into main May 19, 2026
2 checks passed

YWHyuk merged 1 commit into main from claude/cache-get-config
