Inline Chakra LLM converter instead of subprocess fork by YWHyuk · Pull Request #37 · PSAL-POSTECH/LLMServingSimSpec

YWHyuk · 2026-05-19T04:49:39Z

Summary

`generate_graph()` was running `python -m chakra.src.converter.converter LLM ...` as a fresh subprocess on every scheduler iteration. Profiling on the `spec_compression_stress.sh baseline` workload showed this cost ~65 ms per iteration, of which only ~7 ms was actual conversion; the rest was Python interpreter startup (~12 ms) plus chakra module import (~47 ms).

Since chakra is already a pip-installed dependency, `LLMConverter` can be constructed and called in-process. The interpreter and chakra modules then load once at simulator startup instead of once per iteration (33 / 129 / 4097+ iterations per run, depending on workload).
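The "load once" behaviour comes from Python's module cache: after the first import, later imports of the same module are served from `sys.modules`. A minimal sketch, using the standard-library `json` module as a stand-in for chakra; the helper name `timed_import` is illustrative and not part of this PR:

```python
import importlib
import sys
import time


def timed_import(name):
    """Import `name` and return (module, elapsed milliseconds)."""
    start = time.perf_counter()
    module = importlib.import_module(name)
    return module, (time.perf_counter() - start) * 1000.0


# The first call may execute the module body (unless something else already
# imported it); the second call is only a lookup in sys.modules.
first, first_ms = timed_import("json")
second, second_ms = timed_import("json")

print(f"first import:  {first_ms:.3f} ms")
print(f"second import: {second_ms:.3f} ms")
print("same module object:", first is second)
```

A fresh subprocess starts with an empty cache, so it pays the import cost every time; a long-lived simulator process pays it once.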

Change

`serving/core/graph_generator.py` (+12 / -15):

- Drop `import subprocess`; add `from chakra.src.converter.llm_converter import LLMConverter` at module top.
- Replace the `subprocess.run([...])` call with a direct `LLMConverter(input_path, output_path, num_npus, npu_offset, enable_local_offloading).convert()`.
- Wrap `chdir` + convert in `try`/`finally` so an exception in `.convert()` no longer leaks the cwd to callers. The previous `subprocess.run` had `check=False`, so failures were swallowed; in-process, exceptions propagate, which is the right behavior.
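To illustrate the `check=False` point: with that setting, `subprocess.run` reports a failing child only through `returncode`, so a caller that never inspects it continues silently. A small sketch, where the failing command is a stand-in for a converter crash and not the actual chakra invocation:

```python
import subprocess
import sys

# A child process that exits with code 3, standing in for a failed conversion.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]

# check=False: no exception; the failure is visible only via returncode.
result = subprocess.run(failing_cmd, check=False)
swallowed_code = result.returncode

# check=True: the same failure raises CalledProcessError.
try:
    subprocess.run(failing_cmd, check=True)
    raised = False
except subprocess.CalledProcessError as exc:
    raised = True
    raised_code = exc.returncode

print("check=False returncode:", swallowed_code)
print("check=True raised:", raised)
```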

Existing relative-path semantics (chdir into the chakra workspace, then use `../../../inputs/...` paths) are preserved: the same paths are now passed to the `LLMConverter` constructor instead of being assembled into a CLI command.
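The chdir-then-convert pattern with guaranteed cwd restoration might look roughly like the sketch below. `StandInConverter` and `convert_in_workspace` are hypothetical names for illustration; the real code calls chakra's `LLMConverter` with the arguments listed above, and the file names here are made up.

```python
import os
import tempfile


class StandInConverter:
    """Hypothetical stand-in for LLMConverter; only .convert() is modelled."""

    def __init__(self, input_path, output_path, fail=False):
        self.input_path = input_path
        self.output_path = output_path
        self.fail = fail

    def convert(self):
        if self.fail:
            raise RuntimeError("conversion failed")
        # Relative paths resolve against the current working directory.
        with open(self.input_path) as src, open(self.output_path, "w") as dst:
            dst.write(src.read().upper())


def convert_in_workspace(workspace, converter):
    """Run converter.convert() with cwd=workspace, always restoring the cwd."""
    previous_cwd = os.getcwd()
    os.chdir(workspace)
    try:
        converter.convert()
    finally:
        os.chdir(previous_cwd)


with tempfile.TemporaryDirectory() as workspace:
    with open(os.path.join(workspace, "trace.txt"), "w") as f:
        f.write("layer0\n")

    before = os.getcwd()
    convert_in_workspace(workspace, StandInConverter("trace.txt", "trace.out"))
    with open(os.path.join(workspace, "trace.out")) as f:
        converted = f.read()

    try:
        convert_in_workspace(workspace, StandInConverter("trace.txt", "x", fail=True))
        propagated = False
    except RuntimeError:
        propagated = True  # the exception reaches the caller...
    cwd_restored = os.getcwd() == before  # ...and the cwd is still restored
```

On Python 3.11+, `contextlib.chdir` provides the same restore-on-exit behaviour as a context manager.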

Measurement

NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32 ./serving/spec_compression_stress.sh baseline (33 iters, --analytical-modeling, Llama-3.1-8B), pyinstrument breakdown:

| Metric | Before (main) | After |
|---|---|---|
| `generate_graph` total | 2.09 s | 0.093 s (~22x) |
| ↳ per iteration | ~65 ms (subprocess + import + work) | ~3 ms (just `.convert()` work) |
| `generate_trace` | 0.51 s | 0.41 s (unchanged in this PR) |
| `Controller.read_wait` | 0.04 s | 0.04 s |
| Total wall-clock | 2.86 s | 0.58 s (~5x) |
| `Total clocks (ns)` | 144012083 | 144012083 (identical) |
| `Mean TTFT / TPOT / ITL` | 4.91 / 4.49 / 4.49 ms | identical |

Larger workload (NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128, 129 iters) confirms the scaling: generate_graph is 0.443 s (~3.4 ms / iter) where the subprocess approach would extrapolate to ~8.4 s.
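The quoted speedups and the ~8.4 s extrapolation can be re-derived from the figures reported in this section. The snippet below is only that arithmetic; the variable names are illustrative and every input is copied from the measurements above.

```python
# Inputs copied from the measurements reported in this PR.
per_iter_before_ms = 65.0          # subprocess cost per iteration
iters_small, iters_large = 33, 129
graph_before_s, graph_after_s = 2.09, 0.093
wall_before_s, wall_after_s = 2.86, 0.58
graph_after_large_s = 0.443

# Subprocess cost extrapolated to each workload.
extrapolated_small_s = per_iter_before_ms * iters_small / 1000.0   # 2.145 s
extrapolated_large_s = per_iter_before_ms * iters_large / 1000.0   # 8.385 s

# Per-iteration cost after the change on the larger workload.
per_iter_after_large_ms = graph_after_large_s / iters_large * 1000.0

# Speedup ratios on the small workload.
graph_speedup = graph_before_s / graph_after_s
wall_speedup = wall_before_s / wall_after_s

print(f"extrapolated (33 iters):  {extrapolated_small_s:.3f} s")
print(f"extrapolated (129 iters): {extrapolated_large_s:.3f} s")
print(f"after, per iter (129):    {per_iter_after_large_ms:.2f} ms")
print(f"generate_graph speedup:   {graph_speedup:.1f}x")
print(f"wall-clock speedup:       {wall_speedup:.1f}x")
```

The 33-iteration extrapolation (about 2.1 s) lands close to the measured 2.09 s, which supports treating the subprocess cost as roughly constant per iteration.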

The remaining ~48% of post-fix wall-clock is _load_architecture re-parsing the architecture yaml on every iteration -- that's addressed by a separate cache PR (claude/cache-architecture-yaml).

Test plan

- `NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32 ./serving/spec_compression_stress.sh baseline`: `Total clocks (ns)` = 144012083 unchanged; wall-clock 2.86 s -> 0.58 s
- `NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128 ./serving/spec_compression_stress.sh baseline`: same TPOT / ITL numbers, scales as expected
- Spot-check `self_verify` / `cpu_verify` modes: spec-decode VerifyJobs also flow through `generate_graph`
- Spot-check a multi-instance / DP run: `generate_graph` is called per-instance per-iteration; the calls are still serialized within the single Python process, exactly as the subprocess calls were, but without the fork cost

Generated by Claude Code

generate_graph() was running ``python -m chakra.src.converter.converter LLM ...`` as a fresh subprocess on every scheduler iteration. Per iteration this cost roughly 65 ms on the analytical-modeling stress workload, of which only ~7 ms was the actual text-trace -> protobuf conversion; the remaining ~58 ms was Python interpreter startup plus chakra module import.

Since chakra is already a pip-installed dependency, we can construct ``LLMConverter`` directly and call ``.convert()`` in-process. The interpreter and chakra modules then load once at simulator startup instead of 33 / 129 / 4097+ times per run.

Existing relative-path semantics (chdir into the chakra workspace) are preserved; chdir is now wrapped in a try/finally so an exception in ``.convert()`` no longer leaks the cwd to callers.

Measured on ``NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32`` baseline (33 iters, --analytical-modeling, Llama-3.1-8B):

  generate_graph total      : 2.09 s -> 0.093 s (~22x)
  wall-clock total          : 2.86 s -> 0.58 s (~5x)
  Total clocks (ns)         : 144012083 (identical)
  Per-iter graph_generation : ~65 ms -> ~3 ms

YWHyuk merged commit d3df2d8 into main May 19, 2026
2 checks passed

YWHyuk mentioned this pull request May 19, 2026

Cache get_config by model_name #38

Merged

4 tasks

YWHyuk merged 1 commit into main from claude/inline-chakra-converter
