Inline Chakra LLM converter instead of subprocess fork #37

Merged: YWHyuk merged 1 commit into main from claude/inline-chakra-converter on May 19, 2026

Conversation

@YWHyuk YWHyuk commented May 19, 2026

Summary

generate_graph() was running python -m chakra.src.converter.converter LLM ... as a fresh subprocess on every scheduler iteration. Profiling on the spec_compression_stress.sh baseline workload showed this cost ~65 ms per iteration, of which only ~7 ms was actual conversion -- the rest was Python interpreter startup (~12 ms) plus chakra module import (~47 ms).

Since chakra is already a pip-installed dependency, LLMConverter can be constructed and called in-process. The interpreter and chakra modules then load once at simulator startup instead of N times per run (N = 33 / 129 / 4097+ depending on workload).
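The startup cost described above can be observed with a small timing sketch. It does not import chakra or reproduce the PR's profiling setup. It only compares launching a fresh interpreter that does nothing against calling an already-loaded function, and the helper names `time_subprocess_startup` and `time_in_process_call` are hypothetical.

```python
import subprocess
import sys
import time


def time_subprocess_startup(runs: int = 3) -> float:
    """Mean seconds to launch a fresh interpreter that does nothing."""
    start = time.perf_counter()
    for _ in range(runs):
        # Each call pays full interpreter startup, like a per-iteration subprocess.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs


def time_in_process_call(runs: int = 3) -> float:
    """Mean seconds to call a function that is already loaded in this process."""

    def noop() -> None:
        pass

    start = time.perf_counter()
    for _ in range(runs):
        noop()
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    print(f"subprocess startup: {time_subprocess_startup() * 1e3:.1f} ms per call")
    print(f"in-process call:    {time_in_process_call() * 1e3:.4f} ms per call")
```

Absolute numbers depend on the machine and Python build. The point is only that the per-call subprocess cost is paid again on every iteration, while an in-process import is paid once.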

Change

serving/core/graph_generator.py (+12 / -15):

  • Drop import subprocess, add from chakra.src.converter.llm_converter import LLMConverter at module top
  • Replace the subprocess.run([...]) call with a direct LLMConverter(input_path, output_path, num_npus, npu_offset, enable_local_offloading).convert()
  • Wrap chdir + convert in try/finally so an exception in .convert() no longer leaks the cwd to callers (the previous subprocess.run had check=False so failures were swallowed; in-process, exceptions propagate, which is the right behavior)

Existing relative-path semantics (chdir into the chakra workspace, then use ../../../inputs/... paths) are preserved -- the same paths are now passed to the LLMConverter constructor instead of being assembled into a CLI command.
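The chdir handling described in the Change list can be sketched as follows. This is not the code from `graph_generator.py`: `FakeConverter` is a hypothetical stand-in for `LLMConverter` that models only a `.convert()` method, and `convert_in_workspace` is an illustrative name.

```python
import os
import tempfile


class FakeConverter:
    """Hypothetical stand-in for LLMConverter; only .convert() is modelled."""

    def __init__(self, should_fail: bool = False) -> None:
        self.should_fail = should_fail

    def convert(self) -> str:
        if self.should_fail:
            raise RuntimeError("conversion failed")
        # Report the directory the conversion ran in.
        return os.getcwd()


def convert_in_workspace(workspace: str, converter: FakeConverter) -> str:
    """Run converter.convert() from inside workspace, then restore the cwd."""
    original_cwd = os.getcwd()
    os.chdir(workspace)
    try:
        return converter.convert()
    finally:
        # Runs on success and on exception, so callers never see a changed cwd.
        os.chdir(original_cwd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ws:
        print("converted in:", convert_in_workspace(ws, FakeConverter()))
        print("cwd restored:", os.getcwd())
```

Because the exception is re-raised after the `finally` block runs, a failed conversion is visible to the caller while the working directory is still restored.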

Measurement

NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32 ./serving/spec_compression_stress.sh baseline (33 iters, --analytical-modeling, Llama-3.1-8B), pyinstrument breakdown:

| | Before (main) | After |
| --- | --- | --- |
| generate_graph total | 2.09 s | 0.093 s (~22x) |
| ↳ per iteration | ~65 ms (subprocess + import + work) | ~3 ms (just .convert() work) |
| generate_trace | 0.51 s | 0.41 s (unchanged in this PR) |
| Controller.read_wait | 0.04 s | 0.04 s |
| Total wall-clock | 2.86 s | 0.58 s (~5x) |
| Total clocks (ns) | 144012083 | 144012083 (identical) |
| Mean TTFT / TPOT / ITL | 4.91 / 4.49 / 4.49 ms | identical |

Larger workload (NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128, 129 iters) confirms the scaling: generate_graph is 0.443 s (~3.4 ms / iter) where the subprocess approach would extrapolate to ~8.4 s.
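The per-iteration and extrapolated figures follow from simple arithmetic on the measured totals. A minimal sketch, using only the numbers reported above (the helper names are illustrative):

```python
def per_iteration_ms(total_s: float, iterations: int) -> float:
    """Average cost of one iteration in milliseconds."""
    return total_s / iterations * 1e3


def extrapolate_s(per_iter_ms: float, iterations: int) -> float:
    """Projected total in seconds if every iteration costs per_iter_ms."""
    return per_iter_ms * iterations / 1e3


if __name__ == "__main__":
    # 33-iteration run: generate_graph totals before and after the change.
    print(f"before: {per_iteration_ms(2.09, 33):.1f} ms/iter")
    print(f"after:  {per_iteration_ms(0.093, 33):.1f} ms/iter")
    print(f"generate_graph speedup: {2.09 / 0.093:.1f}x")
    print(f"wall-clock speedup:     {2.86 / 0.58:.1f}x")

    # 129-iteration run: measured in-process cost vs. projected subprocess cost.
    print(f"measured:     {per_iteration_ms(0.443, 129):.2f} ms/iter")
    print(f"extrapolated: {extrapolate_s(65, 129):.2f} s at ~65 ms/iter")
```

The 33-iteration total works out to about 63 ms per iteration, consistent with the rounded ~65 ms figure. Multiplying ~65 ms by 129 iterations gives the ~8.4 s projection.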

The remaining ~48% of post-fix wall-clock is _load_architecture re-parsing the architecture YAML on every iteration -- that's addressed by a separate cache PR (claude/cache-architecture-yaml).

Test plan

  • NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32 ./serving/spec_compression_stress.sh baseline -- Total clocks (ns) = 144012083 unchanged; wall-clock 2.86 s -> 0.58 s
  • NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128 ./serving/spec_compression_stress.sh baseline -- same TPOT / ITL numbers, scales as expected
  • Spot-check self_verify / cpu_verify modes -- spec-decode VerifyJobs also flow through generate_graph
  • Spot-check a multi-instance / DP run -- generate_graph is called per-instance per-iteration; in-process serialization within a single Python process is identical to subprocess serialization but skips fork cost

Generated by Claude Code

generate_graph() was running ``python -m chakra.src.converter.converter
LLM ...`` as a fresh subprocess on every scheduler iteration. Per
iteration this cost roughly 65 ms on the analytical-modeling stress
workload, of which only ~7 ms was the actual text-trace -> protobuf
conversion; the remaining ~58 ms was Python interpreter startup plus
chakra module import.

Since chakra is already a pip-installed dependency, we can construct
``LLMConverter`` directly and call ``.convert()`` in-process. The
interpreter and chakra modules then load once at simulator startup
instead of 33 / 129 / 4097+ times per run. Existing relative-path
semantics (chdir into the chakra workspace) are preserved; chdir is now
wrapped in a try/finally so an exception in ``.convert()`` no longer
leaks the cwd to callers.

Measured on ``NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32`` baseline (33
iters, --analytical-modeling, Llama-3.1-8B):

  generate_graph total : 2.09 s -> 0.093 s  (~22x)
  wall-clock total     : 2.86 s -> 0.58 s   (~5x)
  Total clocks (ns)    : 144012083          (identical)

Per-iter graph_generation: ~65 ms -> ~3 ms.
@YWHyuk YWHyuk merged commit d3df2d8 into main May 19, 2026
2 checks passed