Inline Chakra LLM converter instead of subprocess fork #37
Merged
generate_graph() was running ``python -m chakra.src.converter.converter LLM ...`` as a fresh subprocess on every scheduler iteration. Per iteration this cost roughly 65 ms on the analytical-modeling stress workload, of which only ~7 ms was the actual text-trace -> protobuf conversion; the remaining ~58 ms was Python interpreter startup plus chakra module import.

Since chakra is already a pip-installed dependency, we can construct ``LLMConverter`` directly and call ``.convert()`` in-process. The interpreter and chakra modules then load once at simulator startup instead of 33 / 129 / 4097+ times per run.

Existing relative-path semantics (chdir into the chakra workspace) are preserved; chdir is now wrapped in a try/finally so an exception in ``.convert()`` no longer leaks the cwd to callers.

Measured on ``NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32`` baseline (33 iters, --analytical-modeling, Llama-3.1-8B):

    generate_graph total : 2.09 s -> 0.093 s (~22x)
    wall-clock total     : 2.86 s -> 0.58 s (~5x)
    Total clocks (ns)    : 144012083 (identical)

Per-iter graph_generation: ~65 ms -> ~3 ms.
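The chdir-plus-try/finally pattern described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `StandInConverter` and `convert_in_workspace` are hypothetical names, and the stand-in class only mimics the shape of a converter that reads and writes relative paths, so that the sketch runs without chakra installed.

```python
import os
import tempfile


class StandInConverter:
    """Hypothetical stand-in for a converter object; NOT chakra's LLMConverter."""

    def __init__(self, input_path, output_path, fail=False):
        self.input_path = input_path
        self.output_path = output_path
        self.fail = fail

    def convert(self):
        if self.fail:
            raise RuntimeError("conversion failed")
        # Relative paths resolve against the current working directory.
        with open(self.input_path) as src, open(self.output_path, "w") as dst:
            dst.write(src.read().upper())


def convert_in_workspace(workspace, input_path, output_path, fail=False):
    """Run the converter in-process from `workspace`, always restoring cwd."""
    previous_cwd = os.getcwd()
    os.chdir(workspace)
    try:
        StandInConverter(input_path, output_path, fail=fail).convert()
    finally:
        # Runs on success and on exception, so callers never inherit the chdir.
        os.chdir(previous_cwd)


# Usage: the caller's working directory is unchanged after the call.
with tempfile.TemporaryDirectory() as workspace:
    with open(os.path.join(workspace, "trace.txt"), "w") as f:
        f.write("comp_node\n")
    start = os.getcwd()
    convert_in_workspace(workspace, "trace.txt", "trace.out")
    assert os.getcwd() == start
```

Unlike a `subprocess.run(..., check=False)` call, an exception raised inside `.convert()` here propagates to the caller, while the `finally` clause still restores the working directory first.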
Summary

`generate_graph()` was running `python -m chakra.src.converter.converter LLM ...` as a fresh subprocess on every scheduler iteration. Profiling on the `spec_compression_stress.sh baseline` workload showed this cost ~65 ms per iteration, of which only ~7 ms was actual conversion -- the rest was Python interpreter startup (~12 ms) plus chakra module import (~47 ms).

Since chakra is already a pip-installed dependency, `LLMConverter` can be constructed and called in-process. The interpreter and chakra modules then load once at simulator startup instead of N times per run (N = 33 / 129 / 4097+ depending on workload).

Change

`serving/core/graph_generator.py` (+12 / -15):

- Remove `import subprocess`, add `from chakra.src.converter.llm_converter import LLMConverter` at module top
- Replace the `subprocess.run([...])` call with a direct `LLMConverter(input_path, output_path, num_npus, npu_offset, enable_local_offloading).convert()`
- Wrap the chdir in `try/finally` so an exception in `.convert()` no longer leaks the cwd to callers (the previous `subprocess.run` had `check=False` so failures were swallowed; in-process, exceptions propagate, which is the right behavior)

Existing relative-path semantics (chdir into the chakra workspace, then use `../../../inputs/...` paths) are preserved -- the same paths are now passed to the `LLMConverter` constructor instead of being assembled into a CLI command.

Measurement

`NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32 ./serving/spec_compression_stress.sh baseline` (33 iters, --analytical-modeling, Llama-3.1-8B), pyinstrument breakdown, with rows for:

- `generate_graph` total
- `.convert()` work
- `generate_trace`
- `Controller.read_wait`
- Total clocks (ns)
- Mean TTFT / TPOT / ITL

Larger workload (`NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128`, 129 iters) confirms the scaling: `generate_graph` is 0.443 s (~3.4 ms / iter) where the subprocess approach would extrapolate to ~8.4 s.

The remaining ~48% of post-fix wall-clock is `_load_architecture` re-parsing the architecture yaml on every iteration -- that's addressed by a separate cache PR (`claude/cache-architecture-yaml`).

Test plan

- `NUM_REQ=2 PROMPT_LEN=64 OUTPUT_LEN=32 ./serving/spec_compression_stress.sh baseline` -- `Total clocks (ns) = 144012083` unchanged; wall-clock 2.86 s -> 0.58 s
- `NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128 ./serving/spec_compression_stress.sh baseline` -- same TPOT / ITL numbers, scales as expected
- `self_verify`/`cpu_verify` modes -- spec-decode VerifyJobs also flow through `generate_graph`
- `generate_graph` is called per-instance per-iteration; in-process serialization within a single Python process is identical to subprocess serialization but skips fork cost

Generated by Claude Code
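
The scaling figures quoted under Measurement can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in this PR; the constant names are made up for illustration.

```python
# Figures quoted in this PR (names are illustrative, not from the codebase).
SUBPROCESS_MS_PER_ITER = 65      # per-iteration cost with the subprocess fork
LARGE_ITERS = 129                # NUM_REQ=4 PROMPT_LEN=128 OUTPUT_LEN=128 run
LARGE_INPROCESS_TOTAL_S = 0.443  # measured generate_graph total, in-process

# What the 129-iteration run would cost if every iteration still forked.
extrapolated_subprocess_s = SUBPROCESS_MS_PER_ITER * LARGE_ITERS / 1000

# Per-iteration cost of the in-process path on the same run.
inprocess_ms_per_iter = LARGE_INPROCESS_TOTAL_S / LARGE_ITERS * 1000

# Speedup of generate_graph on the 33-iteration baseline run.
SMALL_BEFORE_S, SMALL_AFTER_S = 2.09, 0.093
speedup = SMALL_BEFORE_S / SMALL_AFTER_S

print(f"extrapolated subprocess total: {extrapolated_subprocess_s:.1f} s")
print(f"in-process per-iter: {inprocess_ms_per_iter:.1f} ms")
print(f"generate_graph speedup: {speedup:.1f}x")
```

The results agree with the quoted ~8.4 s extrapolation, ~3.4 ms per iteration, and ~22x speedup.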