Slim simulator image (multi-stage + runtime-only deps) #30

Merged: YWHyuk merged 1 commit into main from claude/slim-sim-image on May 18, 2026

YWHyuk commented on May 18, 2026

Summary

Cuts the published ghcr.io/psal-postech/llmservingsimspec/sim image from ~3-5 GB to an estimated 300-500 MB by:

  1. Multi-stage Dockerfile — stage 1 (builder) has the C++ toolchain to compile ASTRA-Sim; stage 2 (runtime) carries only the python interpreter and the deps the simulator actually imports. No compilers, no protobuf headers, no .git history.
  2. Runtime-only Python deps: serving/ imports only pyyaml / pyinstrument / msgspec / pandas / numpy / rich / protobuf (plus chakra via subprocess). transformers / datasets / scikit-learn / xgboost / matplotlib are workload-generator + bench + power-model-training deps and live in the vLLM image (docker-vllm.sh) only, so they're dropped from the sim image.
  3. .dockerignore: keeps the build context lean by skipping profiler/perf/, bench/results/, outputs/, tests/, agent_plan/, docs/, .venv*, host-side CMake outputs, *.o / *.a / *.so, __pycache__, *.pyc. .git is intentionally NOT excluded (the builder stage needs it for git submodule update); the builder stage then prunes every .git subtree before the runtime stage copies its tree, so .git never reaches the final image.

What's not in the slim image

  • C++ build toolchain (build-essential, cmake, protobuf-compiler, libprotobuf-dev)
  • Workload generators / bench / power-model-training python deps
  • The astra-sim CMake build tree (CMakeFiles/, _deps/, CMakeCache.txt, object files, static archives)
  • .git/ directories from the repo and every submodule
  • docs/, tests/, agent_plan/, outputs/, profile + bench output dirs
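
One way to confirm the exclusions above is to walk an unpacked copy of the image filesystem and report anything that should have been pruned. This is a hedged sketch: the function name and the pattern sets are illustrative choices derived from the list above, not code from the PR. `.git` is also matched as a file because submodule checkouts commonly contain a `.git` file rather than a directory.

```python
import os

# Assumption: audit patterns derived from the exclusion list above.
FORBIDDEN_DIRS = {".git", "CMakeFiles", "_deps"}
FORBIDDEN_FILES = {".git", "CMakeCache.txt"}
FORBIDDEN_SUFFIXES = (".o", ".a")


def find_leftovers(root):
    """Return sorted paths under root that the slim image should not contain."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            if name in FORBIDDEN_DIRS:
                hits.append(os.path.join(dirpath, name))
        # Report a forbidden directory once and do not descend into it.
        dirnames[:] = [d for d in dirnames if d not in FORBIDDEN_DIRS]
        for name in filenames:
            if name in FORBIDDEN_FILES or name.endswith(FORBIDDEN_SUFFIXES):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

An empty result for the directory holding the repo tree inside the image would be consistent with the list above; a non-empty one points at the specific paths that leaked through.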

What is in the slim image

  • ubuntu:24.04 + python3 + pip + ca-certificates
  • The runtime python deps serving/ imports
  • The compiled AstraSim_Analytical_Congestion_Unaware binary at its original repo-relative path (so serving/__main__.py's cd astra-sim/ + invocation path keeps working)
  • The chakra python package, installed via pip install --no-deps from the in-repo source

Test plan

  • CI builds successfully (the workflow trigger paths include scripts/sim.Dockerfile and .github/workflows/build-sim-image.yml, but not .dockerignore — PR will need to be re-pushed or the trigger paths extended to catch it; manual workflow_dispatch is a fallback)
  • After merge, docker pull ghcr.io/psal-postech/llmservingsimspec/sim:latest and docker images shows a significantly smaller size
  • docker run --rm -it ghcr.io/psal-postech/llmservingsimspec/sim:latest python -m serving --help works without further install
  • A representative python -m serving ... --dataset workloads/example_trace.jsonl --output /tmp/x.csv run completes inside the container
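
The post-merge checks above could be scripted roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, `-it` is dropped because a script has no TTY, and the representative `python -m serving ...` run is left out because its full argument list is elided in the test plan.

```python
import shlex
import subprocess

IMAGE = "ghcr.io/psal-postech/llmservingsimspec/sim:latest"


def smoke_commands(image=IMAGE):
    """Build the docker invocations from the test plan as argv lists."""
    return [
        ["docker", "pull", image],
        ["docker", "images", image],  # prints the size for manual comparison
        ["docker", "run", "--rm", image, "python", "-m", "serving", "--help"],
    ]


def run_smoke_test(image=IMAGE):
    """Run each command in order and stop at the first failure."""
    for cmd in smoke_commands(image):
        print("+", shlex.join(cmd))
        subprocess.run(cmd, check=True)
```

`run_smoke_test()` needs a working Docker daemon and pull access to the registry; `smoke_commands()` alone can be used to print or review the commands first.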

Notes

  • The trigger paths in the existing workflow don't currently include .dockerignore. If we want every dockerignore tweak to rebuild the image, that should be appended to the path filter; for now this PR's changes hit scripts/sim.Dockerfile which already triggers.
  • If anyone needs the broader install-sim.sh dep set inside an image (workload generation, training the power model), they can keep using the vLLM image (docker-vllm.sh) or run install-sim.sh on bare metal — that script is unchanged.
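
The trigger-path point in the first note can be illustrated with a small check. This is only an approximation: GitHub evaluates its own glob syntax for `paths` filters, and `fnmatchcase` is used here as a stand-in that behaves the same for the literal paths named in this PR. The function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Trigger paths as described in the test plan.
TRIGGER_PATHS = [
    "scripts/sim.Dockerfile",
    ".github/workflows/build-sim-image.yml",
]


def triggers_build(changed_files, patterns=TRIGGER_PATHS):
    """Return True if any changed file matches any trigger pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


# A .dockerignore-only change is not picked up by the current filter ...
print(triggers_build([".dockerignore"]))
# ... but this PR also touches scripts/sim.Dockerfile, which is.
print(triggers_build([".dockerignore", "scripts/sim.Dockerfile"]))
# Appending .dockerignore to the filter would catch future tweaks.
print(triggers_build([".dockerignore"], TRIGGER_PATHS + [".dockerignore"]))
```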

Generated by Claude Code

Cuts the published ghcr.io image from ~3-5 GB to an estimated 300-500
MB by separating build vs. runtime concerns and dropping deps the
simulator doesn't actually use.

scripts/sim.Dockerfile — now multi-stage:

* Stage 1 (builder, ubuntu:24.04 + build-essential + cmake +
  protobuf-compiler): inits submodules, compiles ASTRA-Sim's
  analytical backend, then aggressively prunes CMake build trees,
  downloaded _deps, object files, and every .git directory before
  the next stage copies its tree.
* Stage 2 (runtime, ubuntu:24.04): only python3 + pip + ca-certs.
  No compilers, no protobuf-dev headers, no .git. Installs the
  narrower runtime pip set (pyyaml / pyinstrument / msgspec / pandas
  / numpy / rich / protobuf) instead of the broader scripts/install-sim.sh
  set — transformers / datasets / scikit-learn / xgboost / matplotlib
  are workload-generator + bench + power-model-training deps that the
  serving/ runtime never imports, so they live in the vLLM image
  (scripts/docker-vllm.sh) only.

.dockerignore (new) keeps the build context lean:

* perf/ + bench/results/ + outputs/ — local run artefacts
* tests/ + agent_plan/ + docs/ — non-runtime
* host-side CMake outputs + *.o / *.a / *.so — would collide with the
  builder stage's fresh compile
* .venv / .venv-cpu — local install-vllm* venvs
* __pycache__ / *.pyc — universal

.git is intentionally NOT excluded: the builder stage needs it for
git submodule update, and it prunes every .git subtree before the
runtime stage copies its tree, so .git never reaches the final image.

YWHyuk merged commit 69a6b58 into main on May 18, 2026
1 check passed