Refactor compression pipeline: shared zarr LocalStore, hybrid MPI+threads, in-memory evaluation by kotsaloscv · Pull Request #4 · C2SM/data-compression

kotsaloscv · 2026-04-23T09:16:53Z

Summary

End-to-end refactor of the evaluate_combos / compress_with_optimal / merge_compressed_fields pipeline. The sweep is now fully in-memory, within-node parallelism moves from MPI ranks to a ThreadPoolExecutor, and compressed fields land directly in a shared {dataset}.zarr LocalStore instead of per-field .zarr.zip files that had to be unzipped and re-merged.

What changed

Storage layout

.zarr.zip is gone. compress_with_optimal writes each field directly into a shared {where_to_write}/{dataset}.zarr LocalStore (component = field name). merge_compressed_fields no longer unzips/copies/rezips — it just runs zarr.consolidate_metadata.
Commands renamed: open_zarr_zip_file_and_inspect → open_zarr_and_inspect, from_zarr_zip_to_netcdf → from_zarr_to_netcdf. Both now take a .zarr directory path.
utils.open_dataset drops .zarr.zip support; .nc, .grib, .zarr remain.
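As a rough illustration of the reader change, suffix dispatch after this PR could look like the sketch below. `classify_dataset_path` and `SUPPORTED_SUFFIXES` are hypothetical names used only here; the real `utils.open_dataset` may dispatch differently.

```python
from pathlib import Path

# Hypothetical mapping; only the three suffixes the PR says remain supported.
SUPPORTED_SUFFIXES = {".nc": "netcdf", ".grib": "grib", ".zarr": "zarr"}


def classify_dataset_path(path: str) -> str:
    """Return a format tag for a supported path, or raise ValueError."""
    p = Path(path.rstrip("/"))  # a .zarr store is a directory, so tolerate a trailing slash
    if p.name.endswith(".zarr.zip"):
        raise ValueError(
            f"{p.name}: .zarr.zip is no longer supported; pass the .zarr directory instead"
        )
    try:
        return SUPPORTED_SUFFIXES[p.suffix]
    except KeyError:
        raise ValueError(f"{p.name}: unsupported suffix {p.suffix!r}") from None
```

Rejecting `.zarr.zip` explicitly gives a clearer message than the generic "unsupported suffix `.zip`" the lookup would otherwise produce.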

Parallelism (hybrid MPI + threads)

evaluate_combos now requires 1 MPI rank per node; within-node parallelism is a ThreadPoolExecutor. Multi-rank-per-node launches are rejected with a clear launch-string hint.
New helpers: detect_node_topology (MPI-3 Split_type, hostname fallback), detect_cores_available (cgroup/Slurm-aware via sched_getaffinity), compute_default_threads_per_rank.
New --threads-per-rank flag (auto-detected if omitted).
--oversubscription-check warns/aborts if OMP_NUM_THREADS / MKL_NUM_THREADS / OPENBLAS_NUM_THREADS / BLOSC_NTHREADS / NUMBA_NUM_THREADS aren't pinned to 1. Zarr v3's internal thread pool is also pinned.
Thread-safe progress_bar and Timer (locks around shared counters/dict).
All sys.exit(1) paths that could hang siblings at the next collective are now comm.Abort(1).
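A minimal sketch of the core-detection and oversubscription checks described above, using only the standard library. The function names are illustrative and are not the PR's `detect_cores_available` / `compute_default_threads_per_rank`; the even split across ranks is an assumption.

```python
import os

# Thread-count variables named in the PR description.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "BLOSC_NTHREADS",
    "NUMBA_NUM_THREADS",
)


def cores_available() -> int:
    """Cores this process may run on, honouring the affinity mask where available."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def default_threads_per_rank(cores: int, ranks_on_node: int = 1) -> int:
    """Split the visible cores evenly across the ranks on a node, at least 1 each."""
    return max(1, cores // max(1, ranks_on_node))


def unpinned_thread_vars(env=os.environ) -> list[str]:
    """Names of thread-count variables that are unset or not equal to "1"."""
    return [name for name in THREAD_ENV_VARS if env.get(name) != "1"]
```

With one rank per node, `default_threads_per_rank(cores_available())` uses every core the scheduler granted, and a non-empty result from `unpinned_thread_vars()` is the condition under which a check like `--oversubscription-check` would warn or abort.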

Evaluation pipeline

evaluate_codec_pipeline runs entirely against a MemoryStore — no disk I/O per combo, no zip wrapping. Error norms are accumulated chunk-wise so a full decompressed copy of the sample is never held in memory.
Dask is forced to the synchronous scheduler inside each thread (scoped via with dask.config.set) to prevent nested pools.
Per-combo failures are collected rather than fatal; counts are reduced across ranks and reported on rank 0.
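The two ideas in this section, chunk-wise error accumulation and non-fatal per-combo failures, can be sketched with the standard library alone. Plain Python lists stand in for array chunks, and a "combo" is modelled as a lossy round-trip callable; both are simplifications, not the PR's `evaluate_codec_pipeline`.

```python
import math
from concurrent.futures import ThreadPoolExecutor, as_completed


def chunkwise_errors(original_chunks, decoded_chunks):
    """Accumulate L1, L2 and L-inf error norms one chunk pair at a time."""
    l1 = sq = linf = 0.0
    for orig, dec in zip(original_chunks, decoded_chunks):
        for a, b in zip(orig, dec):
            d = abs(a - b)
            l1 += d
            sq += d * d
            linf = max(linf, d)
    return {"l1": l1, "l2": math.sqrt(sq), "linf": linf}


def evaluate_combo(combo, chunks):
    """Stand-in for one codec evaluation; `combo` maps a value to its decoded value."""
    # Generator: each decoded chunk exists only while it is being compared.
    decoded = ([combo(x) for x in chunk] for chunk in chunks)
    return chunkwise_errors(chunks, decoded)


def sweep(combos, chunks, max_workers=4):
    """Run every combo in a thread pool; record failures instead of aborting."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(evaluate_combo, fn, chunks): name for name, fn in combos.items()
        }
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:  # one bad combo must not stop the sweep
                failures[name] = repr(exc)
    return results, failures
```

In an MPI run, `len(failures)` on each rank is the kind of count that would be reduced across ranks and reported on rank 0.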

Persistence path (with sharding)

New persist_with_codec_pipeline writes via dask.array.to_zarr with inner chunks + shards; Dask chunks are rechunked to shard shape so each write = one shard.
compress_with_optimal gains --inner-chunk-mib (default 16), --shard-mib (default 512), --threads, and --verify/--no-verify (on by default; disable to skip the re-read pass for trusted combos).
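To make the relationship between `--inner-chunk-mib` and `--shard-mib` concrete, here is one way to derive shapes from the two budgets. It is a sketch under stated assumptions: only the leading dimension is split, the array is non-empty, and `plan_chunks` is a hypothetical name, not the helper the PR adds. The one constraint it encodes for certain is that a shard's extent must be a whole multiple of the inner chunk's extent.

```python
MIB = 1024 * 1024


def plan_chunks(shape, itemsize, inner_chunk_mib=16, shard_mib=512):
    """Pick (inner_chunk_shape, shard_shape), splitting only the leading dimension."""
    n0, *rest = shape
    row_bytes = itemsize  # bytes in one slice along the leading dimension
    for n in rest:
        row_bytes *= n
    # As many leading rows as fit the inner-chunk budget, at least one.
    inner_rows = min(n0, max(1, (inner_chunk_mib * MIB) // row_bytes))
    # As many whole inner chunks as fit the shard budget, at least one.
    chunks_per_shard = max(1, (shard_mib * MIB) // (inner_rows * row_bytes))
    # A shard longer than the array (rounded up to whole chunks) buys nothing.
    chunks_per_shard = min(chunks_per_shard, -(-n0 // inner_rows))
    shard_rows = inner_rows * chunks_per_shard
    return (inner_rows, *rest), (shard_rows, *rest)
```

For a float32 field of shape `(1000, 721, 1440)` and the default budgets, this gives inner chunks of 4 leading rows and shards of 128 rows; rechunking the Dask array to the shard shape then makes each task write exactly one shard.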

Representative sampling

New build_representative_sample: stride-samples along the leading dim via np.linspace, keeping trailing spatial dims full. Deterministic so evaluate_combos and compress_with_optimal build identical codec spaces.
New --eval-data-size-limit flag (default "5GB") with parse_size helper for GB/GiB/MiB/... strings. Must match between evaluate_combos and compress_with_optimal so the codec-space indices resolve to the same objects.
Old --field-percentage-to-compress is removed.
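A standard-library sketch of the two pieces described above: parsing the size limit and choosing evenly spaced indices along the leading dimension. It mirrors the endpoints of `np.linspace(0, n - 1, k)` but is not the PR's `parse_size` / `build_representative_sample`; the unit table, the rounding rule, and the name `sample_indices` are assumptions.

```python
import re

# Decimal and binary units; keys are upper-cased for case-insensitive lookup.
_UNITS = {
    "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40,
}


def parse_size(text: str) -> int:
    """Parse strings such as "5GB" or "1.5 GiB" into a byte count."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if not m or m.group(2).upper() not in _UNITS:
        raise ValueError(f"cannot parse size: {text!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2).upper()])


def sample_indices(n_leading: int, bytes_per_slice: int, limit_bytes: int) -> list[int]:
    """Evenly spaced leading-dimension indices whose slices fit within limit_bytes."""
    # Assumes bytes_per_slice > 0; always keeps at least one slice.
    k = max(1, min(n_leading, limit_bytes // bytes_per_slice))
    if k == 1:
        return [0]
    # Same endpoints as np.linspace(0, n_leading - 1, k), rounded to integers.
    return [round(i * (n_leading - 1) / (k - 1)) for i in range(k)]
```

Because the indices depend only on the array shape and the limit, two commands given the same `--eval-data-size-limit` select the same slices, which is what keeps the codec-space indices aligned.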

MPI

New broadcast_numpy uses Bcast (uppercase, buffer-protocol) with shape/dtype metadata piggybacked over pickle. Lifts the payload ceiling from ~2 GB (the old [buf, MPI.BYTE] path would silently cap / corrupt at the 5 GB default sample budget) to ~16 GB for float64.
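The ceilings quoted above follow from simple arithmetic, assuming the limit is the signed 32-bit element count of a classic MPI call: the count is the same, but each element is 1 byte wide for `MPI.BYTE` and 8 bytes wide for a float64 buffer.

```python
INT32_MAX = 2**31 - 1  # largest element count a 32-bit signed int can hold


def max_payload_bytes(itemsize: int) -> int:
    """Largest buffer one call can describe when elements are `itemsize` bytes wide."""
    return INT32_MAX * itemsize


as_bytes = max_payload_bytes(1)    # buffer described element-wise as bytes
as_float64 = max_payload_bytes(8)  # buffer described element-wise as float64
sample_budget = 5 * 10**9          # the 5 GB default of --eval-data-size-limit

print(f"BYTE ceiling:    {as_bytes / 2**30:.2f} GiB")
print(f"float64 ceiling: {as_float64 / 2**30:.2f} GiB")
print("5 GB fits as BYTE:   ", sample_budget <= as_bytes)
print("5 GB fits as float64:", sample_budget <= as_float64)
```

The default 5 GB sample exceeds the byte-typed ceiling but fits comfortably under the float64 one, which is why the typed `Bcast` path matters at the default budget.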

Results / audit trail

Per-rank streaming CSV config_space_{var}_rank{N}.csv (flushed per row; survives mid-sweep crashes), consolidated into results_{var}.parquet on rank 0. Both passing and filtered-out combos are recorded, distinguished by a keep column.
Per-variable filenames use var (not field_to_compress or "all") so multi-field runs no longer collide.
analyze_clustering now takes required --where-to-write and --var flags instead of silently reading config_space.csv from cwd.
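A per-rank streaming CSV of the kind described above can be sketched as follows. The file name pattern and the `keep` column come from the description, and the three index columns from the migration note; `ratio`, `l1_error`, and the class name `RankCsvWriter` are hypothetical.

```python
import csv
import os

# Hypothetical column set; only the index columns and "keep" are named in the PR.
FIELDS = ["comp_idx", "filt_idx", "ser_idx", "ratio", "l1_error", "keep"]


class RankCsvWriter:
    """Append-only CSV for one rank; every row is flushed as soon as it is written."""

    def __init__(self, directory: str, var: str, rank: int):
        self.path = os.path.join(directory, f"config_space_{var}_rank{rank}.csv")
        # Write the header only once, so a resumed run appends to the same file.
        is_new = not os.path.exists(self.path) or os.path.getsize(self.path) == 0
        self._fh = open(self.path, "a", newline="")
        self._writer = csv.DictWriter(self._fh, fieldnames=FIELDS)
        if is_new:
            self._writer.writeheader()
            self._fh.flush()

    def write(self, row: dict) -> None:
        self._writer.writerow(row)
        self._fh.flush()  # hand the row to the OS before the next combo runs

    def close(self) -> None:
        self._fh.close()
```

Rows for rejected combos are written with `keep` set to false rather than dropped, so the consolidated table on rank 0 can still show why a combo was filtered out.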

Imports / deps

Heavyweight optional deps (matplotlib, sklearn, plotly, tqdm) are now imported lazily inside perform_clustering / analyze_clustering. evaluate_combos and compress_with_optimal no longer pay their import cost.
pyproject.toml: dropped strict pins on numpy, dask, zarr, numcodecs.
Dockerfile: removed the pip install --force-reinstall "dask[...]" "numpy==..." line that conflicted with the loosened pins.
.gitignore: added *.parquet.
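The lazy-import pattern mentioned above is just an import statement moved into the function body. In this sketch the stdlib `statistics` module stands in for a heavy optional dependency such as sklearn, and `summarize` is an illustrative name.

```python
def summarize(values):
    """Return (mean, population standard deviation) of `values`."""
    # Deferred import: only callers of this function pay the import cost,
    # and commands that never call it do not need the dependency installed.
    import statistics  # stand-in for a heavy optional dependency

    return statistics.mean(values), statistics.pstdev(values)
```

Python caches the module in `sys.modules` after the first call, so repeated calls do not re-import it.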

Breaking changes

.zarr.zip output and readers are removed.
Command renames: open_zarr_zip_file_and_inspect → open_zarr_and_inspect, from_zarr_zip_to_netcdf → from_zarr_to_netcdf.
evaluate_combos: where_to_write moves from a positional argument to the required flag --where-to-write; --field-percentage-to-compress removed; --eval-data-size-limit added.
compress_with_optimal: must be given the same --eval-data-size-limit as the sweep; new --inner-chunk-mib / --shard-mib / --threads / --verify flags.
analyze_clustering: --where-to-write and --var are now required.
evaluate_combos now requires 1 MPI rank per node; relaunch with mpirun -n <NODES> --ntasks-per-node=1 ... (or srun --nodes=<N> --ntasks-per-node=1 ...).

Migration

# Old
mpirun -n 32 dc_toolkit evaluate_combos input.nc /out --field-percentage-to-compress 10
# New
mpirun -n <NODES> --ntasks-per-node=1 \
  dc_toolkit evaluate_combos input.nc \
    --where-to-write /out \
    --eval-data-size-limit 5GB

Pass the same --eval-data-size-limit to compress_with_optimal, otherwise the (comp_idx, filt_idx, ser_idx) returned by the sweep may resolve to codec objects with slightly different parameters (symptom: worse compression ratio than the sweep reported, no error).

Copilot

Pull request overview

This PR refactors the compression/evaluation pipeline to eliminate per-combo disk I/O, switch evaluate_combos to a hybrid MPI (1 rank/node) + threads model, and write compressed outputs directly into a shared {dataset}.zarr LocalStore (with metadata consolidation replacing the old zip/merge flow).

Changes:

Reworked core utilities to support representative in-memory sampling, codec-space generation, threaded evaluation, and sharded Zarr v3 persistence.
Updated UIs and docs/scripts to align with the new CLI interface (--where-to-write, .zarr instead of .zarr.zip, 1-rank-per-node).
Loosened dependency pins and ensured the L1 threshold CSV is packaged in wheels.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 7 comments.

File	Description
src/dc_toolkit/utils.py	New sampling, chunk/shard sizing, in-memory codec evaluation, persistence into shared LocalStore, and topology/threading helpers.
src/dc_toolkit/data/l1_error_thresholds.csv	Adds local fallback table for per-variable L1 thresholds.
src/dc_toolkit/compression_analysis_ui_web.py	Updates output location/CLI invocation to use `out/` and `--where-to-write`.
src/dc_toolkit/compression_analysis_ui_vcluster.py	Same UI updates for vcluster flow.
src/dc_toolkit/compression_analysis_ui_local.py	Same UI updates for local PyQt flow.
santis.run	New production sweep driver script aligned with 1 rank/node + threads and new flags.
README.md	Documents the new end-to-end workflow, output files, and HPC topology requirements.
pyproject.toml	Drops strict pins, adds `psutil`, and includes CSV as package data.
install_dc_toolkit.sh	Updates install guidance (manual thread env pinning).
docs/PARALLELIZATION.md	New documentation explaining the MPI+threads model and rationale.
Dockerfile	Removes force-reinstall of pinned numpy/dask.
.gitignore	Tracks the L1 threshold CSV despite global `*.csv` ignore; ignores parquet/manifests, etc.

Comments suppressed due to low confidence (1)

src/dc_toolkit/utils.py:1319

Same as above in the sync evaluation path: ratio = count_bytes / count_bytes_stored can divide by zero for empty arrays / zero stored bytes. Add a guard or clearer error handling.

    with Timer("eval.info_complete"):
        info = z.info_complete()
        count_bytes, count_bytes_stored = _info_bytes(info)
        ratio = count_bytes / count_bytes_stored
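One possible shape for the guard this comment asks for, as a sketch rather than the fix that was applied. `compression_ratio` is a hypothetical helper, and raising `ValueError` is one choice among several (returning NaN or skipping the combo would also work).

```python
def compression_ratio(count_bytes: int, count_bytes_stored: int) -> float:
    """Uncompressed/stored size ratio, with an explicit error for degenerate input."""
    if count_bytes_stored <= 0:
        raise ValueError(
            f"cannot compute compression ratio: stored size is {count_bytes_stored} bytes "
            f"(uncompressed size {count_bytes}); the array may be empty"
        )
    return count_bytes / count_bytes_stored
```

Raising a descriptive error fits the per-combo failure collection described earlier: the combo is recorded as failed instead of crashing the thread with a bare `ZeroDivisionError`.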

Refactor compression pipeline: shared zarr LocalStore, hybrid MPI+threads, in-memory evaluation

ad04647

kotsaloscv requested a review from nfarabullini April 23, 2026 09:16

Refactor compression pipeline

1a192a8

nfarabullini reviewed Apr 23, 2026

Comment thread santis.run Outdated

nfarabullini reviewed Apr 23, 2026

Comment thread santis.run Outdated

nfarabullini reviewed Apr 23, 2026

Comment thread src/dc_toolkit/cli.py

nfarabullini and others added 2 commits April 24, 2026 11:16

edit to UIs following edits

dcdfc88

Refactor compression pipeline: Fixes

4cfd6bb

nfarabullini reviewed Apr 27, 2026

Comment thread README.md

kotsaloscv and others added 7 commits April 27, 2026 11:31

Refactor compression pipeline: Fixes

f52efd4

Refactor compression pipeline: Fixes

5fcc046

Refactor compression pipeline: Fixes

e38fc25

edits to santis.run

2bb1e7d

small additional edit

83fbd2c

Refactor compression pipeline: Fix Chunking & Sharding for big fields

625430a

Refactor compression pipeline: Always resume evaluate_combos from the stored results

1773d4c

nfarabullini approved these changes Apr 28, 2026

kotsaloscv added 5 commits May 8, 2026 11:33

Fixing Parallelization Strategy & testing on production files

afe0b36

Fixing Parallelization Strategy & testing on production files

8aa40ed

Fixing Parallelization Strategy & testing on production files

4a88c13

Enrich codec parametric space

0498ade

Enrich codec parametric space

8874108

nfarabullini requested a review from Copilot May 20, 2026 08:17

Copilot started reviewing on behalf of nfarabullini May 20, 2026 08:17

Copilot AI reviewed May 20, 2026

Copilot started work on behalf of nfarabullini May 20, 2026 11:37

Copilot started work on behalf of nfarabullini May 20, 2026 11:38

Copilot AI and others added 2 commits May 20, 2026 11:38

Fix params_str type annotation and use os.path.join in load_scored_results

035c567

Agent-Logs-Url: https://github.com/C2SM/data-compression/sessions/a707fb74-5a22-4ecb-9739-0671ba32ffad
Co-authored-by: nfarabullini <41536517+nfarabullini@users.noreply.github.com>

Fix params_str type annotation and use os.path.join in load_scored_results (vcluster)

9fc290c

Agent-Logs-Url: https://github.com/C2SM/data-compression/sessions/00362400-8ee0-483c-9e38-2a7b53270757
Co-authored-by: nfarabullini <41536517+nfarabullini@users.noreply.github.com>

Fix params_str type annotation and use os.path.join in load_scored_results

b98de87

Agent-Logs-Url: https://github.com/C2SM/data-compression/sessions/d3fd145d-b2df-4bfc-82ea-308983df16ae
Co-authored-by: nfarabullini <41536517+nfarabullini@users.noreply.github.com>

Copilot finished work on behalf of nfarabullini May 20, 2026 11:40

Copilot AI requested a review from nfarabullini May 20, 2026 11:40

Copilot finished work on behalf of nfarabullini May 20, 2026 11:40

kotsaloscv wants to merge 19 commits into main from hpc_refactoring

4 participants
