
Use numpy sort-and-slice for per-cluster statistics#28

Open
sharifhsn wants to merge 1 commit into saeyslab:main from sharifhsn:perf/numpy-cluster-stats

Conversation


@sharifhsn sharifhsn commented Mar 22, 2026

What does this implement/fix?

Replaces per-cluster pandas boolean indexing in _update_derived_values and test_outliers with numpy sort-and-slice.

The existing code loops over every SOM node and selects cells with df[df["clustering"] == cl]. For a 10x10 grid that's 100 full-DataFrame scans per call, each constructing temporary Series and DataFrame objects. In test_outliers the same pattern appears four times per node.
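For context, the replaced pattern amounts to the following sketch (synthetic data; the variable names and sizes are illustrative, not taken from the FlowSOM code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cells, n_markers, n_nodes = 1000, 7, 100  # mirrors a 10x10 SOM grid

df = pd.DataFrame(rng.normal(size=(n_cells, n_markers)))
df["clustering"] = rng.integers(0, n_nodes, size=n_cells)

# The pattern being replaced: one full-DataFrame boolean scan per node,
# each creating a temporary mask Series and a copied sub-DataFrame.
medians = []
for cl in range(n_nodes):
    subset = df[df["clustering"] == cl]  # O(n_cells) scan + copy per iteration
    medians.append(subset.drop(columns="clustering").median().to_numpy())
medians = np.vstack(medians)
```

The cost is the repeated scans: the mask comparison alone touches all `n_cells` rows once per node, so the loop does `n_nodes` passes over data that a single sort could organize once.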

The new code sorts the data once by cluster label, finds contiguous boundaries with np.searchsorted, and slices directly into the numpy array:

sort_idx = np.argsort(labels, kind="stable")
X_sorted = X[sort_idx]
labels_sorted = labels[sort_idx]
boundaries = np.searchsorted(labels_sorted, np.arange(n_nodes + 1))

for cl in range(n_nodes):
    chunk = X_sorted[boundaries[cl]:boundaries[cl + 1]]

This is fewer lines, avoids the pandas-to-numpy round-trips, and computes statistics on contiguous memory.

Benchmarks

Test FCS file (19,225 cells, 7 markers, 10x10 grid), isolated _update_derived_values (n=50):

        Mean     σ
Before  95.2 ms  2.3 ms
After   50.9 ms  0.8 ms

Outputs are numerically identical (verified: shapes, values, and outlier counts match across all 100 nodes).
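The equivalence can be spot-checked end to end with a self-contained sketch like the one below (synthetic data; names and sizes are hypothetical, not the repo's actual fixtures):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_cells, n_markers, n_nodes = 2000, 7, 100

X = rng.normal(size=(n_cells, n_markers))
labels = rng.integers(0, n_nodes, size=n_cells)

# Sort once, then slice each cluster out of contiguous memory.
sort_idx = np.argsort(labels, kind="stable")
X_sorted = X[sort_idx]
labels_sorted = labels[sort_idx]
boundaries = np.searchsorted(labels_sorted, np.arange(n_nodes + 1))

medians = np.empty((n_nodes, n_markers))
for cl in range(n_nodes):
    chunk = X_sorted[boundaries[cl]:boundaries[cl + 1]]
    medians[cl] = np.median(chunk, axis=0) if len(chunk) else np.nan

# Cross-check against the pandas boolean-indexing result.
df = pd.DataFrame(X)
df["clustering"] = labels
for cl in range(n_nodes):
    ref = df[df["clustering"] == cl].drop(columns="clustering").median().to_numpy()
    assert np.allclose(medians[cl], ref, equal_nan=True)
```

`np.searchsorted` on the sorted labels returns the first index of each cluster, so `boundaries[cl]:boundaries[cl + 1]` is exactly that cluster's rows, and empty clusters naturally become zero-length slices.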

No new dependencies. No API changes. All 38 existing tests pass.

Replace pandas boolean indexing in _update_derived_values and
test_outliers with a sort-once-then-slice pattern.

The previous implementation iterated over each SOM node and used
pandas boolean indexing (df[df["clustering"] == cl]) to select
that node's cells. For a 10x10 grid this means 100 separate scans
of the full DataFrame, each creating temporary Series and DataFrame
objects. In test_outliers the same pattern appeared four times per
node, totaling 400 pandas indexing operations.

The new approach sorts the data array once by cluster label using
np.argsort, then uses np.searchsorted to find contiguous boundaries.
Each cluster's data is accessed as a cheap numpy slice with no
copying. Per-cluster statistics (median, std, CV, MAD) are computed
directly on these contiguous views.

Benchmark (_update_derived_values, 19225 cells, 10x10 grid, n=50):
  Before: 95.2 ms ± 2.3 ms
  After:  50.9 ms ± 0.8 ms  (1.87x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
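Since the commit message lists median, std, CV, and MAD as the per-cluster statistics, here is a minimal sketch of computing them on one contiguous chunk. This is illustrative only: the actual FlowSOM functions may differ in naming, ddof, and zero-mean handling.

```python
import numpy as np

def cluster_stats(chunk: np.ndarray) -> dict:
    """Per-marker statistics for one cluster's contiguous slice (sketch)."""
    med = np.median(chunk, axis=0)
    std = np.std(chunk, axis=0)
    mean = np.mean(chunk, axis=0)
    # Coefficient of variation: std relative to the mean (0 where mean is 0).
    cv = np.divide(std, mean, out=np.zeros_like(std), where=mean != 0)
    # Median absolute deviation from the per-marker median.
    mad = np.median(np.abs(chunk - med), axis=0)
    return {"median": med, "std": std, "cv": cv, "mad": mad}

chunk = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
stats = cluster_stats(chunk)
# stats["median"] -> [2., 20.], stats["mad"] -> [1., 10.]
```

Because `chunk` is a view into the sorted array, every reduction here runs over contiguous memory with no intermediate pandas objects.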
@sharifhsn sharifhsn changed the title Speed up per-cluster statistics with numpy sort-and-slice Use numpy sort-and-slice for per-cluster statistics Mar 22, 2026