
Use numpy sort-and-slice for per-cluster statistics#28

Open
sharifhsn wants to merge 1 commit into saeyslab:main from sharifhsn:perf/numpy-cluster-stats

Conversation


@sharifhsn sharifhsn commented Mar 22, 2026

What does this implement/fix?

Replaces per-cluster pandas boolean indexing in _update_derived_values and test_outliers with numpy sort-and-slice.

The existing code loops over every SOM node and selects cells with df[df["clustering"] == cl]. For a 10x10 grid that's 100 full-DataFrame scans per call, each constructing temporary Series and DataFrame objects. In test_outliers the same pattern appears four times per node.
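For context, the replaced pattern amounts to the following sketch (synthetic data; the variable names and sizes are illustrative, not taken from the FlowSOM code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cells, n_markers, n_nodes = 1000, 7, 100  # mirrors a 10x10 SOM grid

df = pd.DataFrame(rng.normal(size=(n_cells, n_markers)))
df["clustering"] = rng.integers(0, n_nodes, size=n_cells)

# The pattern being replaced: one full-DataFrame boolean scan per node,
# each creating a temporary mask Series and a copied sub-DataFrame.
medians = []
for cl in range(n_nodes):
    subset = df[df["clustering"] == cl]  # O(n_cells) scan + copy per iteration
    medians.append(subset.drop(columns="clustering").median().to_numpy())
medians = np.vstack(medians)
```

The cost is the repeated scans: the mask comparison alone touches all `n_cells` rows once per node, so the loop does `n_nodes` passes over data that a single sort could organize once.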

The new code sorts the data once by cluster label, finds contiguous boundaries with np.searchsorted, and slices directly into the numpy array:

sort_idx = np.argsort(labels, kind="stable")
X_sorted = X[sort_idx]
labels_sorted = labels[sort_idx]
boundaries = np.searchsorted(labels_sorted, np.arange(n_nodes + 1))

for cl in range(n_nodes):
    chunk = X_sorted[boundaries[cl]:boundaries[cl + 1]]

This is fewer lines, avoids the pandas-to-numpy round-trips, and computes statistics on contiguous memory.

Benchmarks

Test FCS file (19,225 cells, 7 markers, 10x10 grid), isolated _update_derived_values (n=50):

        Mean     σ
Before  95.2 ms  2.3 ms
After   50.9 ms  0.8 ms

Outputs are numerically identical (verified: shapes, values, and outlier counts match across all 100 nodes).
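The equivalence can be spot-checked end to end with a self-contained sketch like the one below (synthetic data; names and sizes are hypothetical, not the repo's actual fixtures):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_cells, n_markers, n_nodes = 2000, 7, 100

X = rng.normal(size=(n_cells, n_markers))
labels = rng.integers(0, n_nodes, size=n_cells)

# Sort once, then slice each cluster out of contiguous memory.
sort_idx = np.argsort(labels, kind="stable")
X_sorted = X[sort_idx]
labels_sorted = labels[sort_idx]
boundaries = np.searchsorted(labels_sorted, np.arange(n_nodes + 1))

medians = np.empty((n_nodes, n_markers))
for cl in range(n_nodes):
    chunk = X_sorted[boundaries[cl]:boundaries[cl + 1]]
    medians[cl] = np.median(chunk, axis=0) if len(chunk) else np.nan

# Cross-check against the pandas boolean-indexing result.
df = pd.DataFrame(X)
df["clustering"] = labels
for cl in range(n_nodes):
    ref = df[df["clustering"] == cl].drop(columns="clustering").median().to_numpy()
    assert np.allclose(medians[cl], ref, equal_nan=True)
```

`np.searchsorted` on the sorted labels returns the first index of each cluster, so `boundaries[cl]:boundaries[cl + 1]` is exactly that cluster's rows, and empty clusters naturally become zero-length slices.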

No new dependencies. No API changes. All 38 existing tests pass.

Replace pandas boolean indexing in _update_derived_values and
test_outliers with a sort-once-then-slice pattern.

The previous implementation iterated over each SOM node and used
pandas boolean indexing (df[df["clustering"] == cl]) to select
that node's cells. For a 10x10 grid this means 100 separate scans
of the full DataFrame, each creating temporary Series and DataFrame
objects. In test_outliers the same pattern appeared four times per
node, totaling 400 pandas indexing operations.

The new approach sorts the data array once by cluster label using
np.argsort, then uses np.searchsorted to find contiguous boundaries.
Each cluster's data is accessed as a cheap numpy slice with no
copying. Per-cluster statistics (median, std, CV, MAD) are computed
directly on these contiguous views.

Benchmark (_update_derived_values, 19225 cells, 10x10 grid, n=50):
  Before: 95.2 ms ± 2.3 ms
  After:  50.9 ms ± 0.8 ms  (1.87x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
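Since the commit message lists median, std, CV, and MAD as the per-cluster statistics, here is a minimal sketch of computing them on one contiguous chunk. This is illustrative only: the actual FlowSOM functions may differ in naming, ddof, and zero-mean handling.

```python
import numpy as np

def cluster_stats(chunk: np.ndarray) -> dict:
    """Per-marker statistics for one cluster's contiguous slice (sketch)."""
    med = np.median(chunk, axis=0)
    std = np.std(chunk, axis=0)
    mean = np.mean(chunk, axis=0)
    # Coefficient of variation: std relative to the mean (0 where mean is 0).
    cv = np.divide(std, mean, out=np.zeros_like(std), where=mean != 0)
    # Median absolute deviation from the per-marker median.
    mad = np.median(np.abs(chunk - med), axis=0)
    return {"median": med, "std": std, "cv": cv, "mad": mad}

chunk = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
stats = cluster_stats(chunk)
# stats["median"] -> [2., 20.], stats["mad"] -> [1., 10.]
```

Because `chunk` is a view into the sorted array, every reduction here runs over contiguous memory with no intermediate pandas objects.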
@sharifhsn sharifhsn changed the title Speed up per-cluster statistics with numpy sort-and-slice Use numpy sort-and-slice for per-cluster statistics Mar 22, 2026