Skip to content

[r] Support graph list inputs in clustering#337

Open
koefoeden wants to merge 1 commit into
bnprks:mainfrom
koefoeden:cluster-graph-list-input
Open

[r] Support graph list inputs in clustering#337
koefoeden wants to merge 1 commit into
bnprks:mainfrom
koefoeden:cluster-graph-list-input

Conversation

@koefoeden
Copy link
Copy Markdown
Contributor

@koefoeden koefoeden commented May 6, 2026

Context

I ran into a large transient memory spike when clustering a shared-nearest-neighbor graph built from about 463k cells and 20 embedding dimensions. The KNN search completed, but memory climbed sharply during graph clustering. Smaller benchmarks pointed to the igraph::graph_from_adjacency_matrix() conversion inside the current clustering path as the expensive step, rather than KNN search, SNN construction, or the final membership vector.

Representative peak RSS measurements from the same graph construction/clustering path were:

Input / stage Peak RSS
20k cells, KNN only ~0.59-0.66 GiB
20k cells, KNN + SNN ~0.68-0.70 GiB
20k cells, KNN + SNN + Leiden through sparse matrix ~5.98-6.00 GiB
30k cells, sparse-matrix clustering route ~12.58 GiB
30k cells, direct edge-list route ~0.87 GiB
50k cells, sparse-matrix clustering route ~42.9 GiB
50k cells, direct edge-list route ~1.04 GiB

A separate 20k-cell equivalence check produced identical Leiden memberships between the sparse matrix route and the direct edge-list route. This PR upstreams that lower-memory route for BPCells graph-list outputs while leaving the existing sparse matrix behavior unchanged.

Summary

  • Allow cluster_graph_leiden() and cluster_graph_louvain() to accept graph-list outputs from knn_to_snn_graph(return_type = "list") and knn_to_geodesic_graph(return_type = "list").
  • Build igraph objects directly from the edge list for list inputs, while keeping the existing sparse matrix path unchanged.
  • Convert BPCells 0-based edge-list indices to igraph 1-based vertex IDs internally.
  • Ignore self-loops for list inputs to match the existing diag = FALSE matrix behavior.
  • Add regression coverage that matrix and graph-list inputs produce identical Leiden and Louvain memberships on the same graph.

Testing

testthat::test_file("tests/testthat/test-clustering.R")

Result: 13 passed; 2 skipped because optional packages RcppAnnoy and RcppHNSW were not installed in the test environment.

The new regression assertions are in the existing igraph clustering doesn't crash test and did run. The two skipped tests are separate optional ANN tests guarded by skip_if_not_installed("RcppAnnoy") and skip_if_not_installed("RcppHNSW"); I did not install those optional packages for this focused test run.

Disclosure

GPT-5.5 was utilized to help draft this PR and implementation.

@koefoeden koefoeden marked this pull request as draft May 6, 2026 08:37
@koefoeden koefoeden marked this pull request as ready for review May 6, 2026 08:59
@koefoeden koefoeden marked this pull request as draft May 6, 2026 10:36
@koefoeden
Copy link
Copy Markdown
Contributor Author

koefoeden commented May 6, 2026

Changed to draft status due to possible regression detected. Will update!

@koefoeden koefoeden marked this pull request as ready for review May 6, 2026 11:07
@koefoeden
Copy link
Copy Markdown
Contributor Author

No regression afterall. It was a high number of single-tons in my case that resulted in >100 "distinct" clusters. I implemented my own helper to discard clusters below a given number of barcodes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant