RustQC for nf-core/sarek: Single-Pass CRAM QC #18

@ewels

Problem

Sarek reads each sample's CRAM file up to seven times independently during QC: once per tool, per checkpoint. For a 30× WGS sample (~50 GB CRAM), each pass means fully decompressing and iterating over every aligned read. The most expensive single process is bcftools mpileup for NGSCheckMate, which takes over 2 hours on WES despite needing data at only ~10K genomic positions.

Current QC architecture

Sarek runs QC at two or three checkpoints after alignment. At each checkpoint, multiple tools independently read the entire CRAM:

Checkpoint 1 — Post-MarkDuplicates (CRAM_QC_MOSDEPTH_SAMTOOLS)

| Tool | What it computes | CRAM pass |
|---|---|---|
| samtools stats | Mapping rate, insert size, base quality, GC content, error rates, duplicate %, read lengths | 1 full pass |
| mosdepth | Per-window depth (500 bp), coverage distribution, per-chromosome summary | 1 full pass |

Checkpoint 2 — Post-BQSR (CRAM_QC_RECAL, runs only when base recalibration is enabled)

| Tool | What it computes | CRAM pass |
|---|---|---|
| samtools stats | Same as above, on recalibrated CRAM | 1 full pass |
| mosdepth | Same as above, on recalibrated CRAM | 1 full pass |

Checkpoint 3 — Sample identity (BAM_NGSCHECKMATE within CRAM_SAMPLEQC)

| Tool | What it computes | CRAM pass |
|---|---|---|
| bcftools mpileup (chr1) | Genotypes at known SNPs on chromosome 1 | 1 full pass |
| bcftools mpileup (chr17) | Genotypes at known SNPs on chromosome 17 | 1 full pass |
| ncm.py | Pairwise sample correlation from VCFs | N/A (reads VCFs) |

The CRAM_SAMPLEQC subworkflow also calls CRAM_QC_RECAL (another samtools stats + mosdepth pass), but only when BQSR is enabled.

Empirical runtimes (WES, HCC1395 tumor/normal, r6id.4xlarge)

| Process | Normal | Tumor |
|---|---|---|
| samtools stats (post-MarkDup) | 4m 47s | 6m 29s |
| mosdepth (post-MarkDup) | 2m 23s | 2m 50s |
| samtools stats (post-BQSR) | 4m 33s | 5m 26s |
| mosdepth (post-BQSR) | 2m 03s | 2m 55s |
| bcftools mpileup chr1 | 1h 49m | 2h 14m |
| bcftools mpileup chr17 | 33m | 40m |

The mpileup steps dominate: in both samples they account for roughly 90% of total QC wall-clock time.

Source files involved

| File | What it does |
|---|---|
| `subworkflows/local/cram_qc_mosdepth_samtools/main.nf` | Runs samtools stats + mosdepth in parallel |
| `subworkflows/local/cram_sampleqc/main.nf` | Wraps CRAM_QC_RECAL + BAM_NGSCHECKMATE |
| `subworkflows/nf-core/bam_ngscheckmate/main.nf` | Runs bcftools mpileup per sample, then ncm.py |
| `modules/nf-core/bcftools/mpileup/main.nf` | bcftools mpileup → bcftools call → VCF |
| `modules/nf-core/ngscheckmate/ncm/main.nf` | ncm.py: pairwise correlation from collected VCFs |

Proposal: rustqc-align

A single Rust binary that replaces samtools stats, mosdepth, and the bcftools mpileup genotyping step by computing all three in one streaming pass over the CRAM file.

Core algorithm

INPUTS:
  - CRAM/BAM file (coordinate-sorted, indexed)
  - Reference FASTA
  - SNP BED file (~10K positions, from NGSCheckMate)
  - Optional: target BED for WES regions

STARTUP:
  Load SNP positions into a sorted per-chromosome array
  Initialise stats accumulators (alignment, depth, genotyping)

STREAMING PASS:
  for each record in CRAM:
      // --- samtools stats equivalent ---
      Update mapping counters (mapped/unmapped/duplicate/supplementary)
      Update insert size histogram (from TLEN for proper pairs)
      Update base quality distribution (per-cycle)
      Update GC content histogram (from sequence)
      Update read length distribution
      Update mismatch/error rates (from MD tag or NM tag)
      Update MAPQ distribution

      // --- mosdepth equivalent ---
      For each aligned position in [record.start, record.end]:
          Increment depth counter
      At window boundaries (every 500 bp):
          Flush window depth to output
          Update coverage distribution histogram

      // --- NGSCheckMate genotyping ---
      Check if record overlaps any loaded SNP positions
        (binary search or pointer advance — almost always zero hits)
      For each overlapping SNP:
          If MAPQ >= threshold and not duplicate:
              Resolve base at SNP position via CIGAR alignment
              If base quality >= threshold:
                  Increment ref/alt allele counter at that position

OUTPUT:
  Emit samtools stats format file (.stats)
  Emit mosdepth output files (.summary.txt, .dist.txt, .regions.bed.gz)
  Emit per-sample VCF at SNP positions (for ncm.py)
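
The mosdepth-equivalent windowing in the streaming pass above can be sketched as a plain accumulator. The struct and method names below are hypothetical, and a real implementation would walk the CIGAR rather than treating each read as a solid `[start, end)` interval:

```rust
use std::collections::BTreeMap;

/// Sketch of a 500 bp window-depth accumulator (hypothetical API).
/// Every aligned base increments its window's counter; finalize() emits
/// (window_start, mean_depth) rows in the spirit of mosdepth's regions output.
pub struct WindowDepth {
    window_size: u32,
    counts: BTreeMap<u32, u64>, // window index -> aligned-base count
}

impl WindowDepth {
    pub fn new(window_size: u32) -> Self {
        Self { window_size, counts: BTreeMap::new() }
    }

    /// Record one read aligned over [start, end) (0-based, half-open).
    /// Simplification: no CIGAR walk, so deletions/splices are counted.
    pub fn add_read(&mut self, start: u32, end: u32) {
        let mut pos = start;
        while pos < end {
            let w = pos / self.window_size;
            let w_end = (w + 1) * self.window_size;
            // Bases this read contributes to window w.
            let n = end.min(w_end) - pos;
            *self.counts.entry(w).or_insert(0) += n as u64;
            pos = w_end;
        }
    }

    /// Emit (window_start, mean_depth) per non-empty window, in order.
    /// Simplification: divides by the full window size even for a
    /// partial terminal window.
    pub fn finalize(&self) -> Vec<(u32, f64)> {
        self.counts
            .iter()
            .map(|(w, n)| (w * self.window_size, *n as f64 / self.window_size as f64))
            .collect()
    }
}
```

The `BTreeMap` keeps windows sorted for output; with coordinate-sorted input, windows behind the current read could instead be flushed eagerly to keep memory flat.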

Why NGSCheckMate genotyping is essentially free

The SNP BED file contains ~10,000 positions across the ~3 billion base human genome. For a coordinate-sorted BAM, the SNP positions can be stored in a sorted array and checked with a single pointer that advances alongside the genomic position of each record.

For most reads, the check is: "is my position past the next SNP? No → skip." This is a single integer comparison per read. Only the tiny fraction of reads that actually overlap a SNP position (on the order of 0.05% for 150 bp reads against ~10K sites genome-wide) require CIGAR-resolved base extraction.
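
In Rust, the pointer-advance check might look like this sketch (`SnpCursor` and its API are hypothetical; one sorted position array per chromosome is assumed):

```rust
/// Cursor over sorted SNP positions for one chromosome (hypothetical sketch).
/// `idx` advances monotonically alongside the coordinate-sorted read stream,
/// so each SNP is skipped at most once across the whole pass.
pub struct SnpCursor<'a> {
    snps: &'a [u32], // sorted SNP positions
    idx: usize,      // first SNP not yet behind the read stream
}

impl<'a> SnpCursor<'a> {
    pub fn new(snps: &'a [u32]) -> Self {
        Self { snps, idx: 0 }
    }

    /// Return the SNPs overlapped by a read spanning [start, end).
    /// For the vast majority of reads the first comparison fails and an
    /// empty slice comes back: one integer compare per read.
    pub fn overlapping(&mut self, start: u32, end: u32) -> &'a [u32] {
        // SNPs before `start` can never be hit again: later reads in a
        // coordinate-sorted file start at or after `start`.
        while self.idx < self.snps.len() && self.snps[self.idx] < start {
            self.idx += 1;
        }
        let from = self.idx;
        let mut to = from;
        while to < self.snps.len() && self.snps[to] < end {
            to += 1;
        }
        &self.snps[from..to]
    }
}
```

Note that `idx` is not advanced past a returned SNP, so overlapping reads can each contribute coverage to the same site.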

The current approach runs bcftools mpileup as a separate full-genome pileup engine, building complete pileup columns at every position, computing genotype likelihoods with a probabilistic model, and piping through bcftools call. This is massively over-engineered for the purpose of extracting allele counts at 10K known positions.

Memory overhead for the genotyping component: ~10K × 16 bytes = 160 KB.

What bcftools mpileup currently does vs. what's actually needed

| Feature | bcftools mpileup | RustQC needs |
|---|---|---|
| Full-genome pileup | Yes | No — only ~10K positions |
| BAQ recalculation | Yes | No |
| Genotype likelihood model (PL/GL) | Yes | No — just allele counts |
| Indel realignment | Yes | No |
| Multi-sample calling | Yes | No — per-sample only |
| VCF output with full annotation | Yes | Minimal VCF with GT:AD fields |

Output files and compatibility

RustQC must produce output files that are compatible with two downstream consumers: MultiQC and ncm.py. No other process in the sarek pipeline consumes these QC outputs.

samtools stats output → consumed by MultiQC

The samtools stats module in MultiQC parses the SN, IS, GCF, GCL, RL, FFQ, LFQ, MPC sections of the .stats file. RustQC must emit these sections in the same tab-delimited format. This format is stable and well-documented.

mosdepth output → consumed by MultiQC

MultiQC's mosdepth module reads:

  • *.mosdepth.summary.txt — per-chromosome mean depth
  • *.mosdepth.global.dist.txt — cumulative coverage distribution
  • *.regions.bed.gz (optional) — per-region depth for WES targets

These are simple TSV files. The format is straightforward to replicate.

NGSCheckMate VCF → consumed by ncm.py

ncm.py reads standard VCF files and extracts genotype information at the SNP positions. The per-sample VCFs produced by bcftools mpileup | bcftools call are published to {outdir}/reports/ngscheckmate/vcfs/ but are not consumed by any other pipeline process. They exist purely as audit trail for the sample matching.

The VCF format required is minimal:

    ##fileformat=VCFv4.2
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
    chr1    1234567 .       A       G       .       .       DP=45   GT:AD:DP  0/1:22,23:45

The ncm.py correlation algorithm fundamentally just needs allele fractions at each SNP position. The existing pipeline passes --ploidy 1 -c to bcftools call, producing haploid calls — a deliberate simplification since NGSCheckMate only needs to distinguish homozygous ref, heterozygous, and homozygous alt.

RustQC can assign GT based on simple allele fraction thresholds (e.g., <0.15 = 0/0, 0.15–0.85 = 0/1, >0.85 = 1/1) without running a proper genotype likelihood model, and ncm.py will work identically.
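
That thresholding, plus emission of the minimal VCF body line shown above, is a few lines of code. A sketch (function names are hypothetical; the 0.15/0.85 cutoffs are the illustrative values from the text, not a likelihood model):

```rust
/// Assign a genotype from allele counts using simple fraction thresholds.
/// Returns None when there is no usable coverage at the SNP.
pub fn call_genotype(ref_count: u32, alt_count: u32) -> Option<&'static str> {
    let depth = ref_count + alt_count;
    if depth == 0 {
        return None;
    }
    let alt_fraction = alt_count as f64 / depth as f64;
    Some(if alt_fraction < 0.15 {
        "0/0" // homozygous ref
    } else if alt_fraction <= 0.85 {
        "0/1" // heterozygous
    } else {
        "1/1" // homozygous alt
    })
}

/// Format one minimal VCF body line with GT:AD:DP fields, matching the
/// example record above. Biallelic SNVs only, as NGSCheckMate expects.
pub fn vcf_line(chrom: &str, pos: u32, r: char, a: char, rc: u32, ac: u32) -> Option<String> {
    let gt = call_genotype(rc, ac)?;
    let dp = rc + ac;
    Some(format!(
        "{chrom}\t{pos}\t.\t{r}\t{a}\t.\t.\tDP={dp}\tGT:AD:DP\t{gt}:{rc},{ac}:{dp}"
    ))
}
```

Sites with zero coverage are simply omitted from the VCF, which is also what a depth-filtered mpileup output does.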

Pipeline integration

The integration point is clean. Currently, sarek calls two subworkflows at each checkpoint:

    CRAM_QC_MOSDEPTH_SAMTOOLS(cram, fasta, intervals)
    CRAM_SAMPLEQC(cram, ngscheckmate_bed, fasta, ...)

These would be replaced by a single module call:

    RUSTQC_ALIGN(cram, fasta, intervals, ngscheckmate_bed)

The outputs map directly:

| Current output | RustQC equivalent | Consumer |
|---|---|---|
| `SAMTOOLS_STATS.out.stats` | `rustqc_align.stats` | MultiQC |
| `MOSDEPTH.out.summary` | `rustqc_align.mosdepth.summary.txt` | MultiQC |
| `MOSDEPTH.out.dist` | `rustqc_align.mosdepth.global.dist.txt` | MultiQC |
| `MOSDEPTH.out.regions` | `rustqc_align.regions.bed.gz` | MultiQC |
| `BCFTOOLS_MPILEUP.out.vcf` | `rustqc_align.ngscheckmate.vcf.gz` | ncm.py, published |

ncm.py continues to run as a separate downstream aggregation step — it's fast (works on small VCFs) and doesn't need replacing.

What changes per checkpoint

| Checkpoint | Current passes | RustQC passes | Saving |
|---|---|---|---|
| Post-MarkDup | 2 (stats + mosdepth) | 1 | 1 pass |
| Post-BQSR | 2 (stats + mosdepth) | 1 | 1 pass |
| Sample QC | 2 (mpileup chr1 + chr17) | 0 (folded into checkpoint 1 or 2) | 2 passes |
| **Total per sample** | 6–7 | 1–2 | 4–5 passes eliminated |

The NGSCheckMate genotyping no longer needs its own checkpoint at all — it piggybacks on whichever stats checkpoint runs last (post-BQSR if enabled, post-MarkDup otherwise).


Estimated impact

Per-sample savings

| Scenario | Current QC time | With RustQC | Saved |
|---|---|---|---|
| WES (HCC1395-like) | ~3.5 hours | ~10 minutes | ~3 hours 20 min |
| 30× WGS | ~7 hours | ~45 minutes | ~6 hours 15 min |
| 60× WGS | ~14 hours | ~1.5 hours | ~12 hours 30 min |

The dramatic savings come almost entirely from eliminating the mpileup steps. The stats + mosdepth merge saves minutes; the mpileup elimination saves hours.

Container launch savings

Each eliminated process also eliminates a container launch on cloud executors. On AWS Batch, container scheduling + EBS volume attachment adds ~900 seconds of wall-clock overhead per task.

| Cohort | Eliminated processes | Container overhead saved |
|---|---|---|
| 100 WES samples | ~500 | ~125 hours wall-clock |
| 100 WGS samples | ~500 | ~125 hours wall-clock |
| 1000 WGS samples | ~5000 | ~1250 hours wall-clock |

Cost projection (AWS spot, r6id.4xlarge @ $0.38/hr)

| Cohort | CPU hours saved | Approx. cost saved |
|---|---|---|
| 100 WES tumor/normal | ~660 hours | ~$250 |
| 100 WGS 30× | ~1,250 hours | ~$475 |
| 1000 WGS 30× | ~12,500 hours | ~$4,750 |

Implementation approach

Rust crate dependencies

| Crate | Purpose |
|---|---|
| `noodles` or `rust-htslib` | CRAM/BAM reading and CIGAR parsing |
| `noodles-fasta` | Reference sequence access |
| `flate2` / bgzip | Compressed output (BED, VCF) |
| `clap` | CLI argument parsing |

noodles is a pure-Rust htslib alternative with no C dependencies, which simplifies compilation and container builds. rust-htslib wraps the C htslib and may offer better CRAM decompression performance — benchmarking needed.

Output validation strategy

Before adoption, RustQC outputs must be validated against the existing tools on real data:

  1. Run both RustQC and the existing tools on the same CRAM
  2. Compare samtools stats output field-by-field (allow for minor floating-point differences in percentage fields)
  3. Compare mosdepth depth values per window (should be identical)
  4. Compare NGSCheckMate VCF genotypes at each SNP position
  5. Run ncm.py on both VCF sets and confirm identical correlation matrices

Phased delivery

Phase 1: samtools stats + mosdepth equivalent

Implement the streaming stats and depth computation. This is the lowest-risk component — the algorithms are well-understood accumulator patterns, and the output formats are stable TSVs. Can be deployed as a drop-in replacement at the existing checkpoints without touching the NGSCheckMate workflow.
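
The accumulator pattern might look like this minimal sketch (hypothetical struct; only a few of the many SN fields samtools stats actually emits are shown):

```rust
/// Phase 1 sketch: per-record counters plus samtools-stats-style
/// tab-delimited "SN" output. Hypothetical struct and field selection.
#[derive(Default)]
pub struct AlignStats {
    total: u64,
    mapped: u64,
    duplicates: u64,
    supplementary: u64,
}

impl AlignStats {
    /// Update counters from one record's FLAG field (SAM bit definitions:
    /// 0x4 = unmapped, 0x400 = duplicate, 0x800 = supplementary).
    pub fn update(&mut self, flag: u16) {
        self.total += 1;
        if flag & 0x4 == 0 { self.mapped += 1; }
        if flag & 0x400 != 0 { self.duplicates += 1; }
        if flag & 0x800 != 0 { self.supplementary += 1; }
    }

    /// Emit a fragment of the stats file's SN section in the
    /// "SN<TAB>field name:<TAB>value" layout MultiQC parses.
    pub fn sn_section(&self) -> String {
        format!(
            "SN\traw total sequences:\t{}\nSN\treads mapped:\t{}\nSN\treads duplicated:\t{}\n",
            self.total, self.mapped, self.duplicates
        )
    }
}
```

The real tool needs many more accumulators (per-cycle quality, GC, insert size histograms), but each follows this same update-once-per-record shape.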

Phase 2: Add targeted genotyping

Add the SNP position lookup and allele counting to the same pass. Emit a minimal VCF. Validate against bcftools mpileup output on a reference dataset. Once validated, the BAM_NGSCHECKMATE subworkflow's mpileup step can be removed, with its VCF input replaced by RustQC's output.

Phase 3: Optimisation and cross-pipeline rollout

Profile and optimise the hot path (CRAM decompression is likely the bottleneck). Consider multi-threaded decompression. Adapt for atacseq/chipseq (drop the genotyping component, add insert size distribution for ATAC fragment size analysis).


Risks and considerations

CRAM decompression is the real bottleneck. Both samtools and mosdepth are fast C tools — the time is dominated by CRAM decompression, not computation. A Rust tool reading the same CRAM still has to decompress it. The saving comes from doing it once instead of 6–7 times, not from Rust being faster than C at decompression.

Output format fidelity. MultiQC parsers are format-sensitive. Any deviation in the stats file format (missing sections, different column ordering, different precision) will cause MultiQC to fail silently or produce incorrect plots. Extensive regression testing against real samtools output is essential.

Edge cases in CIGAR parsing. The genotyping component needs to correctly resolve query positions from CIGAR strings, handling soft clips, insertions, deletions, and supplementary alignments. This is well-trodden ground but a source of subtle bugs.
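
To illustrate the required logic, here is a sketch of resolving the query offset for a reference position from a `(length, op)` CIGAR list (hypothetical helper; a production tool would use `noodles`' or `rust-htslib`'s typed CIGAR representations):

```rust
/// Return the query (read) offset aligned to reference position `target`,
/// or None if the position falls in a deletion/skip or outside the
/// alignment. `read_start` is the 0-based reference start of the alignment.
pub fn query_offset_at(cigar: &[(u32, char)], read_start: u32, target: u32) -> Option<usize> {
    let mut ref_pos = read_start; // current reference coordinate
    let mut q_pos: usize = 0;     // current offset into the query sequence
    for &(len, op) in cigar {
        match op {
            // Match/mismatch: consumes both query and reference.
            'M' | '=' | 'X' => {
                if target < ref_pos + len {
                    if target >= ref_pos {
                        return Some(q_pos + (target - ref_pos) as usize);
                    }
                    return None; // target precedes the alignment
                }
                ref_pos += len;
                q_pos += len as usize;
            }
            // Insertion / soft clip: consumes query only.
            'I' | 'S' => q_pos += len as usize,
            // Deletion / reference skip: consumes reference only,
            // so a target inside it has no read base.
            'D' | 'N' => {
                if target >= ref_pos && target < ref_pos + len {
                    return None;
                }
                ref_pos += len;
            }
            _ => {} // 'H' and 'P' consume neither
        }
    }
    None // target lies past the end of the alignment
}
```

The duplicate and MAPQ filters from the streaming-pass pseudocode would be applied before this lookup, and supplementary alignments skipped outright.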

Community adoption. samtools and mosdepth are beloved, trusted tools. Replacing them requires demonstrating bit-level output compatibility and significant performance gains. The NGSCheckMate mpileup elimination is the compelling story — "same results, 3 hours faster per sample" is easy to sell.
