RustQC for nf-core/sarek: Single-Pass CRAM QC #18

@ewels

Problem

Sarek reads each sample's CRAM file up to seven times independently during QC: once per tool, per checkpoint. For a 30× WGS sample (~50 GB CRAM), each pass means fully decompressing and iterating over every aligned read. The most expensive single process is bcftools mpileup for NGSCheckMate, which takes over 2 hours on WES despite needing data at only ~10K genomic positions.

Current QC architecture

Sarek runs QC at two or three checkpoints after alignment. At each checkpoint, multiple tools independently read the entire CRAM:

Checkpoint 1 — Post-MarkDuplicates (CRAM_QC_MOSDEPTH_SAMTOOLS)

| Tool | What it computes | CRAM pass |
|---|---|---|
| samtools stats | Mapping rate, insert size, base quality, GC content, error rates, duplicate %, read lengths | 1 full pass |
| mosdepth | Per-window depth (500 bp), coverage distribution, per-chromosome summary | 1 full pass |

Checkpoint 2 — Post-BQSR (CRAM_QC_RECAL, runs only when base recalibration is enabled)

| Tool | What it computes | CRAM pass |
|---|---|---|
| samtools stats | Same as above, on recalibrated CRAM | 1 full pass |
| mosdepth | Same as above, on recalibrated CRAM | 1 full pass |

Checkpoint 3 — Sample identity (BAM_NGSCHECKMATE within CRAM_SAMPLEQC)

| Tool | What it computes | CRAM pass |
|---|---|---|
| bcftools mpileup (chr1) | Genotypes at known SNPs on chromosome 1 | 1 full pass |
| bcftools mpileup (chr17) | Genotypes at known SNPs on chromosome 17 | 1 full pass |
| ncm.py | Pairwise sample correlation from VCFs | N/A (reads VCFs) |

The CRAM_SAMPLEQC subworkflow also calls CRAM_QC_RECAL (another samtools stats + mosdepth pass), but only when BQSR is enabled.

Empirical runtimes (WES, HCC1395 tumor/normal, r6id.4xlarge)

| Process | Normal | Tumor |
|---|---|---|
| samtools stats (post-MarkDup) | 4m 47s | 6m 29s |
| mosdepth (post-MarkDup) | 2m 23s | 2m 50s |
| samtools stats (post-BQSR) | 4m 33s | 5m 26s |
| mosdepth (post-BQSR) | 2m 03s | 2m 55s |
| bcftools mpileup chr1 | 1h 49m | 2h 14m |
| bcftools mpileup chr17 | 33m | 40m |

The mpileup steps dominate: in both samples they account for roughly 90% of total QC wall-clock time.

Source files involved

| File | What it does |
|---|---|
| `subworkflows/local/cram_qc_mosdepth_samtools/main.nf` | Runs samtools stats + mosdepth in parallel |
| `subworkflows/local/cram_sampleqc/main.nf` | Wraps CRAM_QC_RECAL + BAM_NGSCHECKMATE |
| `subworkflows/nf-core/bam_ngscheckmate/main.nf` | Runs bcftools mpileup per sample, then ncm.py |
| `modules/nf-core/bcftools/mpileup/main.nf` | bcftools mpileup → bcftools call → VCF |
| `modules/nf-core/ngscheckmate/ncm/main.nf` | ncm.py: pairwise correlation from collected VCFs |

Proposal: rustqc-align

A single Rust binary that replaces samtools stats, mosdepth, and the bcftools mpileup genotyping step by computing all three in one streaming pass over the CRAM file.

Core algorithm

INPUTS:
  - CRAM/BAM file (coordinate-sorted, indexed)
  - Reference FASTA
  - SNP BED file (~10K positions, from NGSCheckMate)
  - Optional: target BED for WES regions

STARTUP:
  Load SNP positions into a sorted per-chromosome array
  Initialise stats accumulators (alignment, depth, genotyping)

STREAMING PASS:
  for each record in CRAM:
      // --- samtools stats equivalent ---
      Update mapping counters (mapped/unmapped/duplicate/supplementary)
      Update insert size histogram (from TLEN for proper pairs)
      Update base quality distribution (per-cycle)
      Update GC content histogram (from sequence)
      Update read length distribution
      Update mismatch/error rates (from MD tag or NM tag)
      Update MAPQ distribution

      // --- mosdepth equivalent ---
      For each aligned position in [record.start, record.end]:
          Increment depth counter
      At window boundaries (every 500 bp):
          Flush window depth to output
          Update coverage distribution histogram

      // --- NGSCheckMate genotyping ---
      Check if record overlaps any loaded SNP positions
        (binary search or pointer advance — almost always zero hits)
      For each overlapping SNP:
          If MAPQ >= threshold and not duplicate:
              Resolve base at SNP position via CIGAR alignment
              If base quality >= threshold:
                  Increment ref/alt allele counter at that position

OUTPUT:
  Emit samtools stats format file (.stats)
  Emit mosdepth output files (.summary.txt, .dist.txt, .regions.bed.gz)
  Emit per-sample VCF at SNP positions (for ncm.py)
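
The mosdepth-equivalent windowing in the streaming pass above can be sketched as a plain accumulator. The struct and method names below are hypothetical, and a real implementation would walk the CIGAR rather than treating each read as a solid `[start, end)` interval:

```rust
use std::collections::BTreeMap;

/// Sketch of a 500 bp window-depth accumulator (hypothetical API).
/// Every aligned base increments its window's counter; finalize() emits
/// (window_start, mean_depth) rows in the spirit of mosdepth's regions output.
pub struct WindowDepth {
    window_size: u32,
    counts: BTreeMap<u32, u64>, // window index -> aligned-base count
}

impl WindowDepth {
    pub fn new(window_size: u32) -> Self {
        Self { window_size, counts: BTreeMap::new() }
    }

    /// Record one read aligned over [start, end) (0-based, half-open).
    /// Simplification: no CIGAR walk, so deletions/splices are counted.
    pub fn add_read(&mut self, start: u32, end: u32) {
        let mut pos = start;
        while pos < end {
            let w = pos / self.window_size;
            let w_end = (w + 1) * self.window_size;
            // Bases this read contributes to window w.
            let n = end.min(w_end) - pos;
            *self.counts.entry(w).or_insert(0) += n as u64;
            pos = w_end;
        }
    }

    /// Emit (window_start, mean_depth) per non-empty window, in order.
    /// Simplification: divides by the full window size even for a
    /// partial terminal window.
    pub fn finalize(&self) -> Vec<(u32, f64)> {
        self.counts
            .iter()
            .map(|(w, n)| (w * self.window_size, *n as f64 / self.window_size as f64))
            .collect()
    }
}
```

The `BTreeMap` keeps windows sorted for output; with coordinate-sorted input, windows behind the current read could instead be flushed eagerly to keep memory flat.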

Why NGSCheckMate genotyping is essentially free

The SNP BED file contains ~10,000 positions across the ~3 billion base human genome. For a coordinate-sorted BAM, the SNP positions can be stored in a sorted array and checked with a single pointer that advances alongside the genomic position of each record.

For most reads, the check is: "is my position past the next SNP? No → skip." This is a single integer comparison per read. Only the tiny fraction of reads that actually overlap a SNP position (on the order of 0.05% for 150 bp reads against ~10K sites genome-wide) require CIGAR-resolved base extraction.
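
In Rust, the pointer-advance check might look like this sketch (`SnpCursor` and its API are hypothetical; one sorted position array per chromosome is assumed):

```rust
/// Cursor over sorted SNP positions for one chromosome (hypothetical sketch).
/// `idx` advances monotonically alongside the coordinate-sorted read stream,
/// so each SNP is skipped at most once across the whole pass.
pub struct SnpCursor<'a> {
    snps: &'a [u32], // sorted SNP positions
    idx: usize,      // first SNP not yet behind the read stream
}

impl<'a> SnpCursor<'a> {
    pub fn new(snps: &'a [u32]) -> Self {
        Self { snps, idx: 0 }
    }

    /// Return the SNPs overlapped by a read spanning [start, end).
    /// For the vast majority of reads the first comparison fails and an
    /// empty slice comes back: one integer compare per read.
    pub fn overlapping(&mut self, start: u32, end: u32) -> &'a [u32] {
        // SNPs before `start` can never be hit again: later reads in a
        // coordinate-sorted file start at or after `start`.
        while self.idx < self.snps.len() && self.snps[self.idx] < start {
            self.idx += 1;
        }
        let from = self.idx;
        let mut to = from;
        while to < self.snps.len() && self.snps[to] < end {
            to += 1;
        }
        &self.snps[from..to]
    }
}
```

Note that `idx` is not advanced past a returned SNP, so overlapping reads can each contribute coverage to the same site.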

The current approach runs bcftools mpileup as a separate full-genome pileup engine, building complete pileup columns at every position, computing genotype likelihoods with a probabilistic model, and piping through bcftools call. This is massively over-engineered for the purpose of extracting allele counts at 10K known positions.

Memory overhead for the genotyping component: ~10K × 16 bytes = 160 KB.

What bcftools mpileup currently does vs. what's actually needed

| Feature | bcftools mpileup | RustQC needs |
|---|---|---|
| Full-genome pileup | Yes | No — only ~10K positions |
| BAQ recalculation | Yes | No |
| Genotype likelihood model (PL/GL) | Yes | No — just allele counts |
| Indel realignment | Yes | No |
| Multi-sample calling | Yes | No — per-sample only |
| VCF output with full annotation | Yes | Minimal VCF with GT:AD fields |

Output files and compatibility

RustQC must produce output files that are compatible with two downstream consumers: MultiQC and ncm.py. No other process in the sarek pipeline consumes these QC outputs.

samtools stats output → consumed by MultiQC

The samtools stats module in MultiQC parses the SN, IS, GCF, GCL, RL, FFQ, LFQ, MPC sections of the .stats file. RustQC must emit these sections in the same tab-delimited format. This format is stable and well-documented.

mosdepth output → consumed by MultiQC

MultiQC's mosdepth module reads:

  • *.mosdepth.summary.txt — per-chromosome mean depth
  • *.mosdepth.global.dist.txt — cumulative coverage distribution
  • *.regions.bed.gz (optional) — per-region depth for WES targets

These are simple TSV files. The format is straightforward to replicate.

NGSCheckMate VCF → consumed by ncm.py

ncm.py reads standard VCF files and extracts genotype information at the SNP positions. The per-sample VCFs produced by bcftools mpileup | bcftools call are published to {outdir}/reports/ngscheckmate/vcfs/ but are not consumed by any other pipeline process. They exist purely as audit trail for the sample matching.

The VCF format required is minimal:

    ##fileformat=VCFv4.2
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
    chr1    1234567 .       A       G       .       .       DP=45   GT:AD:DP  0/1:22,23:45

The ncm.py correlation algorithm fundamentally just needs allele fractions at each SNP position. The existing pipeline passes --ploidy 1 -c to bcftools call, producing haploid calls — a deliberate simplification since NGSCheckMate only needs to distinguish homozygous ref, heterozygous, and homozygous alt.

RustQC can assign GT based on simple allele fraction thresholds (e.g., <0.15 = 0/0, 0.15–0.85 = 0/1, >0.85 = 1/1) without running a proper genotype likelihood model, and ncm.py will work identically.
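
That thresholding, plus emission of the minimal VCF body line shown above, is a few lines of code. A sketch (function names are hypothetical; the 0.15/0.85 cutoffs are the illustrative values from the text, not a likelihood model):

```rust
/// Assign a genotype from allele counts using simple fraction thresholds.
/// Returns None when there is no usable coverage at the SNP.
pub fn call_genotype(ref_count: u32, alt_count: u32) -> Option<&'static str> {
    let depth = ref_count + alt_count;
    if depth == 0 {
        return None;
    }
    let alt_fraction = alt_count as f64 / depth as f64;
    Some(if alt_fraction < 0.15 {
        "0/0" // homozygous ref
    } else if alt_fraction <= 0.85 {
        "0/1" // heterozygous
    } else {
        "1/1" // homozygous alt
    })
}

/// Format one minimal VCF body line with GT:AD:DP fields, matching the
/// example record above. Biallelic SNVs only, as NGSCheckMate expects.
pub fn vcf_line(chrom: &str, pos: u32, r: char, a: char, rc: u32, ac: u32) -> Option<String> {
    let gt = call_genotype(rc, ac)?;
    let dp = rc + ac;
    Some(format!(
        "{chrom}\t{pos}\t.\t{r}\t{a}\t.\t.\tDP={dp}\tGT:AD:DP\t{gt}:{rc},{ac}:{dp}"
    ))
}
```

Sites with zero coverage are simply omitted from the VCF, which is also what a depth-filtered mpileup output does.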

Pipeline integration

The integration point is clean. Currently, sarek calls two subworkflows at each checkpoint:

    CRAM_QC_MOSDEPTH_SAMTOOLS(cram, fasta, intervals)
    CRAM_SAMPLEQC(cram, ngscheckmate_bed, fasta, ...)

These would be replaced by a single module call:

    RUSTQC_ALIGN(cram, fasta, intervals, ngscheckmate_bed)

The outputs map directly:

| Current output | RustQC equivalent | Consumer |
|---|---|---|
| `SAMTOOLS_STATS.out.stats` | `rustqc_align.stats` | MultiQC |
| `MOSDEPTH.out.summary` | `rustqc_align.mosdepth.summary.txt` | MultiQC |
| `MOSDEPTH.out.dist` | `rustqc_align.mosdepth.global.dist.txt` | MultiQC |
| `MOSDEPTH.out.regions` | `rustqc_align.regions.bed.gz` | MultiQC |
| `BCFTOOLS_MPILEUP.out.vcf` | `rustqc_align.ngscheckmate.vcf.gz` | ncm.py, published |

ncm.py continues to run as a separate downstream aggregation step — it's fast (works on small VCFs) and doesn't need replacing.

What changes per checkpoint

| Checkpoint | Current passes | RustQC passes | Saving |
|---|---|---|---|
| Post-MarkDup | 2 (stats + mosdepth) | 1 | 1 pass |
| Post-BQSR | 2 (stats + mosdepth) | 1 | 1 pass |
| Sample QC | 2 (mpileup chr1 + chr17) | 0 (folded into checkpoint 1 or 2) | 2 passes |
| **Total per sample** | 6–7 | 1–2 | 4–5 passes eliminated |

The NGSCheckMate genotyping no longer needs its own checkpoint at all — it piggybacks on whichever stats checkpoint runs last (post-BQSR if enabled, post-MarkDup otherwise).


Estimated impact

Per-sample savings

| Scenario | Current QC time | With RustQC | Saved |
|---|---|---|---|
| WES (HCC1395-like) | ~3.5 hours | ~10 minutes | ~3 hours 20 min |
| 30× WGS | ~7 hours | ~45 minutes | ~6 hours 15 min |
| 60× WGS | ~14 hours | ~1.5 hours | ~12 hours 30 min |

The dramatic savings come almost entirely from eliminating the mpileup steps. The stats + mosdepth merge saves minutes; the mpileup elimination saves hours.

Container launch savings

Each eliminated process also eliminates a container launch on cloud executors. On AWS Batch, container scheduling + EBS volume attachment adds ~900 seconds of wall-clock overhead per task.

| Cohort | Eliminated processes | Container overhead saved |
|---|---|---|
| 100 WES samples | ~500 | ~125 hours wall-clock |
| 100 WGS samples | ~500 | ~125 hours wall-clock |
| 1000 WGS samples | ~5000 | ~1250 hours wall-clock |

Cost projection (AWS spot, r6id.4xlarge @ $0.38/hr)

| Cohort | CPU hours saved | Approx. cost saved |
|---|---|---|
| 100 WES tumor/normal | ~660 hours | ~$250 |
| 100 WGS 30× | ~1,250 hours | ~$475 |
| 1000 WGS 30× | ~12,500 hours | ~$4,750 |

Implementation approach

Rust crate dependencies

| Crate | Purpose |
|---|---|
| `noodles` or `rust-htslib` | CRAM/BAM reading and CIGAR parsing |
| `noodles-fasta` | Reference sequence access |
| `flate2` / bgzip | Compressed output (BED, VCF) |
| `clap` | CLI argument parsing |

noodles is a pure-Rust htslib alternative with no C dependencies, which simplifies compilation and container builds. rust-htslib wraps the C htslib and may offer better CRAM decompression performance — benchmarking needed.

Output validation strategy

Before adoption, RustQC outputs must be validated against the existing tools on real data:

  1. Run both RustQC and the existing tools on the same CRAM
  2. Compare samtools stats output field-by-field (allow for minor floating-point differences in percentage fields)
  3. Compare mosdepth depth values per window (should be identical)
  4. Compare NGSCheckMate VCF genotypes at each SNP position
  5. Run ncm.py on both VCF sets and confirm identical correlation matrices

Phased delivery

Phase 1: samtools stats + mosdepth equivalent

Implement the streaming stats and depth computation. This is the lowest-risk component — the algorithms are well-understood accumulator patterns, and the output formats are stable TSVs. Can be deployed as a drop-in replacement at the existing checkpoints without touching the NGSCheckMate workflow.
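
The accumulator pattern might look like this minimal sketch (hypothetical struct; only a few of the many SN fields samtools stats actually emits are shown):

```rust
/// Phase 1 sketch: per-record counters plus samtools-stats-style
/// tab-delimited "SN" output. Hypothetical struct and field selection.
#[derive(Default)]
pub struct AlignStats {
    total: u64,
    mapped: u64,
    duplicates: u64,
    supplementary: u64,
}

impl AlignStats {
    /// Update counters from one record's FLAG field (SAM bit definitions:
    /// 0x4 = unmapped, 0x400 = duplicate, 0x800 = supplementary).
    pub fn update(&mut self, flag: u16) {
        self.total += 1;
        if flag & 0x4 == 0 { self.mapped += 1; }
        if flag & 0x400 != 0 { self.duplicates += 1; }
        if flag & 0x800 != 0 { self.supplementary += 1; }
    }

    /// Emit a fragment of the stats file's SN section in the
    /// "SN<TAB>field name:<TAB>value" layout MultiQC parses.
    pub fn sn_section(&self) -> String {
        format!(
            "SN\traw total sequences:\t{}\nSN\treads mapped:\t{}\nSN\treads duplicated:\t{}\n",
            self.total, self.mapped, self.duplicates
        )
    }
}
```

The real tool needs many more accumulators (per-cycle quality, GC, insert size histograms), but each follows this same update-once-per-record shape.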

Phase 2: Add targeted genotyping

Add the SNP position lookup and allele counting to the same pass. Emit a minimal VCF. Validate against bcftools mpileup output on a reference dataset. Once validated, the BAM_NGSCHECKMATE subworkflow's mpileup step can be removed, with its VCF input replaced by RustQC's output.

Phase 3: Optimisation and cross-pipeline rollout

Profile and optimise the hot path (CRAM decompression is likely the bottleneck). Consider multi-threaded decompression. Adapt for atacseq/chipseq (drop the genotyping component, add insert size distribution for ATAC fragment size analysis).


Risks and considerations

CRAM decompression is the real bottleneck. Both samtools and mosdepth are fast C tools — the time is dominated by CRAM decompression, not computation. A Rust tool reading the same CRAM still has to decompress it. The saving comes from doing it once instead of 6–7 times, not from Rust being faster than C at decompression.

Output format fidelity. MultiQC parsers are format-sensitive. Any deviation in the stats file format (missing sections, different column ordering, different precision) will cause MultiQC to fail silently or produce incorrect plots. Extensive regression testing against real samtools output is essential.

Edge cases in CIGAR parsing. The genotyping component needs to correctly resolve query positions from CIGAR strings, handling soft clips, insertions, deletions, and supplementary alignments. This is well-trodden ground but a source of subtle bugs.
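
To illustrate the required logic, here is a sketch of resolving the query offset for a reference position from a `(length, op)` CIGAR list (hypothetical helper; a production tool would use `noodles`' or `rust-htslib`'s typed CIGAR representations):

```rust
/// Return the query (read) offset aligned to reference position `target`,
/// or None if the position falls in a deletion/skip or outside the
/// alignment. `read_start` is the 0-based reference start of the alignment.
pub fn query_offset_at(cigar: &[(u32, char)], read_start: u32, target: u32) -> Option<usize> {
    let mut ref_pos = read_start; // current reference coordinate
    let mut q_pos: usize = 0;     // current offset into the query sequence
    for &(len, op) in cigar {
        match op {
            // Match/mismatch: consumes both query and reference.
            'M' | '=' | 'X' => {
                if target < ref_pos + len {
                    if target >= ref_pos {
                        return Some(q_pos + (target - ref_pos) as usize);
                    }
                    return None; // target precedes the alignment
                }
                ref_pos += len;
                q_pos += len as usize;
            }
            // Insertion / soft clip: consumes query only.
            'I' | 'S' => q_pos += len as usize,
            // Deletion / reference skip: consumes reference only,
            // so a target inside it has no read base.
            'D' | 'N' => {
                if target >= ref_pos && target < ref_pos + len {
                    return None;
                }
                ref_pos += len;
            }
            _ => {} // 'H' and 'P' consume neither
        }
    }
    None // target lies past the end of the alignment
}
```

The duplicate and MAPQ filters from the streaming-pass pseudocode would be applied before this lookup, and supplementary alignments skipped outright.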

Community adoption. samtools and mosdepth are beloved, trusted tools. Replacing them requires demonstrating bit-level output compatibility and significant performance gains. The NGSCheckMate mpileup elimination is the compelling story — "same results, 3 hours faster per sample" is easy to sell.
