Problem
Sarek reads each sample's CRAM file up to 7 times independently during QC — once per tool, per checkpoint. For a 30× WGS sample (~50 GB CRAM), each pass means fully decompressing and iterating over every aligned read. The most expensive single process is `bcftools mpileup` for NGSCheckMate, which takes over 2 hours on WES despite only needing data at ~10K genomic positions.
Current QC architecture
Sarek runs QC at two or three checkpoints after alignment. At each checkpoint, multiple tools independently read the entire CRAM:
Checkpoint 1 — Post-MarkDuplicates (CRAM_QC_MOSDEPTH_SAMTOOLS)
| Tool | What it computes | CRAM pass |
|---|---|---|
| `samtools stats` | Mapping rate, insert size, base quality, GC content, error rates, duplicate %, read lengths | 1 full pass |
| `mosdepth` | Per-window depth (500 bp), coverage distribution, per-chromosome summary | 1 full pass |
Checkpoint 2 — Post-BQSR (CRAM_QC_RECAL, runs only when base recalibration is enabled)
| Tool | What it computes | CRAM pass |
|---|---|---|
| `samtools stats` | Same as above, on recalibrated CRAM | 1 full pass |
| `mosdepth` | Same as above, on recalibrated CRAM | 1 full pass |
Checkpoint 3 — Sample identity (BAM_NGSCHECKMATE within CRAM_SAMPLEQC)
| Tool | What it computes | CRAM pass |
|---|---|---|
| `bcftools mpileup` (chr1) | Genotypes at known SNPs on chromosome 1 | 1 full pass |
| `bcftools mpileup` (chr17) | Genotypes at known SNPs on chromosome 17 | 1 full pass |
| `ncm.py` | Pairwise sample correlation from VCFs | N/A (reads VCFs) |
The CRAM_SAMPLEQC subworkflow also calls CRAM_QC_RECAL (another samtools stats + mosdepth pass), but only when BQSR is enabled.
Empirical runtimes (WES, HCC1395 tumor/normal, r6id.4xlarge)
| Process | Normal | Tumor |
|---|---|---|
| `samtools stats` (post-MarkDup) | 4m 47s | 6m 29s |
| `mosdepth` (post-MarkDup) | 2m 23s | 2m 50s |
| `samtools stats` (post-BQSR) | 4m 33s | 5m 26s |
| `mosdepth` (post-BQSR) | 2m 03s | 2m 55s |
| `bcftools mpileup` chr1 | 1h 49m | 2h 14m |
| `bcftools mpileup` chr17 | 33m | 40m |
The mpileup steps dominate — they account for over 80% of all QC wall-clock time.
Source files involved
| File | What it does |
|---|---|
| `subworkflows/local/cram_qc_mosdepth_samtools/main.nf` | Runs `samtools stats` + `mosdepth` in parallel |
| `subworkflows/local/cram_sampleqc/main.nf` | Wraps `CRAM_QC_RECAL` + `BAM_NGSCHECKMATE` |
| `subworkflows/nf-core/bam_ngscheckmate/main.nf` | Runs `bcftools mpileup` per sample, then `ncm.py` |
| `modules/nf-core/bcftools/mpileup/main.nf` | `bcftools mpileup` → `bcftools call` → VCF |
| `modules/nf-core/ngscheckmate/ncm/main.nf` | `ncm.py`: pairwise correlation from collected VCFs |
Proposal: rustqc-align
A single Rust binary that replaces samtools stats, mosdepth, and the bcftools mpileup genotyping step by computing all three in one streaming pass over the CRAM file.
Core algorithm
```
INPUTS:
  - CRAM/BAM file (coordinate-sorted, indexed)
  - Reference FASTA
  - SNP BED file (~10K positions, from NGSCheckMate)
  - Optional: target BED for WES regions

STARTUP:
  Load SNP positions into a sorted per-chromosome array
  Initialise stats accumulators (alignment, depth, genotyping)

STREAMING PASS:
  for each record in CRAM:

    // --- samtools stats equivalent ---
    Update mapping counters (mapped/unmapped/duplicate/supplementary)
    Update insert size histogram (from TLEN for proper pairs)
    Update base quality distribution (per-cycle)
    Update GC content histogram (from sequence)
    Update read length distribution
    Update mismatch/error rates (from MD tag or NM tag)
    Update MAPQ distribution

    // --- mosdepth equivalent ---
    For each aligned position in [record.start, record.end]:
      Increment depth counter
    At window boundaries (every 500 bp):
      Flush window depth to output
      Update coverage distribution histogram

    // --- NGSCheckMate genotyping ---
    Check if record overlaps any loaded SNP positions
      (binary search or pointer advance — almost always zero hits)
    For each overlapping SNP:
      If MAPQ >= threshold and not duplicate:
        Resolve base at SNP position via CIGAR alignment
        If base quality >= threshold:
          Increment ref/alt allele counter at that position

OUTPUT:
  Emit samtools stats format file (.stats)
  Emit mosdepth output files (.summary.txt, .dist.txt, .regions.bed.gz)
  Emit per-sample VCF at SNP positions (for ncm.py)
```
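The mosdepth-equivalent step above can be sketched as a window accumulator. This is an illustrative sketch, not the tool's actual implementation: the `WindowDepth` type and its methods are hypothetical, it handles a single chromosome, and it collapses per-base increments into interval adds.

```rust
/// Accumulates per-window mean depth over fixed-size windows
/// (500 bp in the proposal) for one chromosome.
struct WindowDepth {
    window_size: u64,
    /// Sum of covered bases falling in each window.
    base_counts: Vec<u64>,
}

impl WindowDepth {
    fn new(chrom_len: u64, window_size: u64) -> Self {
        let n = ((chrom_len + window_size - 1) / window_size) as usize;
        WindowDepth { window_size, base_counts: vec![0; n] }
    }

    /// Add one aligned block [start, end) (0-based, end-exclusive).
    /// Each covered base adds 1 to its window's base count.
    fn add_alignment(&mut self, start: u64, end: u64) {
        let mut pos = start;
        while pos < end {
            let w = (pos / self.window_size) as usize;
            // How far this alignment extends within window `w`.
            let win_end = (w as u64 + 1) * self.window_size;
            let span = end.min(win_end) - pos;
            self.base_counts[w] += span;
            pos += span;
        }
    }

    /// Mean depth of window `w` (covered bases / window size).
    fn mean_depth(&self, w: usize) -> f64 {
        self.base_counts[w] as f64 / self.window_size as f64
    }
}

fn main() {
    let mut wd = WindowDepth::new(1_500, 500);
    wd.add_alignment(0, 500);   // exactly fills window 0
    wd.add_alignment(400, 600); // straddles windows 0 and 1
    println!("{} {}", wd.mean_depth(0), wd.mean_depth(1));
}
```

A real implementation would split reads into aligned blocks from the CIGAR (skipping deletions and introns) before feeding them in.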
Why NGSCheckMate genotyping is essentially free
The SNP BED file contains ~10,000 positions across the ~3 billion base human genome. For a coordinate-sorted BAM, the SNP positions can be stored in a sorted array and checked with a single pointer that advances alongside the genomic position of each record.
For most reads, the check is: "is my position past the next SNP? No → skip." This is a single integer comparison per read. Only the ~0.003% of reads overlapping a SNP position require CIGAR-resolved base extraction.
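The pointer-advance check can be sketched in a few lines. `overlapping_snps` is a hypothetical helper operating on one chromosome's sorted SNP array; real code would also track per-chromosome state and use CIGAR-derived reference spans.

```rust
/// Returns the indices of SNP positions (sorted, 0-based) overlapped by
/// the read span [start, end), advancing `cursor` past positions the
/// coordinate-sorted stream has already left behind.
fn overlapping_snps(
    snps: &[u64],
    cursor: &mut usize,
    start: u64,
    end: u64,
) -> Vec<usize> {
    // Cheap fast path: skip SNPs strictly before this read. Because the
    // input is coordinate-sorted, those positions can never be hit again.
    while *cursor < snps.len() && snps[*cursor] < start {
        *cursor += 1;
    }
    // Collect the (almost always zero) SNPs inside the read.
    let mut hits = Vec::new();
    let mut i = *cursor;
    while i < snps.len() && snps[i] < end {
        hits.push(i);
        i += 1;
    }
    hits
}

fn main() {
    let snps = vec![100, 5_000, 9_999];
    let mut cursor = 0;
    // Typical read: no overlap, a single comparison after the skip.
    assert!(overlapping_snps(&snps, &mut cursor, 200, 350).is_empty());
    // A read covering the SNP at position 5,000.
    assert_eq!(overlapping_snps(&snps, &mut cursor, 4_950, 5_100), vec![1]);
}
```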
The current approach runs bcftools mpileup as a separate full-genome pileup engine, building complete pileup columns at every position, computing genotype likelihoods with a probabilistic model, and piping through bcftools call. This is massively over-engineered for the purpose of extracting allele counts at 10K known positions.
Memory overhead for the genotyping component: ~10K × 16 bytes = 160 KB.
What bcftools mpileup currently does vs. what's actually needed
| Feature | bcftools mpileup | RustQC needs |
|---|---|---|
| Full-genome pileup | Yes | No — only ~10K positions |
| BAQ recalculation | Yes | No |
| Genotype likelihood model (PL/GL) | Yes | No — just allele counts |
| Indel realignment | Yes | No |
| Multi-sample calling | Yes | No — per-sample only |
| VCF output with full annotation | Yes | Minimal VCF with GT:AD fields |
Output files and compatibility
RustQC must produce output files that are compatible with two downstream consumers: MultiQC and ncm.py. No other process in the sarek pipeline consumes these QC outputs.
samtools stats output → consumed by MultiQC
The samtools stats module in MultiQC parses the SN, IS, GCF, GCL, RL, FFQ, LFQ, MPC sections of the .stats file. RustQC must emit these sections in the same tab-delimited format. This format is stable and well-documented.
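A sketch of what emitting that section might look like. The `StatsCounters` struct is hypothetical; the keys shown ("raw total sequences", "reads mapped", "reads duplicated") are real samtools stats SN keys, but the full section contains many more fields that would need to be replicated.

```rust
// Minimal SN-section writer in samtools-stats style:
// "SN<TAB>description:<TAB>value", one line per counter.
struct StatsCounters {
    total: u64,
    mapped: u64,
    duplicates: u64,
}

fn write_sn_section(c: &StatsCounters) -> String {
    let mut out = String::new();
    out.push_str(&format!("SN\traw total sequences:\t{}\n", c.total));
    out.push_str(&format!("SN\treads mapped:\t{}\n", c.mapped));
    out.push_str(&format!("SN\treads duplicated:\t{}\n", c.duplicates));
    out
}

fn main() {
    let c = StatsCounters { total: 1000, mapped: 990, duplicates: 50 };
    print!("{}", write_sn_section(&c));
}
```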
mosdepth output → consumed by MultiQC
MultiQC's mosdepth module reads:
- `*.mosdepth.summary.txt` — per-chromosome mean depth
- `*.mosdepth.global.dist.txt` — cumulative coverage distribution
- `*.regions.bed.gz` (optional) — per-region depth for WES targets
These are simple TSV files. The format is straightforward to replicate.
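For instance, a single summary row (chrom, length, bases, mean, min, max, tab-separated) might be produced as below. The exact column set and precision here are assumptions that should be verified against real mosdepth output before adoption.

```rust
// One mosdepth-style summary.txt row; mean depth is bases / length,
// printed to two decimal places.
fn summary_row(chrom: &str, length: u64, bases: u64, min: u64, max: u64) -> String {
    let mean = bases as f64 / length as f64;
    format!("{}\t{}\t{}\t{:.2}\t{}\t{}", chrom, length, bases, mean, min, max)
}

fn main() {
    // chr1 at ~30x: length × 30 bases observed.
    println!("{}", summary_row("chr1", 248_956_422, 7_468_692_660, 0, 812));
}
```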
NGSCheckMate VCF → consumed by ncm.py
ncm.py reads standard VCF files and extracts genotype information at the SNP positions. The per-sample VCFs produced by bcftools mpileup | bcftools call are published to {outdir}/reports/ngscheckmate/vcfs/ but are not consumed by any other pipeline process. They exist purely as audit trail for the sample matching.
The VCF format required is minimal:
```
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr1	1234567	.	A	G	.	.	DP=45	GT:AD:DP	0/1:22,23:45
```
The ncm.py correlation algorithm fundamentally just needs allele fractions at each SNP position. The existing pipeline passes --ploidy 1 -c to bcftools call, producing haploid calls — a deliberate simplification since NGSCheckMate only needs to distinguish homozygous ref, heterozygous, and homozygous alt.
RustQC can assign GT based on simple allele fraction thresholds (e.g., <0.15 = 0/0, 0.15–0.85 = 0/1, >0.85 = 1/1) without running a proper genotype likelihood model, and ncm.py will work identically.
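That thresholding is small enough to state directly. A sketch using the example cut-offs above, which would still need validation against real data:

```rust
/// Assign a diploid-style genotype string from the alt-allele fraction,
/// using the example thresholds from this proposal (not validated
/// parameters): <0.15 → hom-ref, 0.15–0.85 → het, >0.85 → hom-alt.
fn genotype(alt_fraction: f64) -> &'static str {
    if alt_fraction < 0.15 {
        "0/0" // effectively homozygous reference
    } else if alt_fraction <= 0.85 {
        "0/1" // heterozygous
    } else {
        "1/1" // homozygous alternate
    }
}

fn main() {
    // e.g. 23 alt reads out of 45 total → heterozygous
    assert_eq!(genotype(23.0 / 45.0), "0/1");
    assert_eq!(genotype(0.02), "0/0");
    assert_eq!(genotype(0.97), "1/1");
}
```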
Pipeline integration
The integration point is clean. Currently, sarek calls two subworkflows at each checkpoint:
```
CRAM_QC_MOSDEPTH_SAMTOOLS(cram, fasta, intervals)
CRAM_SAMPLEQC(cram, ngscheckmate_bed, fasta, ...)
```

These would be replaced by a single module call:

```
RUSTQC_ALIGN(cram, fasta, intervals, ngscheckmate_bed)
```
The outputs map directly:
| Current output | RustQC equivalent | Consumer |
|---|---|---|
| `SAMTOOLS_STATS.out.stats` | `rustqc_align.stats` | MultiQC |
| `MOSDEPTH.out.summary` | `rustqc_align.mosdepth.summary.txt` | MultiQC |
| `MOSDEPTH.out.dist` | `rustqc_align.mosdepth.global.dist.txt` | MultiQC |
| `MOSDEPTH.out.regions` | `rustqc_align.regions.bed.gz` | MultiQC |
| `BCFTOOLS_MPILEUP.out.vcf` | `rustqc_align.ngscheckmate.vcf.gz` | `ncm.py`, published |
ncm.py continues to run as a separate downstream aggregation step — it's fast (works on small VCFs) and doesn't need replacing.
What changes per checkpoint
| Checkpoint | Current passes | RustQC passes | Saving |
|---|---|---|---|
| Post-MarkDup | 2 (stats + mosdepth) | 1 | 1 pass |
| Post-BQSR | 2 (stats + mosdepth) | 1 | 1 pass |
| Sample QC | 2 (mpileup chr1 + chr17) | 0 (folded into checkpoint 1 or 2) | 2 passes |
| Total per sample | 6–7 | 1–2 | 4–5 passes eliminated |
The NGSCheckMate genotyping no longer needs its own checkpoint at all — it piggybacks on whichever stats checkpoint runs last (post-BQSR if enabled, post-MarkDup otherwise).
Estimated impact
Per-sample savings
| Scenario | Current QC time | With RustQC | Saved |
|---|---|---|---|
| WES (HCC1395-like) | ~3.5 hours | ~10 minutes | ~3 hours 20 min |
| 30× WGS | ~7 hours | ~45 minutes | ~6 hours 15 min |
| 60× WGS | ~14 hours | ~1.5 hours | ~12 hours 30 min |
The dramatic savings come almost entirely from eliminating the mpileup steps. The stats + mosdepth merge saves minutes; the mpileup elimination saves hours.
Container launch savings
Each eliminated process also eliminates a container launch on cloud executors. On AWS Batch, container scheduling + EBS volume attachment adds ~900 seconds of wall-clock overhead per task.
| Cohort | Eliminated processes | Container overhead saved |
|---|---|---|
| 100 WES samples | ~500 | ~125 hours wall-clock |
| 100 WGS samples | ~500 | ~125 hours wall-clock |
| 1000 WGS samples | ~5000 | ~1250 hours wall-clock |
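As a sanity check on these figures, the arithmetic is simply eliminated tasks × ~900 s of per-task overhead, converted to hours:

```rust
/// Wall-clock hours of scheduling overhead for a cohort, given the
/// number of eliminated tasks and ~900 s of overhead per task.
fn overhead_hours(eliminated_tasks: u64, per_task_secs: u64) -> f64 {
    (eliminated_tasks * per_task_secs) as f64 / 3600.0
}

fn main() {
    println!("{}", overhead_hours(500, 900));   // 100-sample cohort: 125 h
    println!("{}", overhead_hours(5000, 900));  // 1000-sample cohort: 1250 h
}
```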
Cost projection (AWS spot, r6id.4xlarge @ $0.38/hr)
| Cohort | CPU hours saved | Approx. cost saved |
|---|---|---|
| 100 WES tumor/normal | ~660 hours | ~$250 |
| 100 WGS 30× | ~1,250 hours | ~$475 |
| 1000 WGS 30× | ~12,500 hours | ~$4,750 |
Implementation approach
Rust crate dependencies
| Crate | Purpose |
|---|---|
| `noodles` or `rust-htslib` | CRAM/BAM reading and CIGAR parsing |
| `noodles-fasta` | Reference sequence access |
| `flate2` / `bgzip` | Compressed output (BED, VCF) |
| `clap` | CLI argument parsing |
noodles is a pure-Rust htslib alternative with no C dependencies, which simplifies compilation and container builds. rust-htslib wraps the C htslib and may offer better CRAM decompression performance — benchmarking needed.
Output validation strategy
Before adoption, RustQC outputs must be validated against the existing tools on real data:
- Run both RustQC and the existing tools on the same CRAM
- Compare `samtools stats` output field-by-field (allow for minor floating-point differences in percentage fields)
- Compare `mosdepth` depth values per window (should be identical)
- Compare NGSCheckMate VCF genotypes at each SNP position
- Run `ncm.py` on both VCF sets and confirm identical correlation matrices
Phased delivery
Phase 1: samtools stats + mosdepth equivalent
Implement the streaming stats and depth computation. This is the lowest-risk component — the algorithms are well-understood accumulator patterns, and the output formats are stable TSVs. Can be deployed as a drop-in replacement at the existing checkpoints without touching the NGSCheckMate workflow.
Phase 2: Add targeted genotyping
Add the SNP position lookup and allele counting to the same pass. Emit a minimal VCF. Validate against bcftools mpileup output on a reference dataset. Once validated, the BAM_NGSCHECKMATE subworkflow's mpileup step can be removed, with its VCF input replaced by RustQC's output.
Phase 3: Optimisation and cross-pipeline rollout
Profile and optimise the hot path (CRAM decompression is likely the bottleneck). Consider multi-threaded decompression. Adapt for atacseq/chipseq (drop the genotyping component, add insert size distribution for ATAC fragment size analysis).
Risks and considerations
CRAM decompression is the real bottleneck. Both samtools and mosdepth are fast C tools — the time is dominated by CRAM decompression, not computation. A Rust tool reading the same CRAM still has to decompress it. The saving comes from doing it once instead of 6–7 times, not from Rust being faster than C at decompression.
Output format fidelity. MultiQC parsers are format-sensitive. Any deviation in the stats file format (missing sections, different column ordering, different precision) will cause MultiQC to fail silently or produce incorrect plots. Extensive regression testing against real samtools output is essential.
Edge cases in CIGAR parsing. The genotyping component needs to correctly resolve query positions from CIGAR strings, handling soft clips, insertions, deletions, and supplementary alignments. This is well-trodden ground but a source of subtle bugs.
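A sketch of that resolution step, with a deliberately simplified CIGAR op set (hard clips and padding omitted); `Cig` and `ref_to_query` are illustrative names, not a real library's API.

```rust
#[derive(Clone, Copy)]
enum Cig {
    Match(u64),    // M/=/X: consumes query and reference
    Ins(u64),      // I: consumes query only
    Del(u64),      // D/N: consumes reference only
    SoftClip(u64), // S: consumes query only
}

/// Map a 0-based reference position to a 0-based query offset, given
/// the read's reference start. Returns None if the position falls in a
/// deletion or outside the alignment.
fn ref_to_query(cigar: &[Cig], read_start: u64, ref_pos: u64) -> Option<usize> {
    if ref_pos < read_start {
        return None;
    }
    let (mut rpos, mut qpos) = (read_start, 0u64);
    for &op in cigar {
        match op {
            Cig::Match(n) => {
                if ref_pos < rpos + n {
                    return Some((qpos + (ref_pos - rpos)) as usize);
                }
                rpos += n;
                qpos += n;
            }
            Cig::Ins(n) | Cig::SoftClip(n) => qpos += n,
            Cig::Del(n) => {
                if ref_pos < rpos + n {
                    return None; // position deleted in this read
                }
                rpos += n;
            }
        }
    }
    None
}

fn main() {
    // 5S10M2D10M starting at reference position 100.
    let cig = [Cig::SoftClip(5), Cig::Match(10), Cig::Del(2), Cig::Match(10)];
    assert_eq!(ref_to_query(&cig, 100, 103), Some(8));  // first match block
    assert_eq!(ref_to_query(&cig, 100, 110), None);     // inside the deletion
    assert_eq!(ref_to_query(&cig, 100, 112), Some(15)); // second match block
}
```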
Community adoption. samtools and mosdepth are beloved, trusted tools. Replacing them requires demonstrating bit-level output compatibility and significant performance gains. The NGSCheckMate mpileup elimination is the compelling story — "same results, 3 hours faster per sample" is easy to sell.