Germline variant calling: GATK short-variant pipeline (Nextflow)

A reproducible Nextflow DSL2 pipeline that takes raw Illumina short reads through the GATK germline short-variant best-practice path to a filtered, annotated cohort VCF. It is the DNA-variant member of my bioinformatics portfolio: where the bulk RNA-seq and spatial pipelines quantify expression, this one genotypes SNVs and short indels from the genome itself — the most common task in clinical and population genomics.

Data

The offline path runs on a synthetic diploid cohort — six samples in two groups, with a planted set of germline variants (shared across the cohort plus group-specific) written to a ground-truth VCF — so the whole DAG, including a recovery self-check, runs in under a minute with no download. The real target is the Genome in a Bottle sample HG002 (NIST / NA24385): a high-confidence slice of GRCh38 chr20, with reads range-sliced straight out of GIAB’s chr20 300× BAM over HTTPS (only the bytes for the target window are fetched), benchmarked against the NIST v4.2.1 high-confidence truth VCF and confident-region BED. The reference is GRCh38 chr20 (UCSC) and the annotation is GENCODE v46.

Pipeline

A small, readable DSL2 workflow (channels, processes, publishDir, profile-driven config) over the standard germline toolchain. Twelve stages:

SUBSAMPLE: deterministic seqtk downsampling.
QC_TRIM: fastp adapter/quality trimming.
PREP_REFERENCE: samtools faidx + GATK sequence dictionary + bwa index.
ALIGN: bwa mem with per-sample read groups, sorted with samtools.
MARK_DUPLICATES: gatk MarkDuplicates.
CALL_VARIANTS: gatk HaplotypeCaller in GVCF mode per sample (local haplotype reassembly).
JOINT_GENOTYPE: CombineGVCFs + GenotypeGVCFs — the GATK best-practice cohort call.
FILTER_VARIANTS: gatk VariantFiltration hard filters on separate SNP and indel tracks (the documented substitute for VQSR on small cohorts).
NORMALIZE: bcftools norm splits multiallelics and left-aligns indels.
ANNOTATE: a hand-rolled genic-consequence annotator (exonic / intronic / intergenic) that also tags SNVs as transitions or transversions.
MAKE_OVERVIEW / MAKE_GENO_HEATMAP: the two interactive charts below.

Result

On the synthetic cohort the pipeline achieves perfect recovery of the planted variants — 24/24 sites at recall = precision = 1.00, with genotype concordance 1.00 across all 144 sample-genotypes and a cohort Ts/Tv of 2.17 — and the Nextflow and run_local.sh paths produce an identical normalized callset. The self-check (check_truth.py) normalizes the called VCF and the planted truth with bcftools and asserts recall, precision, and genotype concordance pass threshold — the variant-calling analog of the planted-DE-gene and planted-cell-type checks in the sibling pipelines. The overview chart breaks each sample’s calls down by genic region; the genotype heatmap shows the planted group structure separating cleanly into group-A and group-B variant blocks.

Platforms & Tools: Nextflow DSL2, BWA-MEM, samtools, GATK4 (MarkDuplicates / HaplotypeCaller / GenotypeGVCFs / VariantFiltration), bcftools, fastp, seqtk, Python (pandas / numpy / scipy / plotly), GIAB / NIST, GENCODE, conda, pytest

The pipeline source and the main.nf workflow live in bioinformatics-public/variant_calling_nf; see docs/REPORT.md for a full run report. The whole thing is reproducible with one fetch script, a pinned conda env, and an offline synthetic mode that self-checks variant recovery and acts as the validation path.

Alt-allele dosage of the most discriminating variants across the synthetic cohort. The planted group-specific variants partition cleanly into group-A and group-B blocks; shared variants are present across all samples.

Per-sample PASS-variant counts split by genic region (exonic / intronic / intergenic), with the cohort transition/transversion ratio — a standard germline callset-QC read-out.

Scope note: the headline metrics are from the offline synthetic cohort, where ground truth is exact. The real GIAB HG002 run is fully wired and the data sources verified (fetch_real_data.sh), benchmarked with bcftools isec against the NIST truth, but the multi-GB-adjacent fetch is run on demand rather than in CI. Filtering is GATK hard-filters rather than VQSR/CNN (which need large cohorts or pretrained models), and somatic calling (Mutect2) is a natural drop-in sibling — both deliberate, documented trade-offs that keep the project laptop-reproducible.