Germline variant calling: GATK short-variant pipeline (Nextflow)
A reproducible Nextflow DSL2 pipeline that takes raw Illumina short reads through the GATK germline short-variant best-practice path to a filtered, annotated cohort VCF. It is the DNA-variant member of my bioinformatics portfolio: where the bulk RNA-seq and spatial pipelines quantify expression, this one genotypes SNVs and short indels from the genome itself — the most common task in clinical and population genomics.
Data
The offline path runs on a synthetic diploid cohort — six samples in two groups, with a planted set of germline variants (shared across the cohort plus group-specific) written to a ground-truth VCF — so the whole DAG, including a recovery self-check, runs in under a minute with no download. The real target is the Genome in a Bottle sample HG002 (NIST / NA24385): a high-confidence slice of GRCh38 chr20, with reads range-sliced straight out of GIAB’s chr20 300× BAM over HTTPS (only the bytes for the target window are fetched), benchmarked against the NIST v4.2.1 high-confidence truth VCF and confident-region BED. The reference is GRCh38 chr20 (UCSC) and the annotation is GENCODE v46.
Pipeline
A small, readable DSL2 workflow (channels, processes, publishDir,
profile-driven config) over the standard germline toolchain. Twelve stages:
SUBSAMPLE: deterministicseqtkdownsampling.QC_TRIM:fastpadapter/quality trimming.PREP_REFERENCE:samtools faidx+ GATK sequence dictionary +bwa index.ALIGN:bwa memwith per-sample read groups, sorted withsamtools.MARK_DUPLICATES:gatk MarkDuplicates.CALL_VARIANTS:gatk HaplotypeCallerin GVCF mode per sample (local haplotype reassembly).JOINT_GENOTYPE:CombineGVCFs+GenotypeGVCFs— the GATK best-practice cohort call.FILTER_VARIANTS:gatk VariantFiltrationhard filters on separate SNP and indel tracks (the documented substitute for VQSR on small cohorts).NORMALIZE:bcftools normsplits multiallelics and left-aligns indels.ANNOTATE: a hand-rolled genic-consequence annotator (exonic / intronic / intergenic) that also tags SNVs as transitions or transversions.MAKE_OVERVIEW/MAKE_GENO_HEATMAP: the two interactive charts below.
Result
On the synthetic cohort the pipeline achieves perfect recovery of the
planted variants — 24/24 sites at recall = precision = 1.00, with genotype
concordance 1.00 across all 144 sample-genotypes and a cohort Ts/Tv of
2.17 — and the Nextflow and run_local.sh paths produce an identical
normalized callset. The self-check (check_truth.py) normalizes the called VCF
and the planted truth with bcftools and asserts recall, precision, and
genotype concordance pass threshold — the variant-calling analog of the
planted-DE-gene and planted-cell-type checks in the sibling pipelines. The
overview chart breaks each sample’s calls down by genic region; the genotype
heatmap shows the planted group structure separating cleanly into group-A and
group-B variant blocks.
Platforms & Tools: Nextflow DSL2, BWA-MEM, samtools, GATK4 (MarkDuplicates / HaplotypeCaller / GenotypeGVCFs / VariantFiltration), bcftools, fastp, seqtk, Python (pandas / numpy / scipy / plotly), GIAB / NIST, GENCODE, conda, pytest
The pipeline source and the main.nf workflow live in bioinformatics-public/variant_calling_nf; see docs/REPORT.md for a full run report. The whole thing is reproducible with one fetch script, a pinned conda env, and an offline synthetic mode that self-checks variant recovery and acts as the validation path.
Scope note: the headline metrics are from the offline synthetic cohort, where ground truth is exact. The real GIAB HG002 run is fully wired and the data sources verified (
fetch_real_data.sh), benchmarked withbcftools isecagainst the NIST truth, but the multi-GB-adjacent fetch is run on demand rather than in CI. Filtering is GATK hard-filters rather than VQSR/CNN (which need large cohorts or pretrained models), and somatic calling (Mutect2) is a natural drop-in sibling — both deliberate, documented trade-offs that keep the project laptop-reproducible.