Plant bulk RNA-seq: drought tolerance differential expression (Nextflow)
A reproducible Nextflow DSL2 pipeline that takes raw Illumina bulk RNA-seq reads from two rice cultivars with opposite drought phenotypes (the tolerant landrace Apo and the susceptible variety IR64) all the way to a functional read-out of which genes, and which biological processes, distinguish the two. It’s the read-processing companion to my bulk AML pipeline: where that one starts from pre-quantified counts, this one demonstrates the full short-read path (QC and trimming, spliced genome alignment, gene-level quantification, differential expression, and GO over-representation) on real public plant data.
Data
Paired-end bulk RNA-seq from BioProject PRJNA338445 (Wilkins et al.): the drought-tolerant cultivar Apo and the drought-susceptible cultivar IR64, each sampled under control and drought-stress conditions. Reads are pulled straight from the ENA and stream-subsampled (only the leading read pairs of each run are downloaded) so the whole study runs on a laptop. The reference is the Ensembl Plants IRGSP-1.0 rice genome and annotation, and the GO gene sets are built from Ensembl Plants BioMart so their gene IDs match the annotation exactly.
Pipeline
A small, readable DSL2 workflow (channels, processes, publishDir,
profile-driven config) over a standard short-read toolchain. Ten stages:
SUBSAMPLE: deterministicseqtkdownsampling.QC_TRIM:fastpadapter/quality trimming + FastQC.HISAT2_BUILD/ALIGN: splicedHISAT2alignment to the rice genome, sorted withsamtools.QUANTIFY:featureCountsgene-level counts.BUILD_MATRIX: tidy counts + sample metadata.RUN_DE: pydeseq2: DESeq2 median-of-ratios normalization and a negative-binomial Wald test on the design~condition + genotype, so the genotype contrast (tolerant vs. susceptible) is estimated controlling for the drought treatment.ENRICH: hypergeometric GO over-representation on the significant DE genes, with a hand-rolled Benjamini-Hochberg adjustment.MAKE_HEATMAP/MAKE_ENRICH_PLT: the two interactive charts below.
The enriched processes are the phenotype link: they name the biology (stress response, signalling, transport) that the most differentially expressed genes belong to, connecting the transcriptome back to drought tolerance.
Result
Across ~2 M read pairs/sample (HISAT2 to IRGSP-1.0, 83-94% aligned, 38,993 genes quantified), 56 genes separate the tolerant and susceptible cultivars at padj < 0.05. Those genes are over-represented for defense response (padj 1.4 × 10⁻⁴), response to other organism, and defense response to other organism, with diterpenoid metabolic process (rice phytoalexin / momilactone biosynthesis) and photosynthesis close behind. That is the phenotype link in one line: the transcriptional gap between a drought-tolerant and a drought-susceptible rice line concentrates in stress- and defense-related biology. The heatmap below shows the two genotype blocks separating cleanly; the bar chart ranks the enriched processes.
Platforms & Tools: Nextflow DSL2, HISAT2, samtools, subread/featureCounts, fastp, seqtk, Python (pydeseq2 / pandas / numpy / scipy / plotly), Ensembl Plants, ENA, conda, pytest
The pipeline source and the main.nf workflow live in bioinformatics-public/plant_rnaseq_nf; see docs/REPORT.md for a full run report. The whole thing is reproducible with one fetch script, a pinned conda env, and an offline toy-genome mode used as the CI smoke test.
Scope note: reads are stream-subsampled and the two within-genotype libraries (control + stress) act as replicates for the genotype contrast, a deliberate, documented trade-off that keeps the project laptop-reproducible rather than a maximally powered study. Swapping in more biological replicates and the full read depth is a drop-in change.