Plant bulk RNA-seq: drought tolerance differential expression (Nextflow)

A reproducible Nextflow DSL2 pipeline that takes raw Illumina bulk RNA-seq reads from two rice cultivars with opposite drought phenotypes (the tolerant landrace Apo and the susceptible variety IR64) all the way to a functional read-out of which genes, and which biological processes, distinguish the two. It’s the read-processing companion to my bulk AML pipeline: where that one starts from pre-quantified counts, this one demonstrates the full short-read path (QC and trimming, spliced genome alignment, gene-level quantification, differential expression, and GO over-representation) on real public plant data.

Data

Paired-end bulk RNA-seq from BioProject PRJNA338445 (Wilkins et al.): the drought-tolerant cultivar Apo and the drought-susceptible cultivar IR64, each sampled under control and drought-stress conditions. Reads are pulled straight from the ENA and stream-subsampled (only the leading read pairs of each run are downloaded) so the whole study runs on a laptop. The reference is the Ensembl Plants IRGSP-1.0 rice genome and annotation, and the GO gene sets are built from Ensembl Plants BioMart so their gene IDs match the annotation exactly.

Pipeline

A small, readable DSL2 workflow (channels, processes, publishDir, profile-driven config) over a standard short-read toolchain. Ten stages:

SUBSAMPLE: deterministic seqtk downsampling.
QC_TRIM: fastp adapter/quality trimming + FastQC.
HISAT2_BUILD / ALIGN: spliced HISAT2 alignment to the rice genome, sorted with samtools.
QUANTIFY: featureCounts gene-level counts.
BUILD_MATRIX: tidy counts + sample metadata.
RUN_DE: pydeseq2: DESeq2 median-of-ratios normalization and a negative-binomial Wald test on the design ~condition + genotype, so the genotype contrast (tolerant vs. susceptible) is estimated controlling for the drought treatment.
ENRICH: hypergeometric GO over-representation on the significant DE genes, with a hand-rolled Benjamini-Hochberg adjustment.
MAKE_HEATMAP / MAKE_ENRICH_PLT: the two interactive charts below.

The enriched processes are the phenotype link: they name the biology (stress response, signalling, transport) that the most differentially expressed genes belong to, connecting the transcriptome back to drought tolerance.

Result

Across ~2 M read pairs/sample (HISAT2 to IRGSP-1.0, 83-94% aligned, 38,993 genes quantified), 56 genes separate the tolerant and susceptible cultivars at padj < 0.05. Those genes are over-represented for defense response (padj 1.4 × 10⁻⁴), response to other organism, and defense response to other organism, with diterpenoid metabolic process (rice phytoalexin / momilactone biosynthesis) and photosynthesis close behind. That is the phenotype link in one line: the transcriptional gap between a drought-tolerant and a drought-susceptible rice line concentrates in stress- and defense-related biology. The heatmap below shows the two genotype blocks separating cleanly; the bar chart ranks the enriched processes.

Platforms & Tools: Nextflow DSL2, HISAT2, samtools, subread/featureCounts, fastp, seqtk, Python (pydeseq2 / pandas / numpy / scipy / plotly), Ensembl Plants, ENA, conda, pytest

The pipeline source and the main.nf workflow live in bioinformatics-public/plant_rnaseq_nf; see docs/REPORT.md for a full run report. The whole thing is reproducible with one fetch script, a pinned conda env, and an offline toy-genome mode used as the CI smoke test.

Top differentially expressed genes between the tolerant (Apo) and susceptible (IR64) cultivars, row z-scored and clustered. Genotype blocks separate cleanly.

Biological processes over-represented among the DE genes: the phenotype-facing summary, linking the differentially expressed genes to drought-relevant biology.

Scope note: reads are stream-subsampled and the two within-genotype libraries (control + stress) act as replicates for the genotype contrast, a deliberate, documented trade-off that keeps the project laptop-reproducible rather than a maximally powered study. Swapping in more biological replicates and the full read depth is a drop-in change.