AML transcriptomics: single-cell + bulk RNA-seq
Two complementary takes on acute myeloid leukemia transcriptomics: a single-cell pipeline that characterizes pre-leukemic populations across patient cohorts, and a bulk RNA-seq differential-expression pipeline built as a workflow-engineering exercise. The first asks which cells drive the disease; the second asks which genes separate AML from healthy blood at the cohort level.
Single-cell RNA-seq: pre-leukemic populations
Built a reproducible scRNA-seq pipeline on 38 public AML patient samples, running from raw 10x data through to clinical correlation. The R-side workflow (Seurat) handles per-sample QC, normalization, anchor-based integration, and reference-guided cell-type annotation against a hematopoietic atlas to surface pre-leukemic populations. The Python-side workflow (Scanpy + CellRank) extends the analysis with diffusion pseudotime, GPCCA macrostates, and fate probabilities along HSC-rooted trajectories, then scores the PLPS/Stem11 gene signatures with decoupler and runs Kaplan-Meier survival analysis against TCGA-LAML clinical data using lifelines.
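For readers unfamiliar with the survival step, here is a minimal, dependency-free sketch of what a Kaplan-Meier (product-limit) estimate computes. It is not the pipeline's code: the pipeline uses lifelines (for example its KaplanMeierFitter), and the function name kaplan_meier and the toy follow-up times below are invented purely for illustration.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient (hypothetical units)
    events:    1 if the event (death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # everyone leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= exits[t]
    return curve


# Toy cohort (made-up values, not TCGA data).
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

In a signature-based analysis one would typically compute such a curve separately for patients with high and low signature scores and compare the two; how the pipeline splits the groups is not stated here.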
Findings are surfaced through an interactive R Shiny dashboard with two views: a filterable UMAP of integrated hematopoietic lineages, and a gallery of analysis figures (integrated UMAP, cell-type abundance, macrostates, fate probabilities, metabolic pathway activity, pseudotime gene dynamics, and the two survival plots).
Platforms & Tools: R, Python, R Shiny, Seurat, Scanpy, CellRank, decoupler, lifelines, Jupyter, Conda, shinyapps.io
Source data are drawn from Zeng et al., Cell Genomics (2023), with AML clinical data from NCI. Pipeline source lives in bioinformatics-public/preleukemia_analysis.
Bulk RNA-seq differential expression (Nextflow)
A compact Nextflow DSL2 pipeline for bulk RNA-seq differential expression, AML vs. healthy, run end-to-end on real public RNA-seq cohorts. AML samples come from TCGA-LAML and healthy controls from GTEx whole blood, both pulled from the recount3 project, which re-aligns and re-quantifies TCGA and GTEx through one uniform Monorail / STAR / GENCODE v26 pipeline so the gene-level counts are directly comparable across the two sources.
It’s a workflow-engineering exercise: a small, readable pipeline (channels, processes, publishDir, profile-driven config) sitting on a transparent, dependency-light biology layer of library-size CPM normalization, a per-gene Welch t-test on log2-CPM, and a hand-rolled Benjamini-Hochberg FDR. The interactive volcano below labels the canonical AML markers (FLT3, KIT, MEIS1, HOXA9, MPO, CD34, …), which sit cleanly above the significance line.
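As an illustration of what a hand-rolled Benjamini-Hochberg adjustment can look like, here is a short standard-library sketch. The function name benjamini_hochberg and the example p-values are assumptions made for this example; the repository's own implementation may differ in detail.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted ascending.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four hypothetical genes.
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = benjamini_hochberg(pvals)
print(qvals)
```

For these four values the adjusted p-values come out at approximately 0.02, 0.04, 0.04, and 0.02. The backwards pass is what keeps the q-values monotone in the ranking, so a gene with a smaller raw p-value never ends up with a larger adjusted one.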
Four stages:
LOAD_COUNTS: join the TCGA-LAML + GTEx gene sums on Ensembl ID, map to HGNC symbols via GENCODE v26, subsample to balanced groups, and filter low-expression genes.
NORMALIZE_COUNTS: library-size CPM, then log2(CPM + 1).
RUN_DE: per-gene Welch t-test with BH-adjusted p-values.
MAKE_VOLCANO: an interactive Plotly volcano.
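The NORMALIZE_COUNTS and RUN_DE stages can be sketched in a few lines of numpy and scipy. This is a minimal sketch, not the pipeline's actual code: the function names log2_cpm and welch_per_gene, the genes-by-samples array layout, and the toy count matrix are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def log2_cpm(counts):
    """Library-size CPM followed by log2(CPM + 1).

    counts: genes x samples array of raw counts (assumed layout).
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0, keepdims=True)  # total counts per sample
    cpm = counts / lib_sizes * 1e6
    return np.log2(cpm + 1.0)


def welch_per_gene(logexpr, is_case):
    """Per-gene Welch t-test (unequal variances) of cases vs. controls."""
    is_case = np.asarray(is_case, dtype=bool)
    case, ctrl = logexpr[:, is_case], logexpr[:, ~is_case]
    t, p = stats.ttest_ind(case, ctrl, axis=1, equal_var=False)
    log2fc = case.mean(axis=1) - ctrl.mean(axis=1)  # difference of log2 means
    return log2fc, t, p


# Toy matrix (invented values): 3 genes x 6 samples, first 3 samples are cases.
counts = np.array([
    [500, 520, 480,  50,  60,  55],   # higher in cases
    [100, 110,  90, 105,  95, 100],   # roughly flat
    [ 50,  55,  60, 510, 490, 500],   # lower in cases
])
is_case = [True, True, True, False, False, False]

logexpr = log2_cpm(counts)
log2fc, t, p = welch_per_gene(logexpr, is_case)
print(np.round(log2fc, 2))
print(p)
```

The resulting p-values would then go through the BH adjustment, and log2fc against -log10 of the adjusted p-value gives the two axes of a volcano plot.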
The real-data inputs (~130 MB from recount3 + the GENCODE annotation) are fetched once with a small fetch_real_data.sh helper, and the whole thing runs in seconds on a laptop. The pipeline also comes with a pinned conda env, project-relative paths, and fast, data-free unit tests for the DE math.
Comparator caveat: GTEx has no bone-marrow tissue, so whole peripheral blood is the closest large healthy comparator. The AML markers recover cleanly, but progenitor-associated genes can read as “up in AML” simply because mature blood lacks progenitor populations. Swapping in a healthy bone-marrow cohort is the natural next step.
Platforms & Tools: Nextflow DSL2, Python (numpy / pandas / scipy / plotly), recount3, GENCODE v26, conda, pytest
The pipeline source and the main.nf workflow live in bioinformatics-public/aml_rnaseq_nf; see docs/REPORT.md for a full run report: dataset provenance, the embedded volcano, and a runtime profile.