JCaperella Bioinformatics and Solutions Hub

Welcome

Pick a tab above to view a one-page overview of each app, including what it does, how to start, and links to the code. Each section follows the same clean card layout for quick scanning.

NEW

JC Enrichment Network Studio (Dash App)

Interactive Dash web app for exploring gene ↔ pathway/term enrichment results as a bipartite network. Upload long-format enrichment memberships, filter and explore hubs, view summary stats, and export node/edge tables for downstream analysis (e.g., Cytoscape).

📄 Input: long-format CSV (one row per gene–term membership edge)
🧬 Bipartite graph: genes ↔ pathways/terms
🎚️ Filters: search, min degree, min edge weight, max groups, layout mode
📊 Stats: node/edge counts, components, top genes/terms by degree
⚖️ Weighted edges: supports padj/FDR; auto-converts to -log10(p) for plotting
⬇️ Export: download nodes.csv and edges.csv
☁️ Live deployment: Google Cloud Run

Open Live App | View on GitHub

Tip: If you have a standard enrichment table (term + list of genes), convert it to long format first (one gene per row per term). This app is designed for clean network building and fast exploration.
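The conversion described in the tip can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the app's own code: the column names (term, padj, genes), the ";" separator, and the helper name wide_to_long are assumptions chosen for the example. It also shows the -log10 weighting and the degree counts mentioned in the feature list.

```python
import csv
import io
import math
from collections import Counter

# Hypothetical wide-format table: one row per term, genes joined by ';'.
WIDE_CSV = """term,padj,genes
Apoptosis,0.001,TP53;BAX;CASP3
Cell cycle,0.01,TP53;CDK1
"""

def wide_to_long(wide_text, sep=";"):
    """Explode a term + gene-list table into one row per gene-term edge."""
    edges = []
    for row in csv.DictReader(io.StringIO(wide_text)):
        padj = float(row["padj"])
        # -log10 turns small adjusted p-values into large edge weights.
        weight = -math.log10(padj) if padj > 0 else float("inf")
        for gene in row["genes"].split(sep):
            gene = gene.strip()
            if gene:
                edges.append(
                    {"gene": gene, "term": row["term"], "padj": padj, "weight": weight}
                )
    return edges

edges = wide_to_long(WIDE_CSV)

# Node degree in the bipartite graph = number of edges touching the node.
gene_degree = Counter(e["gene"] for e in edges)
term_degree = Counter(e["term"] for e in edges)

print(len(edges), gene_degree["TP53"], term_degree["Apoptosis"])
```

Each dictionary in edges corresponds to one row of the long-format CSV; writing them out with csv.DictWriter would give a file in the one-row-per-membership shape the app expects, subject to whatever column names the app actually requires.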

Bulk RNA-Seq Analyzer

Interactive Shiny app for bulk RNA-seq: differential expression, PCA/UMAP, volcano & heatmaps, enrichment, Random Forest, power analysis, and downloadable results.

🗂️ Inputs: counts matrix + phenotype CSV
🧬 DE: limma-voom workflow
🧭 Dims: PCA & UMAP
🌋 Plots: volcano, heatmap, interactive tables
🧠 ML: Random Forest + ROC/AUC
🧪 Pathways: Enrichr (KEGG/GO/Reactome)
⚡ Power: sample size/power curves
📦 Deploy: Singularity (designed for HPC environments)

View on GitHub

NEW

Paired FASTQ QC (WDL + Cromwell)

Containerized WDL workflow executed with Cromwell for paired-end FASTQ QC. Runs FastQC on R1/R2 and generates a single MultiQC report across samples, plus a merged read-count table.

🧬 Inputs: paired FASTQs (R1/R2) + sample IDs via JSON
🔍 QC: FastQC per read pair
📊 Multi-sample summary: MultiQC report
🧾 Counts: merged per-sample read counts (R1/R2)
📦 Reproducible: Docker runtime inside WDL tasks

View on GitHub | README

Use case: a compact, portfolio-friendly example of WDL/Cromwell workflow wiring, containerized execution, and QC aggregation across multiple samples.
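For readers curious how a merged read-count table like the one above can be derived, here is a rough standard-library sketch. It is not taken from the workflow itself: the function names and the output column names are hypothetical, and it assumes standard four-line FASTQ records (no wrapped sequence lines).

```python
import gzip
import os
import tempfile

def count_reads(path):
    """Count reads in a FASTQ file (plain or gzipped): 4 lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4

def merged_counts(samples):
    """samples: {sample_id: (r1_path, r2_path)} -> list of table rows."""
    rows = []
    for sample_id, (r1, r2) in sorted(samples.items()):
        rows.append(
            {"sample": sample_id, "r1_reads": count_reads(r1), "r2_reads": count_reads(r2)}
        )
    return rows

# Tiny synthetic demo: three identical records per file.
record = "@read1\nACGT\n+\nIIII\n"
tmpdir = tempfile.mkdtemp()
r1 = os.path.join(tmpdir, "s1_R1.fastq.gz")
r2 = os.path.join(tmpdir, "s1_R2.fastq.gz")
for path in (r1, r2):
    with gzip.open(path, "wt") as out:
        out.write(record * 3)

table = merged_counts({"s1": (r1, r2)})
print(table)
```

For properly paired data the R1 and R2 counts should match, so a mismatch in this table is a quick sanity check before looking at the MultiQC report.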

Counts_matrix_Nextflow (RNA-seq Pipeline)

Portable Nextflow DSL2 RNA-seq workflow: QC → STAR alignment → gene counts → Salmon TPM → BigWig coverage → matrix merge + MultiQC summary. Designed to be minimal, readable, and robust.

📄 Inputs: paired-end FASTQs via samplesheet CSV
🧬 Core steps: fastp → STAR → featureCounts → Salmon → MultiQC
📈 Outputs: gene counts matrix, TPM matrix, BigWig tracks, QC report
🧱 Indexes: auto-builds STAR + Salmon indexes (per run)
🛡️ Robustness: skips coverage gracefully for zero-mapped samples
📦 Runs anywhere: Conda, Docker, or Singularity/Apptainer
🧪 Great for: demos, infra tests, portfolio/template pipelines

View on GitHub | README

Quickstart

Docker

docker build -t rnaseq-pipeline .
nextflow run main.nf -profile docker \
  --samplesheet samples.csv \
  --ref genome.fa \
  --gtf genes.gtf \
  --transcripts transcripts.fa

Singularity/Apptainer

singularity build containers/rnaseq-pipeline.sif docker-daemon://rnaseq-pipeline:latest
nextflow run main.nf -profile singularity \
  --samplesheet samples.csv \
  --ref genome.fa \
  --gtf genes.gtf \
  --transcripts transcripts.fa

Key Outputs

results/
├── qc/                       (fastp)
├── ref/                      (STAR + Salmon indexes)
├── bam/                      (sorted BAM + BAI + flagstat)
├── bigwig/                   (coverage tracks)
├── counts_per_sample/        (per-sample featureCounts)
├── counts_matrix.tsv         (gene count matrix)
├── salmon_tpm_matrix.tsv     (transcript TPM matrix)
└── multiqc_report.html       (summary report)

Notes: intentionally simple; explicit channels; guards against common failure modes. Easy to extend into DESeq2/edgeR downstream analysis.
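As an illustration of the matrix-merge step, the sketch below combines per-sample featureCounts tables into one gene-by-sample matrix. It is a simplified stand-in written for this page, not the pipeline's own merge script; the function names are hypothetical. It relies on the standard featureCounts layout (a leading "#" comment line, then Geneid, Chr, Start, End, Strand, Length, and the count column).

```python
import csv

def read_featurecounts(text):
    """Return {gene_id: count} from the text of one featureCounts output file."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.reader(lines, delimiter="\t")
    next(reader)  # header row: Geneid, Chr, Start, End, Strand, Length, <bam>
    return {row[0]: int(row[6]) for row in reader}

def merge_counts(per_sample):
    """per_sample: {sample_id: featureCounts text} -> (header, rows)."""
    tables = {s: read_featurecounts(t) for s, t in per_sample.items()}
    samples = sorted(tables)
    genes = sorted(set().union(*(t.keys() for t in tables.values())))
    header = ["gene_id"] + samples
    # Genes absent from a sample's table are filled with 0.
    rows = [[g] + [tables[s].get(g, 0) for s in samples] for g in genes]
    return header, rows

# Synthetic inputs for two samples.
FC_S1 = (
    "# Program:featureCounts (comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ts1.bam\n"
    "geneA\tchr1\t1\t100\t+\t100\t5\n"
    "geneB\tchr1\t200\t300\t-\t101\t0\n"
)
FC_S2 = (
    "# Program:featureCounts (comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ts2.bam\n"
    "geneA\tchr1\t1\t100\t+\t100\t7\n"
    "geneC\tchr2\t1\t50\t+\t50\t2\n"
)

header, rows = merge_counts({"s1": FC_S1, "s2": FC_S2})
print(header)
for row in rows:
    print(row)
```

A matrix in this shape (genes as rows, samples as columns, integer counts) is what DESeq2 or edgeR would take as input downstream.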

NEW

Enrichment Analysis LLM Triage (Flask App)

A lightweight Flask web app that takes enrichment results (CSV) and generates a structured, human-readable triage report — highlighting likely biological drivers, reactive programs, potential confounders, and suggested follow-up experiments.

📄 Input: enrichment CSV (terms + scores + genes/overlap fields)
🧠 LLM reasoning: summarizes key programs and flags confounding patterns
🧪 Follow-ups: proposes targeted experiments with readouts + controls
📑 PDF report: generates a clean downloadable triage PDF
📦 Deploy: Docker + Apptainer/Singularity (HPC-friendly)

View on GitHub

Why it matters: enrichment tables are easy to generate but hard to interpret. This tool helps translate “significant pathways” into an actionable short list of mechanisms and concrete next experiments — without drowning the user in jargon.

NEW

Breast Cancer NLP Phenotyper (Dash + medspaCy)

A lightweight, rule-based clinical NLP dashboard for extracting key breast cancer phenotypes from free-text notes (e.g., pathology, consults). Outputs include a clean patient-level table plus auditable evidence mentions with snippets so results can be reviewed and trusted.

📄 Input: upload multiple .txt notes + optional mapping CSV (note → patient/date/type)
🧠 NLP engine: spaCy + medspaCy patterns (deterministic extraction)
🧬 Phenotypes: ER/PR status (+ % if present), HER2 (IHC/FISH → final), Ki-67, histology, grade, stage
🔎 Evidence: each extracted value is backed by mention-level snippets for auditing
🧮 Aggregation: note-type/date precedence rules roll note-level data to patient-level output
📦 Deploy: Singularity (designed for HPC environments)

View on GitHub

Note: This is an MVP designed for transparency and portability — ideal for demos, iteration, and extension into richer rule sets or model-assisted extraction later.
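To make the idea of deterministic extraction with auditable evidence concrete, here is a deliberately small sketch using only Python's re module. It is a stand-in for illustration: the app itself uses spaCy + medspaCy patterns, and the regular expression, field names, and "most recent note wins" precedence rule below are assumptions, not the app's actual rules.

```python
import re

# Simplified stand-in rule for ER/PR status with an optional percentage.
RECEPTOR_RULE = re.compile(
    r"\b(?P<marker>ER|PR)\b\s*(?:is|:|-)?\s*(?P<status>positive|negative)"
    r"(?:\s*\(?(?P<pct>\d{1,3})\s*%\)?)?",
    re.IGNORECASE,
)

def extract_mentions(note_id, note_date, text, window=20):
    """Return one evidence record per match, with a text snippet for auditing."""
    mentions = []
    for m in RECEPTOR_RULE.finditer(text):
        mentions.append({
            "note_id": note_id,
            "note_date": note_date,
            "marker": m.group("marker").upper(),
            "status": m.group("status").lower(),
            "percent": int(m.group("pct")) if m.group("pct") else None,
            "snippet": text[max(0, m.start() - window): m.end() + window],
        })
    return mentions

def patient_level(mentions):
    """Hypothetical precedence: keep the most recent mention per marker.

    ISO-formatted dates (YYYY-MM-DD) sort correctly as plain strings.
    """
    latest = {}
    for m in sorted(mentions, key=lambda x: x["note_date"]):
        latest[m["marker"]] = m
    return latest

notes = [
    ("n1", "2023-01-10", "Invasive ductal carcinoma. ER positive (90%), PR negative. HER2 pending."),
    ("n2", "2023-03-02", "Repeat biopsy: PR positive 5%."),
]
all_mentions = []
for note_id, note_date, text in notes:
    all_mentions.extend(extract_mentions(note_id, note_date, text))

summary = patient_level(all_mentions)
for marker, m in sorted(summary.items()):
    print(marker, m["status"], m["percent"], "from", m["note_id"])
```

Because every patient-level value points back to a note ID and snippet, a reviewer can check each call against the source text, which is the property the dashboard's evidence table is built around.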

ATAC-Seq Peak Annotation & Enrichment

Upload a MACS2 .narrowPeak file, annotate peaks with ChIPseeker, and run GO/KEGG/Reactome enrichment, with pie charts, bar plots, and CSV exports.

📄 Input: MACS2 .narrowPeak file
🏷️ Annotation: ChIPseeker + TxDb
📊 Views: pie charts, tables, barplots
🧠 Pathways: enrichR (GO/KEGG/Reactome)
📦 Deploy: Singularity (designed for HPC environments)

View on GitHub
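For anyone preparing input, the sketch below shows what a .narrowPeak file contains. It parses the standard ten tab-separated columns (BED6 plus signalValue, pValue, qValue, and the summit offset) in plain Python. The app itself does its annotation in R with ChIPseeker; the Peak tuple and function name here are illustrative only.

```python
from collections import namedtuple

Peak = namedtuple(
    "Peak",
    "chrom start end name score strand signal p_value q_value summit_offset",
)

def parse_narrowpeak(text):
    """Parse narrowPeak text into Peak records, skipping header-style lines."""
    peaks = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        f = line.split("\t")
        if len(f) < 10:
            raise ValueError(f"expected 10 columns, got {len(f)}: {line!r}")
        peaks.append(Peak(
            f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
            float(f[6]), float(f[7]), float(f[8]), int(f[9]),
        ))
    return peaks

# Synthetic two-peak example.
DEMO = (
    "chr1\t100\t300\tpeak_1\t250\t.\t12.5\t30.1\t27.4\t80\n"
    "chr2\t5000\t5400\tpeak_2\t120\t.\t6.2\t12.0\t9.8\t150\n"
)

peaks = parse_narrowpeak(DEMO)
for p in peaks:
    # The summit column is an offset from the peak start, not a coordinate.
    print(p.name, p.chrom, p.end - p.start, p.start + p.summit_offset)
```

In MACS2 output the pValue and qValue columns are stored as -log10 values, so larger numbers mean stronger peaks.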

miRNA Differential Expression & Enrichment

DESeq2-based miRNA analysis with PCA/UMAP, volcano & heatmaps, Enrichr enrichment, Random Forest classification, power analysis, and exports.

🗂️ Inputs: miRNA counts + metadata
🧬 DE: DESeq2 pipeline
🧭 Dims: PCA & UMAP
🌋 Plots: volcano, top-miRNA bars, heatmaps
🧠 ML: RF classification + metrics
🧪 Pathways: Enrichr (clusterProfiler fallback)
⚡ Power: sample size estimates
📦 Deploy: Singularity (designed for HPC environments)

View on GitHub

DNA Methylation App

Explore beta values, run differential methylation, enrichment, PCA/UMAP, Random Forest, power analysis, and download everything — HPC-ready.

🗂️ Input: CSV beta matrix (e.g., 450k)
🧪 DE: probe-level stats + FDR
🧭 Dims: PCA & UMAP
🧠 ML: RF + AUC & importance
🧪 Pathways: Enrichr KEGG/GO/Reactome
⚡ Power: Cohen’s d → n per group
📦 Deploy: Singularity/Apptainer

View on GitHub
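The "Cohen's d → n per group" step can be illustrated with the usual normal-approximation formula for a two-sided, two-sample comparison, n = 2(z_{1-α/2} + z_{1-β})² / d². The sketch below is an assumption about the general approach, not the app's exact method; an exact t-based calculation typically returns a slightly larger n.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate samples per group for a two-sided two-sample test.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    if d == 0:
        raise ValueError("effect size d must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

The inverse-square dependence on d is the practical takeaway: halving the effect size roughly quadruples the required sample size.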

CRISPR Mixscape Pipeline (Perturb-seq)

Single-cell CRISPR screen workflow using Seurat’s Mixscape: QC/normalization, UMAP, perturbation scoring, KO/NP/NT assignment, DE, and rich plots — with HPC support.

🧪 Inputs: counts + metadata CSVs
🧭 Dims: UMAP visualization
🧮 Mixscape: perturbation scores & class labels
🧬 DE: KO vs NT + downloads
📈 Views: bar/violin/heatmaps, summaries
📦 Deploy: Singularity + Slurm script

View on GitHub

GWAS Analysis App

A full-stack, no-code Shiny app for Genome-Wide Association Studies (GWAS) using raw VCF files—no PLINK needed. Upload your VCF, phenotype, and covariate table to begin.

🧪 QC Filters: MAF, allele frequency, call rate, HWE p-value thresholds
🧮 GWAS Engine: Logistic regression with Bonferroni correction support
📊 Visualization: PCA, UMAP, QQ plot, Manhattan plot with region zoom
🤖 Machine Learning: Random Forest with AUC, importance, ROC
🧬 SNP-to-Gene Mapping: Map significant SNPs to nearest genes
🧠 Enrichment: KEGG/GO/Reactome via enrichR
⚡ Power Analysis: Cohen’s d-based observed power & sample size curve
📦 Export Everything: GWAS tables, enrichment results, ML metrics, more
📦 Deploy: Singularity (designed for HPC environments)

View on GitHub
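A short sketch of what the QC filters and the Bonferroni correction compute, in plain Python. The function names, the genotype encoding (0/1/2 alternate-allele counts, None for a missing call), and the default thresholds are assumptions for illustration, not the app's implementation or defaults.

```python
import math

def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 alternate-allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)

def passes_qc(genotypes, min_maf=0.01, min_call_rate=0.95):
    """Hypothetical thresholds; real cut-offs depend on the study design."""
    return (
        call_rate(genotypes) >= min_call_rate
        and minor_allele_frequency(genotypes) >= min_maf
    )

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls family-wise error at alpha."""
    return alpha / n_tests

snp_a = [0, 1, 2, 1, 0, None, 1, 0, 0, 1]   # one missing call
snp_b = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]      # monomorphic: MAF is 0

print(call_rate(snp_a), round(minor_allele_frequency(snp_a), 3))
print(passes_qc(snp_a, min_call_rate=0.9), passes_qc(snp_b))
print(bonferroni_threshold(0.05, 1_000_000))
```

With one million tested SNPs, the Bonferroni threshold at alpha = 0.05 works out to 5e-8, the conventional genome-wide significance line drawn on Manhattan plots.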

NEW

Single-cell RNA-Seq App

Interactive Seurat-based app to explore scRNA-seq data, run DE, pathway enrichment, classification, power analyses, and download publication-ready tables/plots.

🗂️ Upload: counts CSV (genes × cells/samples) + metadata CSV (matching names)
🧱 Create Seurat Object: load & normalize your data in-app
🧭 Dimensionality Reduction: PCA & UMAP for cluster/pattern visualization
🧬 Differential Expression: by condition and by cell type; find condition-only DE genes
🧠 Pathway Analysis: enrich DE gene sets across multiple databases
🌋 Volcano Plot: publication-style volcano for condition-only DE
🤖 Feature Selection & Classification: Random Forest markers, ROC, importance
📈 Power Analysis: estimate power and minimum sample size for key DE genes
🔥 Heatmaps: top features and group differences
⬇️ Downloads: export all result tables and figures
📦 Deploy: Singularity (designed for HPC environments)

View on GitHub

Tips: CSV format only; counts & metadata must match by name. For best results, use quality-filtered data (see insurance_policy_script.R).
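Before uploading, a quick name check like the one below can catch mismatches between the two files. This is a rough sketch under stated assumptions: the counts CSV has gene names in the first column and one column per cell/sample, the metadata CSV has the cell/sample name in its first column, and the function name is hypothetical.

```python
import csv
import io

def check_names(counts_text, metadata_text):
    """Compare cell/sample names in the counts header with metadata row names."""
    counts_header = next(csv.reader(io.StringIO(counts_text)))
    cell_names = counts_header[1:]  # first column is assumed to hold gene names
    meta_rows = list(csv.reader(io.StringIO(metadata_text)))[1:]  # skip header
    meta_names = [row[0] for row in meta_rows if row]
    return {
        "missing_in_metadata": sorted(set(cell_names) - set(meta_names)),
        "missing_in_counts": sorted(set(meta_names) - set(cell_names)),
        "same_order": cell_names == meta_names,
    }

# Synthetic example with one mismatch in each direction.
COUNTS = "gene,cell1,cell2,cell3\nGAPDH,10,0,3\n"
METADATA = "cell,condition\ncell1,ctrl\ncell2,treated\ncell4,treated\n"

report = check_names(COUNTS, METADATA)
print(report)
```

Two empty lists mean the names match; same_order additionally tells you whether the files list the cells in the same sequence.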

About Me

I’m John Caperella, a bioinformatics developer passionate about turning complex genomic data into usable insights. I build clear, scalable tools for single-cell RNA-seq, CRISPR Mixscape, and omics visualization in R, Python, and Shiny—helping researchers and data scientists get to answers faster.

🧬 Single-cell & CRISPR analytics (Seurat/Mixscape)
📊 Reproducible visualization apps (Docker/Singularity/HPC)
🤖 Exploring LLMs for genomics workflows

Contact Form | LinkedIn | GitHub | Résumé (HTML)

NEW

Research Radar (PubMed + LLM Summaries)

A lightweight research intelligence dashboard that pulls recent PubMed papers across multiple queries and generates 3-bullet summaries using a local LLM (Ollama). Includes filters, trending topics, and top journals — built for fast scanning.

🧠 LLM summaries: exactly 3 bullets per abstract
📈 Dashboard: trending topics + top journals + search/filter
🔁 Update workflow: regenerate papers.json and push to Pages
🧱 Portfolio pattern: API ingestion → pipeline → structured JSON → UI

Open Research Radar

Tip: This is ideal for tracking AI-for-biology, computational methods, and variant calling literature without drowning in full abstracts.
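One plausible way to implement the ingestion step is NCBI's public E-utilities ESearch endpoint; whether Research Radar uses it is an assumption. The sketch below only builds the request URLs for several queries (it makes no network calls), and the function name and default values are hypothetical.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(query, days=7, retmax=20):
    """Build an ESearch URL for PubMed records published in the last `days` days."""
    params = {
        "db": "pubmed",
        "term": query,
        "reldate": days,      # look-back window in days
        "datetype": "pdat",   # filter on publication date
        "retmax": retmax,     # maximum number of IDs to return
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

queries = ["variant calling", "single-cell RNA-seq"]
urls = {q: esearch_url(q) for q in queries}
for q, url in urls.items():
    print(q, "->", url)
```

ESearch returns PubMed IDs only; fetching the abstracts to summarize would be a second request (for example to EFetch) before anything is written to papers.json.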

NEW

GCP FASTQ Event Pipeline (Eventarc → Cloud Run Job)

Cloud-native, event-driven FASTQ processing on Google Cloud. Upload a FASTQ to Cloud Storage and automatically trigger a serverless pipeline: Eventarc fires on object finalize → Cloud Function launches a containerized Cloud Run Job → results are written back for downstream analysis.

☁️ Trigger: GCS “object finalize” event (Eventarc)
🧩 Orchestration: Cloud Function (glue layer)
📦 Compute: Cloud Run Job (containerized batch)
🧬 Bioinformatics-ready pattern: scalable ETL wiring for genomics ingestion
📤 Outputs: structured results written to storage (and designed to extend to analytics/warehousing)
🛠️ Great for: portfolio demos of serverless pipelines + reproducible containers

View on GitHub | README

Why it matters: This mirrors real-world genomics platform patterns — automatic ingestion triggers + containerized compute, without manual job submission.
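The glue layer's core decision can be sketched without any cloud SDK: read the bucket and object name from the finalize event payload, ignore anything that is not a FASTQ, and work out what to hand to the job. The function name, the environment-variable names, and the results/ prefix are hypothetical; actually launching the Cloud Run Job would be done with the Google Cloud client library and is left out here.

```python
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")

def plan_job(event_data, output_prefix="results/"):
    """Map a GCS object-finalize payload to job settings, or None to skip.

    event_data is the event's data payload, which carries the "bucket"
    and "name" of the finalized object.
    """
    bucket = event_data["bucket"]
    name = event_data["name"]
    # Skip non-FASTQ uploads, and skip our own outputs to avoid re-trigger loops.
    if name.startswith(output_prefix) or not name.endswith(FASTQ_SUFFIXES):
        return None
    base = name.rsplit("/", 1)[-1]
    # Strip the longest matching suffix so "x.fastq.gz" becomes "x".
    for suffix in sorted(FASTQ_SUFFIXES, key=len, reverse=True):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    return {
        "INPUT_URI": f"gs://{bucket}/{name}",
        "OUTPUT_URI": f"gs://{bucket}/{output_prefix}{base}/",
    }

plan = plan_job({"bucket": "my-bucket", "name": "uploads/sample1.fastq.gz"})
skipped = plan_job({"bucket": "my-bucket", "name": "notes.txt"})
print(plan)
print(skipped)
```

Keeping this decision in a pure function makes the trigger logic easy to unit-test locally before it is wired to Eventarc.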

Document Cleaning CLI

AI-powered document cleanup for scanned pages, noisy screenshots, and OCR-bound records. Enhances messy source images into cleaner, sharper outputs that are easier to read, extract, and route into downstream research or clinical workflows.

📥 Inputs: scanned .png/.jpg files or ZIP batches of document images
🧼 Cleanup: deep-learning denoising + image enhancement for messy source material
📄 Outputs: OCR-optimized images and PDF-ready cleaned documents
🛠️ Modes: command-line workflow or REST API deployment
🏥 Use cases: legacy records, scanned notes, exported forms, and other hard-to-read document inputs
⚙️ designed for flexible local or API-based deployment workflows

View on GitHub

Why it matters: a lot of valuable information is trapped inside low-quality visual documents. This tool helps turn noisy records into cleaner, machine-readable assets for OCR, review, and downstream automation.

Usage Analytics

This dashboard is powered by Google Analytics 4 and Looker Studio. It shows historical click activity for each app.
