Research framework · Pilot brief

An Experimentation Layer for Multi-Modal Biomedical Machine Learning

Built during the MITx MicroMasters (Statistics & Data Science) and Harvard ALM (Data Science) programs · integrates Hetionet, Open Targets, ChEMBL, TCGA, GTEx, MIMIC-IV, AlphaFold, and PubMed-class corpora.

SNPTX organizes pipeline execution, comparative evaluation, and downstream analytical extension through artifact-defined stage boundaries and contract-validated extensions. The current build covers eight modality families — tabular, omics, graph, imaging, NLP, single-cell, molecular-graph drug discovery, and late-fusion ensemble — across 46 datasets.

Headline metrics
  • 1,037 experiments / hour
  • 8 modality families / 46 datasets
  • 98.5% accuracy · ChEMBL drug GCN
  • 104 discoveries / 1,100-experiment campaign

Engineering posture: 1,357 tests · 0 ruff errors · 0 pyright errors · Snakemake + DVC + MLflow · deterministic seeding · k-fold leakage gates · 5 implemented + 10 theory-grounded extensions · evidence-bounded across 6 dimensions
Named datasets per modality
Knowledge graphs & pathways: hetionet, open_targets, gene_ontology, reactome_pathways (planned: string_human, primekg)
Molecular graphs & structure: chembl_subset, moleculenet_tox21, alphafold_targets
Omics — transcriptomics & eQTL: tcga_pancan, gtex_v8, gtex_eqtl
Omics — proteomics & metabolomics: cptac, pride_sample, metabolights_mtbls1
Single-cell & spatial: cellxgene_pbmc, tenx_pbmc_3k, visium_brain
Variants & functional genomics: clinvar, thousand_genomes, gnomad, depmap, gwas_catalog
Clinical & pharmacovigilance: synthea_readmission, mimic_iv, eicu_crd, cms_synpuf, faers (planned: aact, omop_synthea)
Text, imaging, time-series: mtsamples, pubmed_sample, mitbih (planned: mimic_cxr, chexpert_small, camelyon17_sample, mimic_notes, hmp2_ibd)
DVC-versioned (s3://snptx-dvc-prod, gs://snptx-dvc-storage); ingested via the adapter-based build_dataset_generic rule.
Capability surface (4) — autonomous engine, extensions, deployment, benchmark surface
Capability · Autonomous experimentation (★ lead capability)

Bayesian optimization daemon (ExperimentEngine): GP surrogates over hyperparameter space, expected-improvement plus value-of-information acquisition, SPRT stopping (α=β=0.05), DuckDB experiment catalog. Validated at 1,037 experiments / hour. Outputs: ranked next-run queue, surrogate-calibration trace, novelty score per candidate.

• Operational today
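
The acquisition step described above can be sketched in a few lines. Below is a minimal, stdlib-only illustration of expected-improvement (EI) scoring over a GP posterior; the function names and candidate values are illustrative, not SNPTX's actual API.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI of a candidate with GP posterior mean `mu` and std `sigma`,
    relative to the incumbent `best` (maximization, exploration margin `xi`)."""
    if sigma <= 0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * normal_cdf(z) + sigma * normal_pdf(z)

# Rank hypothetical (mu, sigma) candidates against an incumbent of 0.98:
# high-uncertainty candidates can outrank slightly-better certain ones.
candidates = [(0.97, 0.02), (0.95, 0.06), (0.985, 0.001)]
ranked = sorted(candidates, key=lambda c: -expected_improvement(*c, best=0.98))
```

Note how the uncertain candidate (0.95, 0.06) ranks first: EI trades off posterior mean against uncertainty, which is what lets a surrogate-driven loop keep exploring.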
Capability · Contract-validated extensions

Seven extension modules under a YAML-contract execution layer; four Tier-1 extensions are contract-validated and stable today. Schema convergence (v1 / v2) is pre-pilot work, scoped and tracked.

• 4 stable · v1/v2 pre-pilot
Capability · Deployment pathway

FastAPI inference (tabular, fusion, batch), RBAC with HMAC-signed tokens, hash-chained audit log (21 CFR Part 11 posture), customer-hosted via Terraform and Helm.

• Available on request
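
The hash-chained audit log is a standard construction; the sketch below (not SNPTX's implementation) shows how each entry commits to its predecessor's hash, so tampering anywhere in the history breaks verification.

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel hash for the first entry

def append_entry(chain, event):
    """Append `event` to a hash-chained log. Each entry's hash covers the
    previous entry's hash, linking the whole history together."""
    prev = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    chain.append({"prev": prev, "event": event,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})
    return chain

def verify(chain):
    """Recompute every hash from the genesis sentinel forward."""
    prev = GENESIS
    for entry in chain:
        payload = json.dumps({"prev": prev, "event": entry["event"]}, sort_keys=True)
        if entry["prev"] != prev or entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, {"stage": "train", "status": "complete"})
append_entry(log, {"stage": "eval", "status": "complete"})
```

Editing any earlier event invalidates every later hash, which is what gives stage transitions independent verifiability.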
Capability · Cross-lab benchmark surface

A shared evaluation surface for graph-ML & foundation-model methods: leakage-free splits, calibration-corrected leaderboards, lineage-stamped checkpoints, contract-validated reports. Designed so two labs can compare results on identical evaluation manifests.

• Operational today
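
Leakage-free splits typically hold an entire group (a patient, a molecular scaffold) on one side of the train/test boundary. A minimal, assumption-laden sketch using deterministic hashing — illustrative only, not SNPTX's splitter:

```python
import hashlib

def group_split(records, group_key, test_frac=0.2):
    """Deterministic group-aware split: all records sharing a group id
    land on the same side, preventing train/test leakage. Hash-based,
    so the assignment is reproducible across machines with no RNG state."""
    train, test = [], []
    for rec in records:
        h = int(hashlib.sha256(str(rec[group_key]).encode()).hexdigest(), 16)
        (test if (h % 100) < test_frac * 100 else train).append(rec)
    return train, test

# Usage: split by patient so no patient appears in both partitions.
records = [{"patient": f"P{i % 5}", "value": i} for i in range(100)]
train, test = group_split(records, "patient")
```

Because assignment is a pure function of the group id, two labs running the same manifest get byte-identical partitions.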
What SNPTX claims

Reproducible multi-modal pipelines, contract-validated extensions, calibrated benchmarks, and an autonomous experimentation surface — validated across 8 axes through versioned artifacts.

What SNPTX does not claim

Diagnostic, prescribing, or autonomous-clinical capability. No replacement for wet-lab validation, IRB, or regulatory review. See Limitations.

Pilot in 4 steps

Proposed pilot

A 6–10 week, jointly authored run on one bounded research workflow: one lab-specific data adapter (mapping your expression matrices, KG triples, or imaging manifests to the stage-1 contract) and one evaluation package built on Tier-1 validated components (calibration-corrected leaderboard, leakage-free comparison, contract-validation report), reproducible from a tagged commit. Deployment hardening only where the pilot requires it.

1 · Scope

Together we pick one workflow, one dataset family, and one evaluation package. The pilot succeeds or fails on a single, agreed objective.

2 · Adapter

A schema adapter maps your inputs — expression matrices, KG triples, imaging manifests, or clinical tabular — to the stage-1 contract. Data stays in your environment.
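
A stage-1 adapter can be as small as a typed field map. The contract fields below are hypothetical stand-ins for illustration — not SNPTX's published schema:

```python
# Hypothetical stage-1 contract: field name -> required type.
REQUIRED_FIELDS = {"sample_id": str, "gene_id": str, "expression": float}

def adapt_record(raw):
    """Map one row of a lab-local expression matrix to the (assumed)
    stage-1 contract, failing loudly on missing or mistyped fields."""
    out = {}
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in raw:
            raise ValueError(f"contract violation: missing {field!r}")
        out[field] = ftype(raw[field])  # coerce, e.g. "7.2" -> 7.2
    return out

row = adapt_record({"sample_id": "S1", "gene_id": "TP53", "expression": "7.2"})
```

The point is the boundary: everything downstream of the adapter sees only contract-conformant records, so your raw data never needs to leave your environment.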

3 · Run

The pipeline executes under Snakemake + DVC with MLflow tracking. The autonomous engine (GP + SPRT + VoI) is opt-in for hyperparameter search.
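
Deterministic replay usually derives per-stage seeds from one master seed. A sketch under that assumption (a real pipeline must also seed numpy/torch and pin library versions; the derivation scheme here is illustrative):

```python
import hashlib

def stage_seed(master_seed, stage_name):
    """Derive a stable per-stage seed from the master seed and the stage
    name, so every pipeline rule replays identically across machines
    without sharing RNG state between stages."""
    digest = hashlib.sha256(f"{master_seed}:{stage_name}".encode()).hexdigest()
    return int(digest[:8], 16)  # 32-bit seed

s = stage_seed(42, "train_model")
```

Hashing rather than incrementing means adding or reordering stages never perturbs the seeds of existing ones.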

4 · Evaluation package

Tier-1 extensions emit auditable artifacts — calibration, metric aggregation, evaluation summary, contract checks — reviewed jointly with your team.

What you receive

Concrete artifacts

  • Reproducible pipeline. Snakemake DAG + DVC-versioned data + MLflow run history; n=3 runs are hash-equal under fixed seeds.
  • Evaluation package. Calibration + metric aggregation + leakage-free comparison + contract-validation report.
  • Lineage manifest. Per-checkpoint data hash, model hash, commit, and config — suitable for a reproducibility appendix.
  • Hash-chained audit log. 21 CFR Part 11 posture; every stage transition is independently verifiable.
  • Joint write-up. Methods + results section drafted with your team, ready for appendix or preprint.
Verify in 5 minutes — reproduce the 98.5% ChEMBL result locally
Performance spectrum · 99.3% imaging · 92.8% omics · 98.5% drug-GCN — full results →

Reproduce the 98.5% ChEMBL drug-GCN result locally — deterministic, bit-equal across runs at seed=42; CPU-only, no GPU required.

# prerequisites: python3.10+, pip, make, git  (apt: sudo apt install python3 python3-venv make git)
# clone the public reproducibility repo and run the ChEMBL GCN baseline
git clone https://github.com/snptx1/snptx-repro-chembl.git
cd snptx-repro-chembl && make install repro-chembl   # ≈ 3 min install + 15 s run on CPU
cat results/metrics/drug_discovery_result.json   # expect accuracy = 0.9850586979722519
Pipeline rule graph
Phases: ingest → prep → train → eval → report → target. Rules (8): build_dataset · validate_data · split_data · train_model · evaluate_model · summarize_evaluations · plot / report · all.

Architecture & proposed pilots

How the layers compose, and three pilot shapes drawn from the registry of validated components — aligned with graph-ML, knowledge-graph, and foundation-model work in academic biomedicine.

Pilot menu

Execution spine, mediated extensions, autonomous loop

Ordered stages with persisted artifacts; downstream analysis attaches through a contract-validated runner; the autonomy surface feeds next-run selection back into training. Full diagram on Architecture.

[Diagram: SNPTX execution spine, extension runner, autonomy and deployment surfaces.] Four-stage execution spine (ingestion → training → evaluation → reporting) under the Snakemake DAG with DVC artifacts and MLflow tracking; each stage persists an artifact (dataset state, trained model, evaluation set, report bundle). A contract-validated extension runner (calibration · aggregation · evaluation summary) attaches through declared interfaces. The autonomous experimentation surface (ExperimentEngine: GP surrogate, EI / VoI acquisition, SPRT at α=β=0.05, DuckDB experiment catalog, 1,037 exp/hr validated throughput) feeds next-run selection back into training. The deployment surface (FastAPI, RBAC, HMAC-signed tokens, hash-chained audit with 21 CFR Part 11 posture, customer-hosted via Terraform + Helm) receives the evaluation package.

Engagement archetypes (A–D)

Four canonical pilot shapes drawn from the validated capability surface. Each runs 6–10 weeks on already-tested components; compose two for a richer scope. Full menu on the Pilots page.

Archetype A · Drug discovery Validated

Molecular-graph baseline on ChEMBL & Hetionet

Demonstrated: 97.5% accuracy on ChEMBL antibacterial classification (5-fold CV); reproducible end-to-end from a tagged commit; n=3 runs hash-equal under fixed seeds.

Data: chembl_subset · hetionet · open_targets.

Fit: drug repurposing, target prioritization, KG-grounded synergy work.

See Pilot A →
Archetype B · Multi-modal fusion Validated

Late-fusion ensemble across imaging + tabular + omics

Demonstrated: 93.0% accuracy on Visium spatial-transcriptomics fusion; 8 modality families integrated under a shared evaluation contract.

Data: visium_breast · tcga_pancan · gtex.

Fit: spatial-omics, multi-omics integration, biomarker discovery requiring cross-modal evidence.

See Pilot B →
Archetype C · Autonomous loop Validated

GP-guided experimentation with calibrated stopping

Demonstrated: 1,100 experiments / 1.06 hr, 104 discoveries, surrogate ECE = 0.055; EI + VoI acquisition with SPRT (α=β=0.05) stopping.

Data: hetionet · chembl_subset · depmap.

Fit: few-shot repurposing, hyperparameter search at scale, ranked candidate queues with calibrated CIs.

See Pilot C →
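
Surrogate calibration figures like the ECE = 0.055 above are typically computed with a standard binned expected-calibration-error routine; a minimal sketch, not SNPTX's evaluator:

```python
def ece(probs, labels, n_bins=10):
    """Expected calibration error: bin predictions by confidence, then
    average |accuracy - confidence| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted confidence
        acc = sum(y for _, y in b) / len(b)    # empirical accuracy in bin
        err += len(b) / total * abs(acc - conf)
    return err
```

A well-calibrated surrogate scores near zero; a model that says "90% confident" but is right half the time contributes |0.5 − 0.9| = 0.4 from that bin.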
Archetype D · Reproducibility Roadmap

Cross-institutional benchmark replication

Scope: two labs run identical evaluation manifests on the same DAG; deterministic Snakemake/DVC pipeline replays under fixed seeds; the entire run ships with a hash-chained audit trail.

Data: hetionet · chembl_subset (planned: primekg).

Fit: graph-ML results needing externally verified reproducibility; cross-lab benchmarking; resubmission appendix work.

See Pilot D →

Compose, don't commit. A pilot can take any one of A–D as its scope, or compose two (e.g., A + D to land a validated baseline under cross-lab reproducibility, or B + C to drive autonomous candidate selection over fused modalities). The full menu is on the Pilots page.