Research framework · Pilot brief

An Experimentation Layer for Multi-Modal Biomedical Machine Learning

Built during the MITx MicroMasters (Statistics & Data Science) and Harvard ALM (Data Science) programs · integrates Hetionet, Open Targets, ChEMBL, TCGA, GTEx, MIMIC-IV, AlphaFold, and PubMed-class corpora.

SNPTX organizes pipeline execution, comparative evaluation, and downstream analytical extension through artifact-defined stage boundaries and contract-validated extensions. The current build covers eight modality families — tabular, omics, graph, imaging, NLP, single-cell, molecular-graph drug discovery, and late-fusion ensemble — across 46 datasets.

Headline metrics
  • 1,037 experiments / hour
  • 8 modality families / 46 datasets
  • 98.5% accuracy · ChEMBL drug GCN
  • 104 discoveries / 1,100-experiment campaign

Engineering posture: 1,357 tests · 0 ruff errors · 0 pyright errors · Snakemake + DVC + MLflow · deterministic seeding · k-fold leakage gates · 5 implemented + 10 theory-grounded extensions · evidence-bounded across 6 dimensions
Named datasets per modality
Knowledge graphs & pathways: hetionet, open_targets, gene_ontology, reactome_pathways (planned: string_human, primekg)
Molecular graphs & structure: chembl_subset, moleculenet_tox21, alphafold_targets
Omics — transcriptomics & eQTL: tcga_pancan, gtex_v8, gtex_eqtl
Omics — proteomics & metabolomics: cptac, pride_sample, metabolights_mtbls1
Single-cell & spatial: cellxgene_pbmc, tenx_pbmc_3k, visium_brain
Variants & functional genomics: clinvar, thousand_genomes, gnomad, depmap, gwas_catalog
Clinical & pharmacovigilance: synthea_readmission, mimic_iv, eicu_crd, cms_synpuf, faers (planned: aact, omop_synthea)
Text, imaging, time-series: mtsamples, pubmed_sample, mitbih (planned: mimic_cxr, chexpert_small, camelyon17_sample, mimic_notes, hmp2_ibd)
DVC-versioned (s3://snptx-dvc-prod, gs://snptx-dvc-storage); ingested via the adapter-based build_dataset_generic rule.
Capability surface (4) — autonomous engine, extensions, deployment, benchmark surface
Capability · Autonomous experimentation (★ lead capability)

Bayesian optimization daemon (ExperimentEngine): GP surrogates over hyperparameter space, expected-improvement plus value-of-information acquisition, SPRT stopping (α=β=0.05), DuckDB experiment catalog. Validated at 1,037 experiments / hour. Outputs: ranked next-run queue, surrogate-calibration trace, novelty score per candidate.

• Operational today
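
The acquisition step described above can be sketched in a few lines. Below is a minimal, stdlib-only illustration of expected-improvement (EI) scoring over a GP posterior; the function names and candidate values are illustrative, not SNPTX's actual API.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI of a candidate with GP posterior mean `mu` and std `sigma`,
    relative to the incumbent `best` (maximization, exploration margin `xi`)."""
    if sigma <= 0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * normal_cdf(z) + sigma * normal_pdf(z)

# Rank hypothetical (mu, sigma) candidates against an incumbent of 0.98:
# high-uncertainty candidates can outrank slightly-better certain ones.
candidates = [(0.97, 0.02), (0.95, 0.06), (0.985, 0.001)]
ranked = sorted(candidates, key=lambda c: -expected_improvement(*c, best=0.98))
```

Note how the uncertain candidate (0.95, 0.06) ranks first: EI trades off posterior mean against uncertainty, which is what lets a surrogate-driven loop keep exploring.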
Capability · Contract-validated extensions

Seven extension modules under a YAML-contract execution layer; four Tier-1 extensions are contract-validated and stable today. Schema convergence (v1 / v2) is pre-pilot work, scoped and tracked.

• 4 stable · v1/v2 pre-pilot
Capability · Deployment pathway

FastAPI inference (tabular, fusion, batch), RBAC with HMAC-signed tokens, hash-chained audit log (21 CFR Part 11 posture), customer-hosted via Terraform and Helm.

• Available on request
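
The hash-chained audit log is a standard construction; the sketch below (not SNPTX's implementation) shows how each entry commits to its predecessor's hash, so tampering anywhere in the history breaks verification.

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel hash for the first entry

def append_entry(chain, event):
    """Append `event` to a hash-chained log. Each entry's hash covers the
    previous entry's hash, linking the whole history together."""
    prev = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    chain.append({"prev": prev, "event": event,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})
    return chain

def verify(chain):
    """Recompute every hash from the genesis sentinel forward."""
    prev = GENESIS
    for entry in chain:
        payload = json.dumps({"prev": prev, "event": entry["event"]}, sort_keys=True)
        if entry["prev"] != prev or entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, {"stage": "train", "status": "complete"})
append_entry(log, {"stage": "eval", "status": "complete"})
```

Editing any earlier event invalidates every later hash, which is what gives stage transitions independent verifiability.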
Capability · Cross-lab benchmark surface

A shared evaluation surface for graph-ML & foundation-model methods: leakage-free splits, calibration-corrected leaderboards, lineage-stamped checkpoints, contract-validated reports. Designed so two labs can compare results on identical evaluation manifests.

• Operational today
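
Leakage-free splits typically hold an entire group (a patient, a molecular scaffold) on one side of the train/test boundary. A minimal, assumption-laden sketch using deterministic hashing — illustrative only, not SNPTX's splitter:

```python
import hashlib

def group_split(records, group_key, test_frac=0.2):
    """Deterministic group-aware split: all records sharing a group id
    land on the same side, preventing train/test leakage. Hash-based,
    so the assignment is reproducible across machines with no RNG state."""
    train, test = [], []
    for rec in records:
        h = int(hashlib.sha256(str(rec[group_key]).encode()).hexdigest(), 16)
        (test if (h % 100) < test_frac * 100 else train).append(rec)
    return train, test

# Usage: split by patient so no patient appears in both partitions.
records = [{"patient": f"P{i % 5}", "value": i} for i in range(100)]
train, test = group_split(records, "patient")
```

Because assignment is a pure function of the group id, two labs running the same manifest get byte-identical partitions.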
What SNPTX claims

Reproducible multi-modal pipelines, contract-validated extensions, calibrated benchmarks, and an autonomous experimentation surface — validated across 8 axes through versioned artifacts.

What SNPTX does not claim

Diagnostic, prescribing, or autonomous-clinical capability. No replacement for wet-lab validation, IRB, or regulatory review. See Limitations.

Pilot in 4 steps

Proposed pilot

A 6–10 week, jointly authored run on one bounded research workflow: one lab-specific data adapter (mapping your expression matrices, KG triples, or imaging manifests to the stage-1 contract) and one evaluation package built on Tier-1 validated components (calibration-corrected leaderboard, leakage-free comparison, contract-validation report), reproducible from a tagged commit. Deployment hardening only where the pilot requires it.

1 · Scope

Together we pick one workflow, one dataset family, and one evaluation package. The pilot succeeds or fails on a single, agreed objective.

2 · Adapter

A schema adapter maps your inputs — expression matrices, KG triples, imaging manifests, or clinical tabular — to the stage-1 contract. Data stays in your environment.
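
A stage-1 adapter can be as small as a typed field map. The contract fields below are hypothetical stand-ins for illustration — not SNPTX's published schema:

```python
# Hypothetical stage-1 contract: field name -> required type.
REQUIRED_FIELDS = {"sample_id": str, "gene_id": str, "expression": float}

def adapt_record(raw):
    """Map one row of a lab-local expression matrix to the (assumed)
    stage-1 contract, failing loudly on missing or mistyped fields."""
    out = {}
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in raw:
            raise ValueError(f"contract violation: missing {field!r}")
        out[field] = ftype(raw[field])  # coerce, e.g. "7.2" -> 7.2
    return out

row = adapt_record({"sample_id": "S1", "gene_id": "TP53", "expression": "7.2"})
```

The point is the boundary: everything downstream of the adapter sees only contract-conformant records, so your raw data never needs to leave your environment.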

3 · Run

The pipeline executes under Snakemake + DVC with MLflow tracking. The autonomous engine (GP + SPRT + VoI) is opt-in for hyperparameter search.
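
Deterministic replay usually derives per-stage seeds from one master seed. A sketch under that assumption (a real pipeline must also seed numpy/torch and pin library versions; the derivation scheme here is illustrative):

```python
import hashlib

def stage_seed(master_seed, stage_name):
    """Derive a stable per-stage seed from the master seed and the stage
    name, so every pipeline rule replays identically across machines
    without sharing RNG state between stages."""
    digest = hashlib.sha256(f"{master_seed}:{stage_name}".encode()).hexdigest()
    return int(digest[:8], 16)  # 32-bit seed

s = stage_seed(42, "train_model")
```

Hashing rather than incrementing means adding or reordering stages never perturbs the seeds of existing ones.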

4 · Evaluation package

Tier-1 extensions emit auditable artifacts — calibration, metric aggregation, evaluation summary, contract checks — reviewed jointly with your team.

What you receive

Concrete artifacts

  • Reproducible pipeline. Snakemake DAG + DVC-versioned data + MLflow run history; n=3 runs are hash-equal under fixed seeds.
  • Evaluation package. Calibration + metric aggregation + leakage-free comparison + contract-validation report.
  • Lineage manifest. Per-checkpoint data hash, model hash, commit, and config — suitable for a reproducibility appendix.
  • Hash-chained audit log. 21 CFR Part 11 posture; every stage transition is independently verifiable.
  • Joint write-up. Methods + results section drafted with your team, ready for appendix or preprint.
Verify in 5 minutes — reproduce the 98.5% ChEMBL result locally
Performance spectrum · 99.3% imaging · 92.8% omics · 98.5% drug-GCN — full results →

Reproduce the 98.5% ChEMBL drug-GCN result locally — deterministic, bit-equal across runs at seed=42; CPU-only, no GPU required.

# prerequisites: python3.10+, pip, make, git  (apt: sudo apt install python3 python3-venv make git)
# clone the public reproducibility repo and run the ChEMBL GCN baseline
git clone https://github.com/snptx1/snptx-repro-chembl.git
cd snptx-repro-chembl && make install repro-chembl   # ≈ 3 min install + 15 s run on CPU
cat results/metrics/drug_discovery_result.json   # expect accuracy = 0.9850586979722519
Pipeline rule graph
Phases: ingest → prep → train → eval → report → target. Rules (8): build_dataset · validate_data · split_data · train_model · evaluate_model · summarize_evaluations · plot / report · all.

Architecture & proposed pilots

How the layers compose, and three pilot shapes drawn from the registry of validated components — aligned with graph-ML, knowledge-graph, and foundation-model work in academic biomedicine.

Pilot menu

Execution spine, mediated extensions, autonomous loop

Ordered stages with persisted artifacts; downstream analysis attaches through a contract-validated runner; the autonomy surface feeds next-run selection back into training. Full diagram on Architecture.

[Diagram: SNPTX execution spine, extension runner, autonomy and deployment surfaces.] Four-stage execution spine (ingestion → training → evaluation → reporting) under the Snakemake DAG with DVC artifacts and MLflow tracking; each stage persists an artifact (dataset state, trained model, evaluation set, report bundle). A contract-validated extension runner (calibration · aggregation · evaluation summary) attaches through declared interfaces. The autonomous experimentation surface (ExperimentEngine: GP surrogate, EI / VoI acquisition, SPRT at α=β=0.05, DuckDB experiment catalog, 1,037 exp/hr validated throughput) feeds next-run selection back into training. The deployment surface (FastAPI, RBAC, HMAC-signed tokens, hash-chained audit with 21 CFR Part 11 posture, customer-hosted via Terraform + Helm) receives the evaluation package.

Engagement archetypes (A–D)

Four canonical pilot shapes drawn from the validated capability surface. Each runs 6–10 weeks on already-tested components; compose two for a richer scope. Full menu on the Pilots page.

Archetype A · Drug discovery Validated

Molecular-graph baseline on ChEMBL & Hetionet

Demonstrated: 97.5% accuracy on ChEMBL antibacterial classification (5-fold CV); reproducible end-to-end from a tagged commit; n=3 runs hash-equal under fixed seeds.

Data: chembl_subset · hetionet · open_targets.

Fit: drug repurposing, target prioritization, KG-grounded synergy work.

See Pilot A →
Archetype B · Multi-modal fusion Validated

Late-fusion ensemble across imaging + tabular + omics

Demonstrated: 93.0% accuracy on Visium spatial-transcriptomics fusion; 8 modality families integrated under a shared evaluation contract.

Data: visium_breast · tcga_pancan · gtex.

Fit: spatial-omics, multi-omics integration, biomarker discovery requiring cross-modal evidence.

See Pilot B →
Archetype C · Autonomous loop Validated

GP-guided experimentation with calibrated stopping

Demonstrated: 1,100 experiments / 1.06 hr, 104 discoveries, surrogate ECE = 0.055; EI + VoI acquisition with SPRT (α=β=0.05) stopping.

Data: hetionet · chembl_subset · depmap.

Fit: few-shot repurposing, hyperparameter search at scale, ranked candidate queues with calibrated CIs.

See Pilot C →
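
Surrogate calibration figures like the ECE = 0.055 above are typically computed with a standard binned expected-calibration-error routine; a minimal sketch, not SNPTX's evaluator:

```python
def ece(probs, labels, n_bins=10):
    """Expected calibration error: bin predictions by confidence, then
    average |accuracy - confidence| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted confidence
        acc = sum(y for _, y in b) / len(b)    # empirical accuracy in bin
        err += len(b) / total * abs(acc - conf)
    return err
```

A well-calibrated surrogate scores near zero; a model that says "90% confident" but is right half the time contributes |0.5 − 0.9| = 0.4 from that bin.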
Archetype D · Reproducibility Roadmap

Cross-institutional benchmark replication

Scope: two labs run identical evaluation manifests on the same DAG; deterministic Snakemake/DVC pipeline replays under fixed seeds; the entire run ships with a hash-chained audit trail.

Data: hetionet · chembl_subset (planned: primekg).

Fit: graph-ML results needing externally verified reproducibility; cross-lab benchmarking; resubmission appendix work.

See Pilot D →

Compose, don't commit. A pilot can take any one of A–D as its scope, or compose two (e.g., A + D to land a validated baseline under cross-lab reproducibility, or B + C to drive autonomous candidate selection over fused modalities). The full menu is on the Pilots page.