An Experimentation Layer for Multi-Modal Biomedical Machine Learning
Built during the MITx MicroMasters (Statistics & Data Science) and Harvard ALM (Data Science) programs · integrates Hetionet, Open Targets, ChEMBL, TCGA, GTEx, MIMIC-IV, AlphaFold, and PubMed-class corpora.
SNPTX organizes pipeline execution, comparative evaluation, and downstream analytical extension through artifact-defined stage boundaries and contract-validated extensions. The current build covers eight modality families — tabular, omics, graph, imaging, NLP, single-cell, molecular-graph drug discovery, and late-fusion ensemble — across 46 datasets.
▸ Named datasets per modality, each registered through the build_dataset_generic rule.
Capability surface — four areas: autonomous engine, extensions, deployment, benchmark surface
Autonomous experimentation
Bayesian optimization daemon (ExperimentEngine): GP surrogates over hyperparameter space, expected-improvement plus value-of-information acquisition, SPRT stopping (α=β=0.05), DuckDB experiment catalog. Validated at 1,037 experiments / hour. Outputs: ranked next-run queue, surrogate-calibration trace, novelty score per candidate.
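The SPRT stopping rule named above can be sketched in a few lines. This is a hedged illustration, not the engine's implementation: `sprt_decision` is a hypothetical helper, and the Gaussian log-likelihood-ratio form with known variance is a simplifying assumption (the real daemon scores candidates through its GP surrogate).

```python
import math

def sprt_decision(deltas, sigma=1.0, effect=0.5, alpha=0.05, beta=0.05):
    """Sequential probability ratio test on per-run metric deltas.

    H0: mean delta = 0 (candidate is no better); H1: mean delta = effect.
    Assumes Gaussian observations with known sigma -- a simplification.
    """
    upper = math.log((1 - beta) / alpha)   # cross above: accept H1, stop early
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0, stop early
    llr = 0.0
    for i, d in enumerate(deltas, start=1):
        # log-likelihood-ratio increment for one Gaussian observation
        llr += (effect / sigma ** 2) * (d - effect / 2)
        if llr >= upper:
            return "accept", i
        if llr <= lower:
            return "reject", i
    return "continue", len(deltas)
```

With α=β=0.05, a clearly winning candidate is accepted well before the run budget is exhausted, which is the point of sequential stopping.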
Contract-validated extensions
Seven extension modules under a YAML-contract execution layer; four Tier-1 extensions are contract-validated and stable today. Schema convergence (v1 / v2) is pre-pilot work, scoped and tracked.
• 4 stable · v1/v2 pre-pilot

Deployment pathway
FastAPI inference (tabular, fusion, batch), RBAC with HMAC-signed tokens, hash-chained audit log (21 CFR Part 11 posture), customer-hosted via Terraform and Helm.
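A minimal sketch of HMAC-signed RBAC tokens in the spirit of the line above; the payload fields and the `sign_token` / `verify_token` names are illustrative assumptions, not the deployed API.

```python
import base64, hashlib, hmac, json

def sign_token(payload: dict, key: bytes) -> str:
    """Serialize a role payload and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    sig = hmac.new(key, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def verify_token(token: str, key: bytes):
    """Return the payload if the signature checks out, else None."""
    body, _, sig = token.rpartition(".")
    expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):   # constant-time comparison
        return None
    return json.loads(base64.urlsafe_b64decode(body.encode()))
```

Any tampering with the payload invalidates the signature, so role claims cannot be altered client-side.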
• Available on request

Cross-lab benchmark surface
A shared evaluation surface for graph-ML & foundation-model methods: leakage-free splits, calibration-corrected leaderboards, lineage-stamped checkpoints, contract-validated reports. Designed so two labs can compare results on identical evaluation manifests.
• Operational today

Reproducible multi-modal pipelines, contract-validated extensions, calibrated benchmarks, and an autonomous experimentation surface — validated across 8 axes through versioned artifacts.
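One concrete reading of "leakage-free splits": records that share a group key (a scaffold, patient, or sample lineage) never straddle train and test. A hedged sketch, assuming a simple hash-based assignment rather than the project's actual manifest logic:

```python
import hashlib

def group_split(records, group_key, test_frac=0.2):
    """Deterministically assign whole groups to train or test.

    Hashing the group key makes the split stable across runs and
    guarantees no group appears on both sides (no leakage).
    """
    train, test = [], []
    for rec in records:
        h = hashlib.sha256(str(rec[group_key]).encode()).digest()
        bucket = h[0] / 255.0   # first hash byte -> [0, 1] bucket
        (test if bucket < test_frac else train).append(rec)
    return train, test
```

Because the assignment depends only on the group key, two labs running the same manifest recover the identical split.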
Not a diagnostic, prescribing, or autonomous-clinical capability, and no replacement for wet-lab validation, IRB approval, or regulatory review. See Limitations.
Proposed pilot
A 6–10 week, jointly authored run on one bounded research workflow — one lab-specific data adapter (mapping your expression matrices, KG triples, or imaging manifests to the stage-1 contract), one evaluation package built on Tier-1 validated components (calibration-corrected leaderboard, leakage-free comparison, contract-validation report), reproducible from a tagged commit. Deployment hardening only where the pilot requires it.
1 · Scope
Together we pick one workflow, one dataset family, and one evaluation package. The pilot succeeds or fails on a single, agreed objective.
2 · Adapter
A schema adapter maps your inputs — expression matrices, KG triples, imaging manifests, or clinical tabular — to the stage-1 contract. Data stays in your environment.
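What such an adapter check might look like, as a hedged sketch: the `STAGE1_CONTRACT` field names and types here are invented for illustration and are not the project's actual schema.

```python
# Hypothetical stage-1 contract: required fields and types for one
# tabular/expression record. Field names are illustrative only.
STAGE1_CONTRACT = {
    "sample_id": str,
    "feature_id": str,
    "value": float,
}

def validate_rows(rows):
    """Raise ValueError on the first record violating the contract."""
    for i, row in enumerate(rows):
        missing = set(STAGE1_CONTRACT) - set(row)
        if missing:
            raise ValueError(f"row {i}: missing fields {sorted(missing)}")
        for field, typ in STAGE1_CONTRACT.items():
            if not isinstance(row[field], typ):
                raise ValueError(f"row {i}: {field} should be {typ.__name__}")
    return True
```

Validation runs where the data lives, so only pass/fail reports leave your environment.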
3 · Run
The pipeline executes under Snakemake + DVC with MLflow tracking. The autonomous engine (GP + SPRT + VoI) is opt-in for hyperparameter search.
4 · Evaluation package
Tier-1 extensions emit auditable artifacts — calibration, metric aggregation, evaluation summary, contract checks — reviewed jointly with your team.
Concrete artifacts
- ▸Reproducible pipeline. Snakemake DAG + DVC-versioned data + MLflow run history; n=3 runs are hash-equal under fixed seeds.
- ▸Evaluation package. Calibration + metric aggregation + leakage-free comparison + contract-validation report.
- ▸Lineage manifest. Per-checkpoint data hash, model hash, commit, and config — suitable for a reproducibility appendix.
- ▸Hash-chained audit log. 21 CFR Part 11 posture; every stage transition is independently verifiable.
- ▸Joint write-up. Methods + results section drafted with your team, ready for appendix or preprint.
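The lineage-manifest artifact above can be sketched as a per-checkpoint record; the field names are assumptions for illustration, not the emitted schema.

```python
import hashlib, json

def lineage_entry(data_bytes: bytes, model_bytes: bytes,
                  commit: str, config: dict) -> dict:
    """Build one lineage-manifest record for a checkpoint.

    Hashing the config via sorted-key JSON makes the record
    deterministic: identical inputs always yield identical entries.
    """
    return {
        "data_hash": hashlib.sha256(data_bytes).hexdigest(),
        "model_hash": hashlib.sha256(model_bytes).hexdigest(),
        "commit": commit,
        "config_hash": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest(),
    }
```

Determinism is what makes the manifest usable in a reproducibility appendix: a reviewer recomputes the hashes and compares.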
Verify in 5 minutes — reproduce the 98.0% ChEMBL result locally
The 98.0% ChEMBL drug-GCN result is deterministic and bit-equal across runs at seed=42; CPU-only, no GPU required.
```shell
# prerequisites: python3.10+, pip, make, git (apt: sudo apt install python3 python3-venv make git)
# clone the public reproducibility repo and run the ChEMBL GCN baseline
git clone https://github.com/snptx1/snptx-repro-chembl.git
cd snptx-repro-chembl && make install repro-chembl   # ≈ 3 min install + 15 s run on CPU
cat results/metrics/drug_discovery_result.json       # expect accuracy = 0.9850586979722519
```