Benchmark surface

Seven unimodal benchmarks plus one tri-modal fusion run, each tied to a named dataset, model, and validation protocol.

Evidence
Per-modality benchmark accuracy

Eight evaluated settings across seven biomedical modalities and one fusion run.

Reported accuracies range from 0.610 (clinical text, 5-class classification with a 0.20 random baseline) to 0.993 (PathMNIST, 9-class). The Validation column records, for every row, the protocol that produced the number.

Modality | Dataset | Model | Accuracy | Validation
Clinical tabular | Synthea readmission (n=6,625) | XGBoost + Optuna | 0.777 | 5-fold CV, 0.774 ± 0.008
Omics | Visium breast cancer (n=3,798 spots) | VAE | 0.928 | Validation split, 14-class Leiden clustering
Knowledge graphs | Hetionet ego-graphs (n=1,913) | GAT | 0.815 | Validation split, center-node readout, 8-class
Histopathology | PathMNIST (n=107,180) | DenseNet-121 | 0.993 | Validation split, 9-class
Clinical text | MTSamples (n≈3,500) | ClinicalBERT | 0.610 | 5-class specialty, random baseline 0.20
Single-cell imaging | BloodMNIST (n=17,092) | DenseNet-121 | 0.969 | Test split, 8-class
Drug discovery | ChEMBL bioactivity (n=4,685) | GCN | 0.975 | Validation split, molecular graph classification
Multi-modal fusion | Visium (omics + vision + spatial) | Attention fusion | 0.930 | Validation split; vs. 0.928 omics-only on same dataset
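
For concreteness, the clinical tabular row's protocol can be sketched in a few lines: XGBoost tuned by Optuna, scored with 5-fold cross-validated accuracy. This is a minimal illustration, not the project's actual pipeline; the synthetic feature matrix, search space, and trial budget are all assumptions.

```python
# Minimal sketch of the clinical-tabular protocol: XGBoost tuned by Optuna,
# scored with 5-fold cross-validated accuracy. Synthetic data stands in for
# the Synthea readmission features; the search space and trial budget are
# illustrative assumptions, not the project's configuration.
import optuna
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=6_625, n_features=30, random_state=0)

def objective(trial: optuna.Trial) -> float:
    model = xgb.XGBClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 600),
        max_depth=trial.suggest_int("max_depth", 3, 10),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
        eval_metric="logloss",
    )
    # 5-fold CV accuracy is the quantity reported in the table row.
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(f"best 5-fold CV accuracy: {study.best_value:.3f}")
```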
Sorted view

Per-modality accuracy, descending

The same eight numbers as the table, ordered by accuracy; bar widths are linear over [0, 1].

Histopathology · DenseNet-121: 0.993
Drug discovery · GCN: 0.975
Single-cell · DenseNet-121: 0.969
Fusion · attention (Visium): 0.930
Omics · VAE: 0.928
Knowledge graphs · GAT: 0.815
Clinical tabular · XGBoost: 0.777
Clinical text · ClinicalBERT: 0.610
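
The sorted view is reproducible from the table alone; a minimal matplotlib sketch is below. The figure styling is an assumption; the values are the eight accuracies above.

```python
# Minimal sketch reproducing the sorted bar view: horizontal bars over a
# linear [0, 1] axis, using the eight accuracies from the table above.
import matplotlib.pyplot as plt

rows = [
    ("Histopathology · DenseNet-121", 0.993),
    ("Drug discovery · GCN", 0.975),
    ("Single-cell · DenseNet-121", 0.969),
    ("Fusion · attention (Visium)", 0.930),
    ("Omics · VAE", 0.928),
    ("Knowledge graphs · GAT", 0.815),
    ("Clinical tabular · XGBoost", 0.777),
    ("Clinical text · ClinicalBERT", 0.610),
]
labels, accs = zip(*rows)

fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(labels, accs)
ax.set_xlim(0, 1)   # bar widths are linear over [0, 1]
ax.invert_yaxis()   # highest accuracy on top
ax.set_xlabel("Accuracy")
fig.tight_layout()
plt.show()
```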
Reading these numbers

Benchmark accuracy, not clinical performance

Every row reports accuracy on a held-out split or cross-validation protocol of a public or synthetic dataset. These are infrastructure and methodology measurements that show the framework can train, evaluate, and report across heterogeneous biomedical modalities under one execution surface. They are not estimates of clinical utility. Clinical evaluation, where applicable, lives on the Validation page.

Pipeline that produced these numbers

Every row in the table above is the output of the same staged DAG. Stage transitions persist artifacts; a feedback edge from reporting drives next-run selection.

[Pipeline diagram. Execution spine: Snakemake DAG, DVC artifacts, MLflow tracking. Stages: Ingestion (adapters) → Training (models) → Evaluation (metrics) → Diagnostics (calibration) → Aggregation (summaries) → Reporting (bundles) → Feedback (next-run selection). Closed-loop feedback: reporting drives next-run selection.]
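
As an illustration of the tracking leg of the spine only: each benchmark run could be logged to MLflow roughly as below. The experiment name, tags, and run naming are hypothetical, not the project's actual conventions.

```python
# Illustrative sketch of the MLflow tracking side of the spine: one run per
# benchmark, with the accuracy and validation protocol recorded as metric
# and tags. Experiment name, tags, and run name are hypothetical.
import mlflow

mlflow.set_experiment("benchmark-surface")  # hypothetical experiment name

with mlflow.start_run(run_name="histopathology-pathmnist"):
    mlflow.set_tag("modality", "histopathology")
    mlflow.set_tag("validation", "validation split, 9-class")
    mlflow.log_param("model", "DenseNet-121")
    mlflow.log_metric("accuracy", 0.993)
    # Reporting later reads these runs to assemble the bundle and the table.
```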
Artifact handoffs

Each stage persists state

Ingestion writes a dataset state, training writes a model, evaluation writes metrics, and reporting writes the bundle that produced the rows above. Stage transitions cross artifact boundaries rather than passing in-process state.
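
A minimal sketch of what crossing an artifact boundary means here, with hypothetical paths and payloads: each stage reads only the previous stage's persisted file and writes its own.

```python
# Minimal sketch of artifact-bounded stage transitions: each stage reads the
# previous stage's persisted file and writes its own; no in-process state is
# shared between stages. Paths and payloads are hypothetical placeholders.
import json
from pathlib import Path

def evaluation_stage(model_path: Path, data_path: Path, metrics_path: Path) -> None:
    """Consume training's and ingestion's artifacts, emit evaluation's artifact."""
    model = json.loads(model_path.read_text())  # stand-in for a real model load
    data = json.loads(data_path.read_text())    # stand-in for a dataset state
    metrics = {"model": model["name"], "accuracy": 0.0, "n_eval": len(data)}  # placeholder
    metrics_path.write_text(json.dumps(metrics))

def reporting_stage(metrics_path: Path, bundle_path: Path) -> None:
    """Consume evaluation's artifact, emit the reporting bundle."""
    metrics = json.loads(metrics_path.read_text())
    bundle_path.write_text(json.dumps({"rows": [metrics]}))
```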

Feedback edge

Reporting drives next-run selection

The feedback path links the reporting bundle back to next-run selection through declared interfaces. It influences which experiments run next; it does not bypass the staged sequence or the artifact record.
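
One hedged reading of that interface, with a hypothetical bundle schema: the selector consumes the reporting bundle, emits a rerun list, and touches nothing upstream.

```python
# Hedged sketch of the feedback edge: next-run selection reads only the
# reporting bundle (a declared interface) and proposes which benchmarks to
# rerun; it does not bypass the staged sequence. Bundle schema and the
# threshold policy are hypothetical.
import json
from pathlib import Path

def select_next_runs(bundle_path: Path, threshold: float = 0.90) -> list[str]:
    """Propose reruns for modalities whose accuracy falls below a threshold."""
    bundle = json.loads(bundle_path.read_text())
    return [
        row["modality"]
        for row in bundle["rows"]
        if row["accuracy"] < threshold
    ]

# With the table's numbers and a 0.90 threshold, clinical text (0.610),
# clinical tabular (0.777), and knowledge graphs (0.815) would be queued.
```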

Delivery state of the program

Phase markers for the build that produced the table above. Status is reported as complete or partial; nothing here implies clinical or production readiness.

Phase | Name | Status
A | Data foundation | Complete
A.6 | Theoretical hardening | Complete
B | Intelligence layer | Complete
B.6 | Theoretical intelligence | Complete
C | Multi-modal expansion | Complete
C.6 | Multi-modal hardening | Complete
D | Deployment & compliance | Partial
D.6 | Deployment hardening | Partial
E | Platform & scale | Partial
Status taxonomy

Complete means the phase has shipped artifacts that the table above depends on. Partial means in-progress work toward declared deliverables; deployment, compliance posture, and platform scaling are not finished and are not claimed as production-ready.