Eight checks, grouped by purpose, with status flags and a cross-modal variance surface.
Validation uses an eight-check template organized into three groups: universal statistical controls, leakage controls activated by modality, and uncertainty and capacity diagnostics. Each check carries a coverage flag so the reader can see what is applied uniformly, what is selective, and what is partial. The CV summary table reports per-modality variance under the controls that gated each row.
From checks to variance surface to boundary
Three control groups feed a single cross-modal variance surface; the boundary card states the inferential limit of that surface.
Eight checks, grouped by purpose
Each check is tagged with a coverage status. Status reflects what is applied uniformly across primary benchmarks, what is activated where modality requires it, and what is in partial rollout.
Applied across primary benchmarks to bound split sensitivity, randomization effects, and trivial-baseline gaps.
k-Fold cross-validation
5-fold stratified CV with per-fold metrics and variance reporting; bounds the chance that a single favorable split drives a reported number.
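The stratified assignment can be sketched in plain Python. This is an illustrative toy, not the project's pipeline code; the `stratified_kfold` function and its round-robin dealing are assumptions, shown only to make the class-balance guarantee concrete.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Assign each sample index to one of k folds while preserving the
    class ratio: shuffle within each class, then deal round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    return folds

# toy labels with a 60/40 class mix; every fold keeps that ratio
labels = [0] * 60 + [1] * 40
folds = stratified_kfold(labels)
ratios = [sum(labels[i] for i in f) / len(f) for f in folds]
```

Per-fold metrics are then computed on each held-out fold in turn, and the mean and standard deviation across folds become the reported numbers.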
Multi-seed stability
Multi-seed reruns on representative workloads to estimate variance under initialization and ordering effects. Not run on every benchmark.
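The procedure reduces to rerunning a train/eval function under several seeds and summarizing the spread. A minimal sketch with a hypothetical stand-in for the actual training run (`fake_run` is not real pipeline code):

```python
import random
from statistics import mean, stdev

def multiseed_stability(run_fn, seeds=(0, 1, 2, 3, 4)):
    """Rerun a train/eval function under several seeds and summarize the
    spread attributable to initialization and data-ordering effects."""
    scores = [run_fn(s) for s in seeds]
    return mean(scores), stdev(scores)

# hypothetical stand-in run: the score jitters slightly with the seed
fake_run = lambda seed: 0.92 + random.Random(seed).uniform(-0.01, 0.01)
mu, sd = multiseed_stability(fake_run)
```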
Bootstrap confidence intervals
Bootstrap CIs computed for selected tasks where point estimates alone would understate metric uncertainty.
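A percentile bootstrap is one standard way to obtain such intervals; the sketch below is an assumption about method, not a transcript of the pipeline. It resamples per-fold scores with replacement and reads the interval off the quantiles of the resampled means.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean: resample with replacement,
    take the alpha/2 and 1 - alpha/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    boots = sorted(mean(rng.choices(values, k=len(values)))
                   for _ in range(n_boot))
    return (boots[int(n_boot * alpha / 2)],
            boots[int(n_boot * (1 - alpha / 2)) - 1])

# toy per-fold accuracies; the interval brackets the point estimate
accs = [0.91, 0.93, 0.92, 0.94, 0.90]
lo, hi = bootstrap_ci(accs)
```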
Trivial baselines
Every reported model is compared against majority-class and random baselines on the same split. Reported results exceed both.
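Both baselines are trivial to state precisely; a toy sketch (the helper names and the 70/30 class mix are illustrative assumptions):

```python
import random
from collections import Counter

def majority_baseline_acc(y_train, y_test):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(y == majority for y in y_test) / len(y_test)

def random_baseline_acc(y_train, y_test, seed=0):
    """Accuracy of predicting a uniformly random training class."""
    rng = random.Random(seed)
    classes = sorted(set(y_train))
    return sum(rng.choice(classes) == y for y in y_test) / len(y_test)

# toy split with a 70/30 class mix: majority baseline scores 0.7 here,
# so any reported model must clear that bar on the same split
y_train = [0] * 70 + [1] * 30
y_test = [0] * 7 + [1] * 3
maj = majority_baseline_acc(y_train, y_test)
rnd = random_baseline_acc(y_train, y_test)
```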
Applied where the modality permits a leakage path that universal CV would not catch.
Spatial stratification (Visium)
For tissue data, splits prevent spatial autocorrelation from inflating performance: adjacent spots are never split across train and test.
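One way to realize such a split is block assignment with a one-spot buffer: whole grid blocks go to test, and any train spot adjacent to a test spot is dropped. The block size, buffer width, and grid below are illustrative assumptions, not the project's actual split parameters.

```python
import random

def spatial_split(coords, block=4, test_frac=0.2, seed=0):
    """Block-based spatial split with a one-spot buffer: whole grid blocks
    go to test, and train spots adjacent to any test spot are dropped, so
    no adjacent pair straddles the split.  coords: (x, y) spot positions."""
    rng = random.Random(seed)
    blocks = {}
    for i, (x, y) in enumerate(coords):
        blocks.setdefault((x // block, y // block), []).append(i)
    keys = sorted(blocks)
    rng.shuffle(keys)
    test_keys = set(keys[:max(1, int(len(keys) * test_frac))])
    test = [i for k in test_keys for i in blocks[k]]
    test_pos = {coords[i] for i in test}
    def touches_test(p):
        x, y = p
        return any((x + dx, y + dy) in test_pos
                   for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0))
    train = [i for k in keys if k not in test_keys
             for i in blocks[k] if not touches_test(coords[i])]
    return train, test

# toy 16x16 grid standing in for a Visium capture area
coords = [(x, y) for x in range(16) for y in range(16)]
train, test = spatial_split(coords)
```

The buffer costs a ring of training spots around each test block, which is the price of removing the autocorrelation leakage path.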
Fusion holdout (frozen PCA)
PCA embeddings are fit on the training set only; the test set is projected with frozen training-set parameters so fusion features cannot leak.
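The frozen-projection discipline can be sketched with an SVD-based PCA; the data shapes and component count below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))   # hypothetical training features
X_test = rng.normal(size=(30, 20))     # hypothetical held-out features

# Fit on TRAIN ONLY: center with the training mean, take the top-k right
# singular vectors of the centered training matrix.
k = 5
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
components = Vt[:k]                    # frozen from here on

# Project both splits with the frozen parameters; the test set never
# touches mu or components, so fusion features cannot leak.
Z_train = (X_train - mu) @ components.T
Z_test = (X_test - mu) @ components.T
```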
Diagnostics for prediction-set calibration and data-regime adequacy. Coverage is partial and is reported as such.
Conformal prediction
Conformal methods are integrated in targeted pipelines for uncertainty analysis. Broader coverage across modalities is in progress.
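For concreteness, a split-conformal sketch for classification: calibrate a threshold on held-out nonconformity scores, then emit the set of labels within that threshold. The score definition (1 minus the predicted probability) and the calibration data are illustrative assumptions, not the targeted pipelines' exact configuration.

```python
import math
import random

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest
    calibration nonconformity score (clipped to the largest score)."""
    n = len(cal_scores)
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, qhat):
    """Keep every label whose nonconformity (1 - p) is within threshold."""
    return [c for c, p in enumerate(probs) if 1 - p <= qhat]

# hypothetical calibration scores (1 - prob of the true label)
random.seed(0)
cal = [random.random() * 0.5 for _ in range(200)]
qhat = conformal_threshold(cal, alpha=0.1)
pset = prediction_set([0.7, 0.2, 0.1], qhat)
```

Under exchangeability, sets built this way cover the true label with probability at least 1 − alpha, which is the calibration property the diagnostic targets.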
Learning curves
Training set subsampled at fractions [0.1, 0.25, 0.5, 0.75, 1.0] to confirm that a model is not in a data-starved regime on the reported task.
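The subsampling loop can be sketched as follows; the evaluator here is a hypothetical stand-in whose score saturates with sample size, mimicking a model leaving the data-starved regime.

```python
import random

def learning_curve(X, y, fractions, train_eval, seed=0):
    """Subsample the training set at each fraction and record the score
    returned by train_eval; a flat tail suggests the model is not in a
    data-starved regime on the task."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    scores = []
    for frac in fractions:
        rng.shuffle(idx)
        sub = idx[: max(1, int(len(idx) * frac))]
        scores.append(train_eval([X[i] for i in sub], [y[i] for i in sub]))
    return scores

# stand-in evaluator: score improves monotonically with sample size
fake_eval = lambda Xs, ys: 1 - 1 / len(Xs)
X, y = list(range(200)), [0] * 200
scores = learning_curve(X, y, [0.1, 0.25, 0.5, 0.75, 1.0], fake_eval)
```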
Applied = run uniformly across primary benchmarks. Selective = run on a representative subset of benchmarks rather than all of them. Partial rollout = method integrated in some pipelines but not yet generalized across modalities.
Cross-modal CV summary
5-fold stratified CV under the controls listed above. The Controls column records which leakage controls gated each row; remaining checks apply per the universal group.
Mean ± std per modality, with leakage controls
Aggregate variance surface across reported modalities. Numbers are benchmark-task scores under the stated splits, not external-validation scores.
| Modality | Accuracy (mean ± std) | F1 (mean ± std) | Leakage controls |
|---|---|---|---|
| Clinical tabular | 77.4% ± 0.8% | 0.773 ± 0.009 | stratified k-fold |
| Omics | 92.5% ± 0.8% | 0.923 ± 0.009 | stratified k-fold |
| Knowledge graphs | 81.2% ± 1.1% | 0.810 ± 0.012 | stratified k-fold |
| Histopathology (Visium) | 99.1% ± 0.3% | 0.990 ± 0.004 | spatial split |
| Clinical text | 60.5% ± 1.5% | 0.598 ± 0.016 | stratified k-fold |
| Single-cell | 96.5% ± 0.6% | 0.964 ± 0.007 | stratified k-fold |
| Drug discovery | 97.2% ± 0.6% | 0.971 ± 0.007 | stratified k-fold |
| Multi-modal fusion | 92.8% ± 0.7% | 0.926 ± 0.008 | fusion holdout (frozen PCA) |
stratified k-fold indicates 5-fold stratified CV with class balance preserved per fold.
spatial split indicates Visium spots assigned so that adjacent spots are never split across train and test.
fusion holdout (frozen PCA) indicates that PCA components are fit on training only and frozen at projection time.
High-accuracy rows (histopathology, drug discovery, single-cell) reflect benchmark-task scores on the stated splits and are not claims of external-site or clinical performance; see the boundary statement below.
What this audit does and does not establish
The audit is a credibility floor for reported metrics, not a substitute for external validation, site generalization, or clinical evaluation.
Split-stability under stratified CV
Reported numbers reflect aggregate variance across folds rather than a single favorable split, with per-fold metrics retained for inspection.
Leakage controls where applicable
Spatial autocorrelation in Visium and PCA leakage in fusion are addressed by the gating controls noted in the table.
External or site generalization
The audit does not estimate performance on cohorts, sites, or distributions outside the splits used here. Cross-site evaluation is out of scope on this page.
Clinical or diagnostic validity
The reported numbers are benchmark-task scores. They do not constitute clinical validation and are not a deployment claim. See Limitations and Methodology.
Per-fold metrics, split definitions, and the configurations that produced this table are persisted as run artifacts under the execution spine described on the Architecture page. Procedural detail is on the Methodology page.