Audit template and variance discipline

Headline metrics on Results are paired with a fixed set of audit controls. This page names the controls, marks their coverage status, and bounds the claims they support.

Validation view
Audit template · coverage status · bounded claims

Eight checks, grouped by purpose, with status flags and a cross-modal variance surface.

Validation uses an eight-check template organized into three groups: universal statistical controls, leakage controls activated by modality, and uncertainty / capacity diagnostics. Each check carries a coverage flag so the reader can see what is applied uniformly, what is selective, and what is partial. The CV summary table reports per-modality variance under the controls that gated each row.

Audit map

From checks to variance surface to boundary

[Figure: SNPTX validation audit map. Three vertical bands of audit checks feed into a cross-modal CV variance surface, which is bounded by an explicit statement of what is not established. UNIVERSAL CONTROLS (applied across primary benchmarks): 5-fold stratified CV, multi-seed reruns, bootstrap CIs, trivial baselines. LEAKAGE CONTROLS (activated by modality): spatial stratification (Visium), fusion holdout (PCA frozen). UNCERTAINTY & CAPACITY (selective / partial coverage): conformal prediction (partial), learning curves. These feed the CROSS-MODAL CV SURFACE (mean ± std per modality, 8 modalities reported, per-fold metrics retained, controls noted per row), bounded by the BOUNDARY card: does not establish external or clinical validity. Legend: UNIVERSAL = applied uniformly · LEAKAGE = activated where modality requires it · UNCERTAINTY = selective or partial.]

Three control groups feed a single cross-modal variance surface; the boundary card states the inferential limit of that surface.

Eight checks, grouped by purpose

Each check is tagged with a coverage status. Status reflects what is applied uniformly across primary benchmarks, what is activated where modality requires it, and what is in partial rollout.

Universal statistical controls

Applied across primary benchmarks to bound split sensitivity, randomization effects, and trivial-baseline gaps.

Check 1
Applied

k-Fold cross-validation

5-fold stratified CV with per-fold metrics and variance reporting; bounds the chance that a single favorable split drives a reported number.
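The reporting discipline above can be sketched as follows. This is a minimal illustration with a stand-in dataset and model, not the benchmarks on this page.

```python
# Sketch of 5-fold stratified CV with per-fold metrics and variance reporting.
# Dataset and model are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_informative=8, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_acc, fold_f1 = [], []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_acc.append(accuracy_score(y[test_idx], pred))
    fold_f1.append(f1_score(y[test_idx], pred))

# Report mean ± std; per-fold values are retained for inspection.
print(f"accuracy: {np.mean(fold_acc):.3f} ± {np.std(fold_acc):.3f}")
print(f"f1:       {np.mean(fold_f1):.3f} ± {np.std(fold_f1):.3f}")
```

Retaining the per-fold lists, rather than only the aggregates, is what makes the "single favorable split" failure mode inspectable.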

Check 2
Selective

Multi-seed stability

Multi-seed reruns on representative workloads to estimate variance under initialization and ordering effects. Not run on every benchmark.
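A minimal sketch of a multi-seed rerun, with a hypothetical dataset and a fixed split so that the reported spread isolates initialization and ordering effects:

```python
# Retrain under several seeds on one fixed split; the std estimates
# variance attributable to randomization rather than to the data split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = []
for seed in [0, 1, 2, 3, 4]:
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))

print(f"multi-seed accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```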

Check 3
Selective

Bootstrap confidence intervals

Bootstrap CIs computed for selected tasks where point estimates alone would understate metric uncertainty.
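One common form of this check is a percentile bootstrap over per-sample outcomes; the sketch below uses synthetic correct/incorrect flags as a stand-in for real predictions.

```python
# Percentile-bootstrap 95% CI on accuracy, from per-sample correctness flags.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.random(200) < 0.8   # hypothetical per-sample correct/incorrect flags

boot_means = np.array([
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"accuracy {correct.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The width of the interval is the information a bare point estimate would hide.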

Check 8
Applied

Trivial baselines

Every reported model is compared against majority-class and random baselines on the same split. Reported results exceed both.
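The baseline comparison can be sketched with scikit-learn's dummy strategies on a hypothetical imbalanced dataset:

```python
# Majority-class and random baselines on the same split, as a floor
# that any reported model score must clear.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
majority = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr).score(X_te, y_te)
random_bl = DummyClassifier(strategy="uniform", random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(f"model {model_acc:.3f} vs majority {majority:.3f} vs random {random_bl:.3f}")
```

On imbalanced data the majority baseline, not the random one, is usually the binding floor.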

Leakage controls (activated by modality)

Applied where the modality permits a leakage path that universal CV would not catch.

Check 4
Applied

Spatial stratification (Visium)

For tissue data, splits prevent spatial autocorrelation from inflating performance: adjacent spots are never split across train and test.
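One way to implement this constraint is to bucket spot coordinates into coarse spatial tiles and split by tile, so neighboring spots always land on the same side. The sketch below uses hypothetical coordinates and a grouped k-fold; it illustrates the constraint, not the exact tiling used on this page.

```python
# Spatial split sketch: bucket (x, y) spot coordinates into a 4x4 grid of
# tiles, then use GroupKFold so each tile stays entirely in train or test.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
coords = rng.random((300, 2)) * 10          # stand-in Visium spot coordinates
labels = rng.integers(0, 2, size=300)

tile = (coords // 2.5).astype(int)          # tile index per axis, 0..3
groups = tile[:, 0] * 4 + tile[:, 1]        # one group id per spatial tile

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(coords, labels, groups=groups):
    # No tile, hence no pair of adjacent spots within a tile, straddles the split.
    assert not set(groups[train_idx]) & set(groups[test_idx])
```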

Check 5
Applied

Fusion holdout (frozen PCA)

PCA embeddings are fit on the training set only; the test set is projected with frozen training-set parameters so fusion features cannot leak.
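The freeze discipline reduces to fitting on training rows only and reusing those parameters at projection time, sketched here with hypothetical feature matrices:

```python
# Frozen-PCA holdout: components are estimated from the training set only;
# the test set is projected with those frozen parameters.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))        # stand-in fusion features, train
X_test = rng.normal(size=(50, 50))          # stand-in fusion features, test

pca = PCA(n_components=10).fit(X_train)     # parameters come from train only
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)              # projected, never fitted

print(Z_train.shape, Z_test.shape)
```

Fitting PCA on the pooled train-plus-test matrix would let test-set statistics shape the embedding, which is exactly the leak this control rules out.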

Uncertainty and capacity diagnostics

Diagnostics for prediction-set calibration and data-regime adequacy. Coverage is partial and is reported as such.

Check 6
Partial rollout

Conformal prediction

Conformal methods are integrated in targeted pipelines for uncertainty analysis. Broader coverage across modalities is in progress.
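As one concrete instance of the family, a split-conformal classifier can be sketched as below. This is an illustrative recipe on synthetic data, not the targeted pipelines referenced above: a calibration set fixes a score threshold so prediction sets cover the true label at roughly the requested rate.

```python
# Split conformal prediction sketch: calibrate a nonconformity threshold,
# then emit prediction sets whose empirical coverage tracks 1 - alpha.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

alpha = 0.1
cal_p = model.predict_proba(X_cal)
scores = 1.0 - cal_p[np.arange(len(y_cal)), y_cal]   # nonconformity of true class
n = len(scores)
qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

te_p = model.predict_proba(X_te)
pred_sets = te_p >= 1.0 - qhat                       # boolean (n_test, n_classes)
coverage = pred_sets[np.arange(len(y_te)), y_te].mean()
print(f"empirical coverage: {coverage:.2f} (target about {1 - alpha})")
```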

Check 7
Selective

Learning curves

Training set subsampled at fractions [0.1, 0.25, 0.5, 0.75, 1.0] to confirm that a model is not in a data-starved regime on the reported task.
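The subsampling loop can be sketched as follows, on an illustrative dataset; a curve that flattens toward the full-data fraction is the signal that the task is not data-starved.

```python
# Learning-curve sketch: train on growing fractions of the training set
# and score each model on the same held-out test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
curve = {}
for frac in [0.1, 0.25, 0.5, 0.75, 1.0]:
    n = int(frac * len(X_tr))
    idx = rng.choice(len(X_tr), size=n, replace=False)
    model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    curve[frac] = model.score(X_te, y_te)

for frac, acc in curve.items():
    print(f"{frac:>4}: {acc:.3f}")
```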

How to read the status pills

Applied = run uniformly across primary benchmarks. Selective = run on a representative subset of benchmarks rather than all of them. Partial rollout = method integrated in some pipelines but not yet generalized across modalities.

Cross-modal CV summary

5-fold stratified CV under the controls listed above. The Controls column records which leakage controls gated each row; remaining checks apply per the universal group.

5-fold stratified CV summary

Mean ± std per modality, with leakage controls

Aggregate variance surface across reported modalities. Numbers are benchmark-task scores under the stated splits, not external-validation scores.

Modality                 | Accuracy (mean ± std) | F1 (mean ± std) | Leakage controls
Clinical tabular         | 77.4% ± 0.8%          | 0.773 ± 0.009   | stratified k-fold
Omics                    | 92.5% ± 0.8%          | 0.923 ± 0.009   | stratified k-fold
Knowledge graphs         | 81.2% ± 1.1%          | 0.810 ± 0.012   | stratified k-fold
Histopathology (Visium)  | 99.1% ± 0.3%          | 0.990 ± 0.004   | spatial split
Clinical text            | 60.5% ± 1.5%          | 0.598 ± 0.016   | stratified k-fold
Single-cell              | 96.5% ± 0.6%          | 0.964 ± 0.007   | stratified k-fold
Drug discovery           | 97.2% ± 0.6%          | 0.971 ± 0.007   | stratified k-fold
Multi-modal fusion       | 92.8% ± 0.7%          | 0.926 ± 0.008   | fusion holdout (frozen PCA)

stratified k-fold = 5-fold stratified CV with class balance preserved per fold. spatial split = Visium spots assigned so that adjacent spots are never split across train and test. fusion holdout (frozen PCA) = PCA components fit on training only and frozen at projection time.

High-accuracy rows (histopathology, drug discovery, single-cell) reflect benchmark-task scores on the stated splits and are not claims of external-site or clinical performance; see the boundary statement below.

What this audit does and does not establish

The audit is a credibility floor for reported metrics, not a substitute for external validation, site generalization, or clinical evaluation.

Establishes

Split-stability under stratified CV

Reported numbers reflect aggregate variance across folds rather than a single favorable split, with per-fold metrics retained for inspection.

Establishes

Leakage controls where applicable

Spatial autocorrelation in Visium and PCA leakage in fusion are addressed by the gating controls noted in the table.

Does not establish

External or site generalization

The audit does not estimate performance on cohorts, sites, or distributions outside the splits used here. Cross-site evaluation is out of scope on this page.

Does not establish

Clinical or diagnostic validity

The reported numbers are benchmark-task scores. They do not constitute clinical validation and are not a deployment claim. See Limitations and Methodology.

Reproducibility pointer

Per-fold metrics, split definitions, and the configurations that produced this table are persisted as run artifacts under the execution spine described on the Architecture page. Procedural detail is on the Methodology page.