Experimental methodology

Compute environment, dataset coverage, per-modality training configuration, and the reproducibility controls that bound the reported results.

Methods

Controls and configuration that bound the reported benchmark surface.

Experiments run on a fixed compute environment with versioned configuration, fixed seeds, deterministic CUDA where supported, DAG-enforced execution order, and tracked artifacts. These controls improve reproducibility within the documented scope; they are not presented as universal guarantees.

Compute and software environment

Single-node configuration used for the reported experiments. Static-analysis posture is reported separately under reproducibility controls.

| Resource | Specification |
| --- | --- |
| Instance | AWS EC2 g5.xlarge |
| GPU | NVIDIA A10G, 23 GB VRAM |
| CUDA toolkit | 12.1 (PyTorch build) |
| Python | 3.11.14 |
| PyTorch | 2.5.1+cu121 |
| Orchestration | Snakemake 9.16.3 (34 rules) |
| Tracking | MLflow 3.10.0 |
| Versioning | DVC (configured; partial integration with the primary pipeline) |

Dataset coverage

Eight modality families. The table summarizes the integrated adapters by family; see the note below for the full adapter accounting.

| Modality | Dataset | Source | Samples | Task |
| --- | --- | --- | --- | --- |
| Clinical tabular | Synthea readmission | Synthea (synthetic EHR) | 6,625 | Binary classification |
| Omics | Visium breast cancer | 10x Genomics | 3,798 | 14-class tissue region |
| Knowledge graphs | Hetionet | Himmelstein et al. | 1,913 ego-graphs | 8-class node classification |
| Histopathology | PathMNIST | MedMNIST | 107,180 | 9-class classification |
| Clinical text | MTSamples | MTSamples.com | ~3,500 | 5-class specialty |
| Single-cell | BloodMNIST | MedMNIST | 17,092 | 8-class classification |
| Drug discovery | ChEMBL bioactivity | ChEMBL database | 4,685 | Binary bioactivity |
| Classical ML | Iris, Wine, Breast Cancer, Digits | UCI / scikit-learn | 150–1,797 | Multi-class classification |

Adapter accounting: 46 adapters are declared in the data registry across these 8 families. 37 are integrated (one or more adapters per row above); 9 are specified but not yet integrated and are excluded from the reported runs.

Training configuration

Per-modality model choice and key hyperparameters. Full configurations live in versioned YAML referenced by each run.

| Modality | Model | Key hyperparameters | Training details |
| --- | --- | --- | --- |
| Clinical tabular | XGBoost + Optuna | n_estimators=100–300, max_depth=6, lr=0.1 | Optuna HPO (30 trials), 5-fold CV |
| Omics | VAE | latent_dim=128, hidden=[512,256], epochs=80 | Visium HVG, Leiden labels, 80/20 val |
| Knowledge graphs | GAT | layers=2, hidden=64, heads=4, epochs=200 | Center-node readout, cosine LR |
| Histopathology | DenseNet-121 | ImageNet pretrained, epochs=20, lr=1e-4 | 28×28, PathMNIST augmentation |
| Clinical text | ClinicalBERT | epochs=15, lr=2e-5, batch=16 | Mean-pooled embeddings, 5-class |
| Single-cell | DenseNet-121 | ImageNet pretrained, epochs=20, lr=1e-4 | 28×28, BloodMNIST |
| Drug discovery | GCN | layers=3, hidden=128, epochs=100 | Class-weighted loss, molecular featurization |
| Multi-modal fusion | Attention fusion | heads=4, hidden=256 | PCA embeddings from training set only |

Each run is defined by a YAML committed to version control; seed=1337 applies across numpy, torch, and sklearn unless a modality-specific seed is documented.

Reproducibility controls and boundaries

Each control is paired with its current limit. Determinism is treated as a bounded engineering property within this scope.

Seeds

Fixed global seed

seed=1337 for numpy, torch, and sklearn random operations.

Limit: stochastic library calls outside these three sources are not centrally seeded.
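The seeding policy above can be sketched as a small helper. This is a minimal illustration of the documented policy, not the pipeline's actual code; `set_global_seed` is a hypothetical name, and torch is imported defensively so the sketch also runs in CPU-only or torch-free environments.

```python
import random

import numpy as np

GLOBAL_SEED = 1337  # the fixed seed documented for all reported runs


def set_global_seed(seed: int = GLOBAL_SEED) -> None:
    """Seed the three sources named in the text: stdlib, numpy, torch.

    sklearn draws from numpy's global RNG by default, so seeding numpy
    covers sklearn estimators that are not given an explicit random_state.
    """
    random.seed(seed)     # python stdlib RNG
    np.random.seed(seed)  # numpy global RNG (and, transitively, sklearn defaults)
    try:
        import torch

        torch.manual_seed(seed)           # CPU RNG
        torch.cuda.manual_seed_all(seed)  # all visible GPUs, if any
    except ImportError:
        pass  # torch not installed; stdlib/numpy seeding still applies
```

As the limit above notes, libraries that keep their own RNG state outside these three sources would need to be seeded separately.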

CUDA determinism

Deterministic algorithms

torch.use_deterministic_algorithms(True) is enabled where supported by the operator set in use.

Limit: a small number of GPU kernels remain non-deterministic and are documented per modality where they apply.
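A minimal sketch of the determinism setup described above, assuming the standard PyTorch mechanism. The cuBLAS workspace variable must be set before the first CUDA call, which is why it is placed ahead of the torch import; the torch import itself is guarded so the snippet degrades gracefully off-GPU.

```python
import os

# Required by PyTorch for deterministic cuBLAS kernels on CUDA >= 10.2;
# must be set before CUDA is initialised.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

try:
    import torch

    # Raise an error rather than silently fall back when an op in the
    # model's operator set has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False  # autotuning picks kernels non-reproducibly
except ImportError:
    pass  # CPU-only / torch-free environment
```

With `use_deterministic_algorithms(True)`, the non-deterministic kernels mentioned in the limit above surface as runtime errors, which is how they can be documented per modality.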

Versioned configuration

YAML-defined runs

Every run is defined by a YAML committed to version control; the run record references the exact configuration hash.

Limit: external dataset hosts are pinned by URL/version, not by content hash for every adapter.
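The configuration hash referenced by each run record can be illustrated as follows. This is a sketch of the idea, not the pipeline's actual scheme: it canonicalises the parsed configuration (sorted keys, fixed separators) before hashing, so two files that differ only in key order or whitespace hash identically. JSON is used here to keep the example dependency-free; the real runs are defined in YAML.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return a short stable hash of a parsed run configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical run configuration mirroring the knowledge-graph row above.
run_cfg = {"seed": 1337, "model": "GAT", "layers": 2, "hidden": 64, "heads": 4}
print(config_hash(run_cfg))
```

Hashing the parsed structure rather than the raw file is a deliberate choice: it makes the hash insensitive to formatting-only edits while still changing on any semantic change.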

Execution order

Snakemake DAG

Stage dependencies are declared in the Snakefile. The DAG enforces ordering with no implicit stage coupling.

Limit: parallel rule execution can interleave logs; per-rule artifacts remain ordered.
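The DAG-enforced ordering can be pictured with a minimal Snakefile fragment. Rule names, paths, and scripts here are hypothetical, not the pipeline's actual 34 rules; the point is only that `train` declares `preprocess`'s output as its input, so Snakemake derives the ordering from the file dependencies rather than from any implicit stage coupling.

```
rule preprocess:
    input: "data/raw/{modality}.parquet"
    output: "data/processed/{modality}.parquet"
    shell: "python scripts/preprocess.py {input} {output}"

rule train:
    input: "data/processed/{modality}.parquet"
    output: "artifacts/{modality}/model.bin"
    shell: "python scripts/train.py {input} {output}"
```

Because ordering lives in the declared inputs and outputs, adding a stage cannot silently reorder existing ones, which is the property the control above relies on.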

Artifact integrity

Hashed bundles

DVC checksum workflows are configured for the staged artifact bundles produced by each run.

Limit: full end-to-end DVC enforcement across the primary pipeline is in progress.

Code-quality posture

Static analysis

Pyright reports 0 errors and Ruff 0.15.4 reports 0 violations on the pipeline source at the time of the reported runs.

Limit: static checks bound code quality, not numerical correctness; correctness is bounded by the validation surface.

Per-run lifecycle

From configuration to evidence bundle

Figure: SNPTX per-run reproducibility lifecycle (seed, DAG, tracking, hashing). A run begins from a YAML configuration with a fixed seed, executes through the Snakemake DAG with deterministic CUDA where supported, is tracked by MLflow, produces DVC-hashed artifacts, and emits a referenced evidence bundle: config.yaml (seed=1337, pinned versions) → snakemake run (DAG order, deterministic CUDA) → mlflow + dvc (tracked params, hashed artifacts) → evidence bundle (metrics + manifest + config hash).

A run is fully described by its configuration hash and the artifact hashes it produces. Re-execution from the same configuration on the same environment is expected to reproduce the bundle within the determinism limits noted above.
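The "run is fully described by its hashes" claim can be made concrete with a sketch of the evidence-bundle manifest. Field names and the helper itself are illustrative assumptions, not the pipeline's actual schema: the record simply joins the configuration hash with the hashes of the artifacts that configuration produced.

```python
import hashlib
import json


def evidence_manifest(config: dict, artifact_hashes: dict) -> dict:
    """Assemble a per-run record: config hash plus produced-artifact hashes.

    Two runs agree iff their manifests are equal, which is the
    re-execution check described above (within the determinism limits).
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return {
        "config_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16],
        "seed": config.get("seed"),
        "artifacts": dict(sorted(artifact_hashes.items())),
    }
```

A re-executed run on the same environment is then verified by comparing its manifest against the stored one; a mismatch localises the drift to either the configuration or a specific artifact.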