Autonomous experimentation

Evidence for the current campaign runner: how benchmark campaigns are controlled, what outcomes are reported, and what these results establish.

Evidence view
Scoped benchmark campaign execution

Closed-loop experiment selection, execution, and update for supported benchmark campaigns.

This page documents the campaign controller as implemented in SNPTX. On the reported benchmark suites, it selects configurations from a declared search space, executes runs, classifies outcomes against the experiment record, and writes the updated results back to that record, where they inform subsequent selection. The evidence here concerns campaign control, throughput, and observed outcome quality within that reported setting.

Loop structure

Selection, execution, evaluation, and catalog update within a declared campaign boundary

The loop acts on a predefined campaign surface. It proposes supported configurations, executes the standard pipeline, compares outcomes with recorded history, and writes updated results back to the catalog for subsequent rounds.

[Figure: SNPTX autonomous experimentation loop. A bounded campaign loop within the supported benchmark campaign surface, with four stages: candidate selection (adaptive defaults plus meta-analysis, restricted to the supported configuration space); pipeline execution (standard run orchestration, with metrics and artifacts captured); evaluation and classification (comparison against recorded history, tagging new_best, novel_config, or surprise); and catalog update (outcomes and updated priors persisted for next-round selection).]
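
Read as pseudocode, the figure reduces to a short closed cycle. The sketch below is illustrative only: the Catalog class, the random-choice selection placeholder, and the record schema are assumptions for exposition, not SNPTX interfaces.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Catalog:
        """In-memory stand-in for the experiment record (hypothetical)."""
        runs: list = field(default_factory=list)

        def best_score(self) -> float:
            return max((r["score"] for r in self.runs), default=float("-inf"))

        def record(self, config: dict, score: float, tags: list) -> None:
            self.runs.append({"config": config, "score": score, "tags": tags})

    def run_campaign(search_space: list, execute, catalog: Catalog, n_rounds: int) -> Catalog:
        for _ in range(n_rounds):
            config = random.choice(search_space)  # placeholder for adaptive, prior-driven selection
            score = execute(config)               # standard pipeline run; metrics and artifacts captured
            tags = ["new_best"] if score > catalog.best_score() else []
            catalog.record(config, score, tags)   # persisted, so the outcome informs the next round
        return catalog

The property the page reports rests on the last line of the loop body: each outcome is written back before the next selection, which is what makes the loop closed rather than a fixed batch sweep.
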
Demonstrated

Once datasets, search space, and evaluation procedure are declared, the system can run repeated benchmark campaigns without manual stepwise intervention.
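
For concreteness, such a declaration could take a shape like the following sketch. All field names are hypothetical; the dataset names echo the tables later on this page, and the model-family count comes from the campaign summary below.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CampaignSpec:
        """Hypothetical declaration; SNPTX's actual configuration schema is not shown here."""
        datasets: tuple          # e.g. ("iris", "wine", "digits", "breast_cancer", "synthea_readmission")
        model_families: tuple    # the summary below reports 6 model types; only XGBoost is named on this page
        search_space: dict       # hyperparameter ranges per model family
        metric: str = "accuracy" # the declared evaluation procedure
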

Evidence model

The evidence shown here is operational rather than theoretical: run volume, throughput, outcome classification, and a representative calibration check for one reported run.

Not shown here

This figure does not address how the same controller behaves under shifted data, expanded search spaces, tighter runtime constraints, or materially different downstream workflows.

Campaign context

Summary measures from the newer phase_c_validation metrics artifact

These headline values come from the newer phase_c_validation autonomous metrics artifact. They summarize aggregate run volume, discovery yield, and runtime for a later validation pass. The dataset and model-family counts still reflect the current autonomy configuration, while the detailed tables below remain artifact-specific examples rather than a single unified report.

Experiments    1,065
Discoveries    101
Exp / Hour     976.7
Elapsed        1.09 hr
Datasets       5
Model Types    6
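
The headline figures are internally consistent: dividing run volume by elapsed time recovers the reported throughput up to rounding of the elapsed value.

    experiments, elapsed_hours = 1_065, 1.09              # values copied from the summary above
    print(f"{experiments / elapsed_hours:.1f} exp/hour")  # 977.1, vs. the reported 976.7
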
Campaign outcomes

Detailed per-dataset checkpoint example

The newer validation metrics artifact does not expose a per-dataset best-result table. To keep the page inspectable at the dataset level, this table continues to show the archived autonomous_1k checkpoint, which preserves the strongest saved accuracy for each dataset in that benchmark run.

Dataset               Best accuracy
Iris                  100.0%
Wine                  100.0%
Digits                99.44%
Breast Cancer         97.37%
Synthea Readmission   77.96%
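
If the archived checkpoint is available, this table can be regenerated from it. The sketch below assumes a JSON serialization with a per-dataset layout; both the path and the structure are assumptions, since the page identifies the artifact only as autonomous_1k.

    import json

    # Hypothetical path and layout for the archived checkpoint.
    with open("artifacts/autonomous_1k_checkpoint.json") as f:
        checkpoint = json.load(f)

    # Assumed structure: {"datasets": {"iris": {"best_accuracy": 1.0}, ...}}
    for name, entry in sorted(checkpoint["datasets"].items()):
        print(f"{name:<22}{entry['best_accuracy']:.2%}")
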
Reliability context

Separate saved calibration artifact

This calibration summary is not emitted by the newer validation metrics artifact or by the archived autonomous_1k run report. It is a separate saved model-level artifact for a Synthea readmission XGBoost model and is included here only as adjacent reliability context.

Metric                        Value
Artifact                      models/synthea_readmission_xgboost_model_300.pkl
Model / dataset               XGBoost on Synthea Readmission
Expected Calibration Error    0.0548
Brier Score                   0.1599
Calibration quality           GOOD
Mean predicted probability    58.1%
Observed positive rate        58.9%
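
Expected Calibration Error and Brier score are standard quantities. A minimal sketch of how they are typically computed, assuming equal-width binning for ECE (the artifact's own binning scheme is not stated):

    import numpy as np

    def brier_score(y_true: np.ndarray, y_prob: np.ndarray) -> float:
        """Mean squared error between predicted probabilities and 0/1 outcomes."""
        return float(np.mean((y_prob - y_true) ** 2))

    def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
        """Weighted average gap, per equal-width bin, between mean predicted
        probability and observed positive rate."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (y_prob >= lo) & (y_prob <= hi) if hi == 1.0 else (y_prob >= lo) & (y_prob < hi)
            if mask.any():
                ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
        return float(ece)

The last two rows of the table are the aggregate form of the same comparison: mean predicted probability (58.1%) against observed positive rate (58.9%).
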

Interpretation and limits

The central question is not whether the loop is autonomous in the abstract, but what the reported evidence establishes about campaign behavior and what remains untested.

What is demonstrated

Repeated campaign execution

Across the reported metrics and checkpoint artifacts, SNPTX maintains the full campaign cycle, from candidate selection through outcome recording, while preserving an inspectable record of configurations, outputs, and outcome tags.
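
The outcome tags named in the figure (new_best, novel_config, surprise) suggest a simple classification rule. The sketch below is one plausible reading; the actual thresholds and record schema are not documented on this page.

    import statistics

    def config_key(config: dict) -> tuple:
        """Canonical, hashable key for a configuration dict."""
        return tuple(sorted(config.items()))

    def classify_outcome(run: dict, history: list) -> list:
        """Tag a finished run against recorded history (hypothetical schema:
        each record is a dict with 'config' and 'score' fields)."""
        tags = []
        scores = [r["score"] for r in history]
        if not scores or run["score"] > max(scores):
            tags.append("new_best")
        if config_key(run["config"]) not in {config_key(r["config"]) for r in history}:
            tags.append("novel_config")
        # "surprise" read as a large positive deviation from history; the real rule is undocumented.
        if len(scores) >= 2 and run["score"] > statistics.mean(scores) + 2 * statistics.stdev(scores):
            tags.append("surprise")
        return tags
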

What is inferred

Directionally effective search

Aggregate discovery counts and checkpoint-level best results indicate that the selection policy can surface productive configurations within the declared search space. They do not establish global optimality or uniform superiority over alternative search strategies.

What is out of scope

Transfer beyond the reported setting

The evidence here is confined to the reported benchmark surfaces, model families, and search regime. It does not by itself establish that the same policy will transfer unchanged to new modalities, broader search spaces, different resource constraints, or downstream operational workflows.

Interpretive note

What this page establishes is narrower and more concrete: the controller can manage repeated benchmark campaigns while preserving inspectable metrics and checkpoint artifacts, even when different evidentiary slices are reported in separate files. Separate experiments would still be needed to compare search efficiency rigorously, assess robustness under distribution shift, or evaluate transfer to laboratory or clinical workflows.