Closed-loop experiment selection, execution, and update for supported benchmark campaigns.
This page documents the campaign controller as implemented in SNPTX. On the reported benchmark suites, it selects configurations from a declared search space, executes runs, classifies outcomes against the experiment record, and feeds the updated results back into subsequent selection. The evidence here concerns campaign control, throughput, and observed outcome quality within that reported setting.
Selection, execution, evaluation, and catalog update within a declared campaign boundary
The loop acts on a predefined campaign surface. It proposes supported configurations, executes the standard pipeline, compares outcomes with recorded history, and writes updated results back to the catalog for subsequent rounds.
Once datasets, search space, and evaluation procedure are declared, the system can run repeated benchmark campaigns without manual stepwise intervention.
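As a rough illustration of that cycle, the sketch below wires the four steps together around a toy scoring function. Every name in it (the search space, run_pipeline, classify_outcome, the improved/no-gain tags) is a hypothetical stand-in; SNPTX's actual interfaces are not documented on this page.

```python
# Minimal, self-contained sketch of the select / execute / evaluate / update
# cycle described above. All names and values are hypothetical stand-ins,
# not SNPTX's real API.
import random


def run_pipeline(config):
    """Toy stand-in for the standard pipeline: score a configuration."""
    return 1.0 - abs(config["lr"] - 0.1) - 0.01 * config["depth"]


def classify_outcome(score, records):
    """Tag the run against recorded history: did it beat the best so far?"""
    best = max((r["score"] for r in records), default=float("-inf"))
    return "improved" if score > best else "no-gain"


def run_campaign(search_space, n_rounds, seed=0):
    rng = random.Random(seed)
    catalog = []  # inspectable record: configuration, output, outcome tag
    for _ in range(n_rounds):
        # 1. Selection: propose a supported configuration from the declared space.
        config = {k: rng.choice(v) for k, v in search_space.items()}
        # 2. Execution: run the standard pipeline.
        score = run_pipeline(config)
        # 3. Evaluation: classify the outcome against recorded history.
        tag = classify_outcome(score, catalog)
        # 4. Update: write the result back so it informs subsequent rounds.
        catalog.append({"config": config, "score": score, "tag": tag})
    return catalog


space = {"lr": [0.01, 0.1, 0.3], "depth": [3, 5, 7]}
print(run_campaign(space, n_rounds=5)[-1])
```

The only structural requirement the sketch captures is that each round reads the catalog before selecting and writes to it after evaluating; that read-write coupling is what closes the loop.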
The evidence shown here is operational rather than theoretical: run volume, throughput, outcome classification, and a representative calibration check for one reported run.
This page does not address how the same controller behaves under shifted data, expanded search spaces, tighter runtime constraints, or materially different downstream workflows.
Summary measures from the newer phase_c_validation metrics artifact
These headline values come from the newer phase_c_validation autonomous metrics artifact. They summarize aggregate run volume, discovery yield, and runtime for a later validation pass. The dataset and model-family counts still reflect the current autonomy configuration, while the detailed tables below remain artifact-specific examples rather than a single unified report.
Detailed per-dataset checkpoint example
The newer validation metrics artifact does not expose a per-dataset best-result table. To keep the page inspectable at the dataset level, this table continues to show the archived autonomous_1k checkpoint, which preserves the strongest saved accuracy for each dataset in that benchmark run; a minimal reconstruction sketch follows the table.
| Dataset | Best saved accuracy |
|---|---|
| Iris | 100.0% |
| Wine | 100.0% |
| Digits | 99.44% |
| Breast Cancer | 97.37% |
| Synthea Readmission | 77.96% |
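For readers who want to rebuild this kind of table from their own run records, the sketch below shows the reduction involved: keep the strongest saved accuracy seen for each dataset. The record layout and the second run per dataset are illustrative assumptions, not the archived checkpoint's actual schema.

```python
# Hypothetical illustration of deriving a per-dataset best-result table
# from saved run records. Field names and the non-best values are assumed.
from collections import defaultdict

runs = [
    {"dataset": "Iris", "accuracy": 1.000},
    {"dataset": "Iris", "accuracy": 0.967},
    {"dataset": "Digits", "accuracy": 0.9944},
    {"dataset": "Digits", "accuracy": 0.9811},
]

best = defaultdict(float)
for run in runs:
    # Keep only the strongest saved accuracy per dataset.
    best[run["dataset"]] = max(best[run["dataset"]], run["accuracy"])

for dataset, acc in sorted(best.items()):
    print(f"{dataset}: {acc:.2%}")
```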
Separate saved calibration artifact
This calibration summary is not emitted by the newer validation metrics artifact or by the archived autonomous_1k run report. It is a separate saved model-level artifact for a Synthea readmission XGBoost model and is included here only as adjacent reliability context; a short sketch of how its two headline metrics are computed follows the table.
| Metric | Value |
|---|---|
| Artifact | models/synthea_readmission_xgboost_model_300.pkl |
| Model / dataset | XGBoost on Synthea Readmission |
| Expected Calibration Error | 0.0548 |
| Brier Score | 0.1599 |
| Calibration quality | GOOD |
| Mean predicted probability | 58.1% |
| Observed positive rate | 58.9% |
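The two headline metrics in this table have standard definitions, sketched below on toy arrays. The Brier score is the mean squared error between predicted probabilities and binary outcomes; Expected Calibration Error is the occupancy-weighted gap between mean predicted probability and observed positive rate within probability bins. The 10-bin equal-width binning is an assumption on our part; the page does not state how the saved artifact binned its ECE.

```python
# Sketch of the two calibration metrics reported above, on toy arrays.
# Binning choice (10 equal-width bins) is an assumption, not the artifact's
# recorded configuration.
import numpy as np
from sklearn.metrics import brier_score_loss


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: confidence-vs-accuracy gap, weighted by bin occupancy."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap
    return ece


y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.6, 0.4, 0.55])

print("ECE:  ", round(expected_calibration_error(y_true, y_prob), 4))
print("Brier:", round(brier_score_loss(y_true, y_prob), 4))
# Mirrors the last two table rows: mean confidence vs observed base rate.
print("mean p:", y_prob.mean(), " pos rate:", y_true.mean())
```

The final comparison is the same one the table makes: a small gap between mean predicted probability (58.1%) and observed positive rate (58.9%) is consistent with the reported calibration quality.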
Interpretation and limits
The central question is not whether the loop is autonomous in the abstract, but what the reported evidence establishes about campaign behavior and what remains untested.
Repeated campaign execution
Across the reported metrics and checkpoint artifacts, SNPTX maintains the full campaign cycle from candidate selection through recorded outcome update while preserving an inspectable record of configurations, outputs, and outcome tags.
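As a concrete illustration of that inspectability, the sketch below runs the kind of audit pass such a record permits: count outcome tags and recover the configurations behind each gain. The record fields mirror the toy catalog sketched earlier, not SNPTX's real artifact schema.

```python
# Hypothetical audit pass over a saved campaign catalog. Field names are
# assumed; the point is that configs, outputs, and tags stay queryable.
import json
from collections import Counter

catalog = [
    {"config": {"lr": 0.1, "depth": 5}, "score": 0.95, "tag": "improved"},
    {"config": {"lr": 0.3, "depth": 7}, "score": 0.88, "tag": "no-gain"},
    {"config": {"lr": 0.1, "depth": 3}, "score": 0.96, "tag": "improved"},
]

print(Counter(r["tag"] for r in catalog))  # outcome classification summary
for r in catalog:
    if r["tag"] == "improved":
        print(json.dumps(r["config"]))     # configurations behind each gain
```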
Directionally effective search
Aggregate discovery counts and checkpoint-level best results indicate that the selection policy can surface productive configurations within the declared search space. They do not establish global optimality or uniform superiority over alternative search strategies.
Transfer beyond the reported setting
The evidence here is confined to the reported benchmark surfaces, model families, and search regime. It does not by itself establish that the same policy will transfer unchanged to new modalities, broader search spaces, different resource constraints, or downstream operational workflows.
What this page establishes is narrower and more concrete: the controller can manage repeated benchmark campaigns while preserving inspectable metrics and checkpoint artifacts, even when different evidentiary slices are reported in separate files. Separate experiments would still be needed to compare search efficiency rigorously, assess robustness under distribution shift, or evaluate transfer to laboratory or clinical workflows.