Closed-loop experiment selection, execution, and update for supported benchmark campaigns.
This page documents the campaign controller as implemented in SNPTX. On the reported benchmark suites, it selects configurations from a declared search space, executes runs, classifies outcomes against the experiment record, and feeds the updated results back into subsequent selection. The evidence here concerns campaign control, throughput, and observed outcome quality within that reported setting.
Selection, execution, evaluation, and catalog update within a declared campaign boundary
The loop acts on a predefined campaign surface. It proposes supported configurations, executes the standard pipeline, compares outcomes with recorded history, and writes updated results back to the catalog for subsequent rounds.
Once datasets, search space, and evaluation procedure are declared, the system can run repeated benchmark campaigns without manual stepwise intervention.
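As a rough illustration of that cycle, the sketch below wires the four steps together around a toy scoring function. Every name in it (the search space, run_pipeline, classify_outcome, the improved/no-gain tags) is a hypothetical stand-in; SNPTX's actual interfaces are not documented on this page.

```python
# Minimal, self-contained sketch of the select / execute / evaluate / update
# cycle described above. All names and values are hypothetical stand-ins,
# not SNPTX's real API.
import random


def run_pipeline(config):
    """Toy stand-in for the standard pipeline: score a configuration."""
    return 1.0 - abs(config["lr"] - 0.1) - 0.01 * config["depth"]


def classify_outcome(score, records):
    """Tag the run against recorded history: did it beat the best so far?"""
    best = max((r["score"] for r in records), default=float("-inf"))
    return "improved" if score > best else "no-gain"


def run_campaign(search_space, n_rounds, seed=0):
    rng = random.Random(seed)
    catalog = []  # inspectable record: configuration, output, outcome tag
    for _ in range(n_rounds):
        # 1. Selection: propose a supported configuration from the declared space.
        config = {k: rng.choice(v) for k, v in search_space.items()}
        # 2. Execution: run the standard pipeline.
        score = run_pipeline(config)
        # 3. Evaluation: classify the outcome against recorded history.
        tag = classify_outcome(score, catalog)
        # 4. Update: write the result back so it informs subsequent rounds.
        catalog.append({"config": config, "score": score, "tag": tag})
    return catalog


space = {"lr": [0.01, 0.1, 0.3], "depth": [3, 5, 7]}
print(run_campaign(space, n_rounds=5)[-1])
```

The only structural requirement the sketch captures is that each round reads the catalog before selecting and writes to it after evaluating; that read-write coupling is what closes the loop.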
The evidence shown here is operational rather than theoretical: run volume, throughput, outcome classification, and a representative calibration check for one reported run.
This page does not address how the same controller behaves under shifted data, expanded search spaces, tighter runtime constraints, or materially different downstream workflows.
Summary measures from the newer phase_c_validation metrics artifact
These headline values come from the newer phase_c_validation autonomous metrics artifact. They summarize aggregate run volume, discovery yield, and runtime for a later validation pass. The dataset and model-family counts still reflect the current autonomy configuration, while the detailed tables below remain artifact-specific examples rather than a single unified report.
Detailed per-dataset checkpoint example
The newer validation metrics artifact does not expose a per-dataset best-result table. To keep the page inspectable at the dataset level, this table continues to show the archived autonomous_1k checkpoint, which preserves the strongest saved accuracy for each dataset in that benchmark run; a minimal reconstruction sketch follows the table.
| Dataset | Best saved accuracy |
|---|---|
| Iris | 100.0% |
| Wine | 100.0% |
| Digits | 99.44% |
| Breast Cancer | 97.37% |
| Synthea Readmission | 77.96% |
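For readers who want to rebuild this kind of table from their own run records, the sketch below shows the reduction involved: keep the strongest saved accuracy seen for each dataset. The record layout and the second run per dataset are illustrative assumptions, not the archived checkpoint's actual schema.

```python
# Hypothetical illustration of deriving a per-dataset best-result table
# from saved run records. Field names and the non-best values are assumed.
from collections import defaultdict

runs = [
    {"dataset": "Iris", "accuracy": 1.000},
    {"dataset": "Iris", "accuracy": 0.967},
    {"dataset": "Digits", "accuracy": 0.9944},
    {"dataset": "Digits", "accuracy": 0.9811},
]

best = defaultdict(float)
for run in runs:
    # Keep only the strongest saved accuracy per dataset.
    best[run["dataset"]] = max(best[run["dataset"]], run["accuracy"])

for dataset, acc in sorted(best.items()):
    print(f"{dataset}: {acc:.2%}")
```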
Separate saved calibration artifact
This calibration summary is not emitted by the newer validation metrics artifact or by the archived autonomous_1k run report. It is a separate saved model-level artifact for a Synthea readmission XGBoost model and is included here only as adjacent reliability context; a short sketch of how its two headline metrics are computed follows the table.
| Metric | Value |
|---|---|
| Artifact | models/synthea_readmission_xgboost_model_300.pkl |
| Model / dataset | XGBoost on Synthea Readmission |
| Expected Calibration Error | 0.0548 |
| Brier Score | 0.1599 |
| Calibration quality | GOOD |
| Mean predicted probability | 58.1% |
| Observed positive rate | 58.9% |
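The two headline metrics in this table have standard definitions, sketched below on toy arrays. The Brier score is the mean squared error between predicted probabilities and binary outcomes; Expected Calibration Error is the occupancy-weighted gap between mean predicted probability and observed positive rate within probability bins. The 10-bin equal-width binning is an assumption on our part; the page does not state how the saved artifact binned its ECE.

```python
# Sketch of the two calibration metrics reported above, on toy arrays.
# Binning choice (10 equal-width bins) is an assumption, not the artifact's
# recorded configuration.
import numpy as np
from sklearn.metrics import brier_score_loss


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: confidence-vs-accuracy gap, weighted by bin occupancy."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap
    return ece


y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.6, 0.4, 0.55])

print("ECE:  ", round(expected_calibration_error(y_true, y_prob), 4))
print("Brier:", round(brier_score_loss(y_true, y_prob), 4))
# Mirrors the last two table rows: mean confidence vs observed base rate.
print("mean p:", y_prob.mean(), " pos rate:", y_true.mean())
```

The final comparison is the same one the table makes: a small gap between mean predicted probability (58.1%) and observed positive rate (58.9%) is consistent with the reported calibration quality.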
Interpretation and limits
The central question is not whether the loop is autonomous in the abstract, but what the reported evidence establishes about campaign behavior and what remains untested.
Repeated campaign execution
Across the reported metrics and checkpoint artifacts, SNPTX maintains the full campaign cycle from candidate selection through recorded outcome update while preserving an inspectable record of configurations, outputs, and outcome tags.
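As a concrete illustration of that inspectability, the sketch below runs the kind of audit pass such a record permits: count outcome tags and recover the configurations behind each gain. The record fields mirror the toy catalog sketched earlier, not SNPTX's real artifact schema.

```python
# Hypothetical audit pass over a saved campaign catalog. Field names are
# assumed; the point is that configs, outputs, and tags stay queryable.
import json
from collections import Counter

catalog = [
    {"config": {"lr": 0.1, "depth": 5}, "score": 0.95, "tag": "improved"},
    {"config": {"lr": 0.3, "depth": 7}, "score": 0.88, "tag": "no-gain"},
    {"config": {"lr": 0.1, "depth": 3}, "score": 0.96, "tag": "improved"},
]

print(Counter(r["tag"] for r in catalog))  # outcome classification summary
for r in catalog:
    if r["tag"] == "improved":
        print(json.dumps(r["config"]))     # configurations behind each gain
```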
Directionally effective search
Aggregate discovery counts and checkpoint-level best results indicate that the selection policy can surface productive configurations within the declared search space. They do not establish global optimality or uniform superiority over alternative search strategies.
Transfer beyond the reported setting
The evidence here is confined to the reported benchmark surfaces, model families, and search regime. It does not by itself establish that the same policy will transfer unchanged to new modalities, broader search spaces, different resource constraints, or downstream operational workflows.
What this page establishes is narrower and more concrete: the controller can manage repeated benchmark campaigns while preserving inspectable metrics and checkpoint artifacts, even when different evidentiary slices are reported in separate files. Separate experiments would still be needed to compare search efficiency rigorously, assess robustness under distribution shift, or evaluate transfer to laboratory or clinical workflows.