Tune candidate protocols on the development split — tune

Runs every protocol over the rows of the chosen gold-set split (default "dev") and scores each against the gold labels: accuracy with a bootstrap CI, macro-F1, and the number of parse failures. This is the tuning loop: iterate freely here; the test split stays sealed for the one protocol you lock.
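The three metrics named above can be illustrated in base R. This is a minimal sketch, not the package's implementation: the percentile bootstrap, the 2000 resamples, the 95% level, and the choice to count unparseable responses (NA) as errors are all assumptions, and `gold_labels` and `pred_labels` are made-up toy vectors.

```r
# Illustrative only: base-R versions of the reported metrics.
# Assumptions: percentile bootstrap, 2000 resamples, 95% level;
# unparseable responses are NA and count as errors for accuracy.
gold_labels <- c("positive", "positive", "negative",
                 "negative", "positive", "negative")
pred_labels <- c("positive", "negative", "negative",
                 "negative", "positive", NA)

# TRUE where the parsed prediction matches the gold label
correct <- !is.na(pred_labels) & pred_labels == gold_labels
accuracy <- mean(correct)
parse_failures <- sum(is.na(pred_labels))

# Percentile bootstrap over rows for the accuracy interval
set.seed(1)
boot <- replicate(2000, mean(sample(correct, replace = TRUE)))
acc_ci <- quantile(boot, c(0.025, 0.975), names = FALSE)

# Per-class F1, then the unweighted mean across classes (macro-F1)
f1_one <- function(cls) {
  tp <- sum(correct & gold_labels == cls)
  fp <- sum(!is.na(pred_labels) & pred_labels == cls & gold_labels != cls)
  fn <- sum(gold_labels == cls & !correct)
  if (tp == 0) return(0)
  2 * tp / (2 * tp + fp + fn)
}
macro_f1 <- mean(vapply(unique(gold_labels), f1_one, numeric(1)))
```

Macro-F1 weights every category equally, so a protocol that ignores a rare category is penalized even when its overall accuracy looks good.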

Usage

tune_protocol(protocols, gold, split = "dev", .runner = NULL, ...)

Arguments

protocols: A list of protocol() objects (or a single one).
gold: A gold_set().
split: Which split to evaluate on; "test" is refused here. Evaluating on the test split is validate_protocol()'s job, and it leaves a ledger entry.
.runner: Internal seam for tests: a function (experiments, ...) returning the experiments with a response_text column. When NULL (the default), LLMR::call_llm_par() is used.
...: Passed to the runner (e.g. tries, progress).
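The .runner contract described above can be sketched with a stub that never calls a model. This is a hypothetical example: `stub_runner` is an invented name, `experiments` is assumed to be a data frame, and the fixed reply "positive" is a placeholder, since the response text a protocol's parser accepts is not specified here.

```r
# Hypothetical stub runner for tests: adds a response_text column
# with a fixed reply instead of calling an LLM.
# Assumption: `experiments` is a data frame with one row per call.
stub_runner <- function(experiments, ...) {
  experiments$response_text <- rep("positive", nrow(experiments))
  experiments
}

# Standalone check on a plain data frame
out <- stub_runner(data.frame(id = 1:3))
```

In a test it would be passed as `tune_protocol(protocols, gold, .runner = stub_runner)`, which keeps the scoring path exercised without network access.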

Value

A protocol_tuning result: a tibble with one row per protocol (columns protocol, n, accuracy, acc_lo, acc_hi, macro_f1, and parse_failures, plus tokens when the runner reports usage), with per-protocol detail in attr(x, "per_category").
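One way to read such a result is to rank protocols by accuracy and check whether the leaders' intervals overlap. The sketch below uses a hand-made data frame with the documented column names; the values and the protocol label "with_examples" are placeholders, not real output.

```r
# Hand-made stand-in for a tuning result; values are placeholders.
res <- data.frame(
  protocol       = c("baseline", "with_examples"),
  n              = c(10, 10),
  accuracy       = c(0.70, 0.90),
  acc_lo         = c(0.40, 0.70),
  acc_hi         = c(1.00, 1.00),
  macro_f1       = c(0.69, 0.90),
  parse_failures = c(1, 0)
)

# Rank by accuracy, breaking ties by fewer parse failures
ranked <- res[order(-res$accuracy, res$parse_failures), ]
best <- ranked$protocol[1]

# The top two are not clearly separated when their intervals overlap
overlap <- ranked$acc_lo[1] <= ranked$acc_hi[2]
```

On a real result, `attr(res, "per_category")` holds the per-protocol detail for a closer look at where a protocol fails.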

Examples

if (FALSE) { # \dontrun{
# Not run: the default runner calls a live LLM provider.
# A two-category codebook for tone
cb <- codebook("tone", "one sentence",
  list(cb_category("positive", "Approving."),
       cb_category("negative", "Critical.")))
# Toy gold data: 10 positive and 10 negative texts
gold_data <- data.frame(
  text  = c(paste("clear benefit", 1:10), paste("serious harm", 1:10)),
  label = rep(c("positive", "negative"), each = 10))
# Split the gold set evenly into dev and test
g <- gold_set(gold_data, text = "text", labels = "label",
              split = c(dev = 0.5, test = 0.5))
cfg <- LLMR::llm_config("groq", "openai/gpt-oss-20b", temperature = 0)
# Score a single candidate protocol on the dev split
tune_protocol(list(protocol(cb, cfg, label = "baseline")), g)
} # }