Runs every protocol over the rows of the chosen gold-set split (default "dev")
and scores each against the gold labels: accuracy with a bootstrap CI,
macro-F1, and parse failures. This is the tuning loop: iterate freely
here; the test split waits, sealed, for the one protocol you lock.
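To make the reported metrics concrete, here is a minimal base-R sketch of accuracy, a percentile bootstrap interval for it, and macro-F1. The helper names (accuracy, macro_f1, boot_ci) are hypothetical and are not the package's internals; the percentile method, the number of resamples, and counting a parse failure (NA) as an incorrect prediction are all assumptions made for illustration.

```r
# Hypothetical helpers for illustration; not the package's implementation.

# Share of rows where the prediction matches the gold label.
# Assumption: an unparsed response (NA) counts as wrong.
accuracy <- function(truth, pred) {
  mean(!is.na(pred) & pred == truth)
}

# Unweighted mean of per-class F1, over the classes present in the gold labels.
macro_f1 <- function(truth, pred) {
  classes <- unique(truth)
  f1 <- vapply(classes, function(k) {
    tp <- sum(pred == k & truth == k, na.rm = TRUE)
    fp <- sum(pred == k & truth != k, na.rm = TRUE)
    fn <- sum(truth == k & (is.na(pred) | pred != k))
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  }, numeric(1))
  mean(f1)
}

# Percentile bootstrap CI for accuracy: resample rows with replacement.
boot_ci <- function(truth, pred, reps = 2000, level = 0.95) {
  n <- length(truth)
  stats <- replicate(reps, {
    i <- sample.int(n, n, replace = TRUE)
    accuracy(truth[i], pred[i])
  })
  alpha <- (1 - level) / 2
  unname(quantile(stats, c(alpha, 1 - alpha)))
}

truth <- c("positive", "positive", "positive", "negative", "negative", "negative")
pred  <- c("positive", "positive", "negative", "negative", "negative", NA)

set.seed(1)
acc            <- accuracy(truth, pred)   # 4 of 6 rows are correct
mf1            <- macro_f1(truth, pred)
ci             <- boot_ci(truth, pred)
parse_failures <- sum(is.na(pred))        # one response could not be parsed
```

With a dev split this small the interval is wide, which is one reason to compare protocols on the interval rather than on the point estimate alone.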
Arguments
- protocols
A list of protocol() objects (or a single one).
- gold
A gold_set().
- split
Which split to evaluate on; "test" is refused here. That is validate_protocol()'s job, and it leaves a ledger entry.
- .runner
Internal seam for tests: a function (experiments, ...) returning the experiments with a response_text column. Default LLMR::call_llm_par().
- ...
Passed to the runner (e.g. tries, progress).
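The .runner contract described above can be sketched with a stand-in that never calls an LLM. This is only an illustration: fake_runner is a hypothetical name, the text column of the toy experiments table is an assumption, and the source does not say what shape response_text must take (a bare label is assumed here).

```r
# Hypothetical stand-in runner: answers "positive" for every row, offline.
# It follows the documented shape: take (experiments, ...) and return the
# experiments with a response_text column added.
fake_runner <- function(experiments, ...) {
  experiments$response_text <- rep("positive", nrow(experiments))
  experiments
}

# Toy input; the real experiments table built by the package may carry
# different columns (assumption).
experiments <- data.frame(text = c("clear benefit 1", "serious harm 1"))

# Extra arguments such as tries are accepted and ignored by this stub.
out <- fake_runner(experiments, tries = 1)
```

A function of this shape could then be supplied as the .runner argument in a unit test, provided its responses match whatever format the protocol's parser expects.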
Value
A protocol_tuning result: a tibble with one row per protocol and the
columns protocol, n, accuracy, acc_lo, acc_hi, macro_f1,
parse_failures, and, when the runner reports usage, tokens.
Per-protocol detail is stored in attr(x, "per_category").
Examples
if (FALSE) { # \dontrun{
  # Two-category codebook for tone.
  cb <- codebook("tone", "one sentence",
                 list(cb_category("positive", "Approving."),
                      cb_category("negative", "Critical.")))

  # Toy gold data: 10 positive and 10 negative texts.
  gold_data <- data.frame(
    text  = c(paste("clear benefit", 1:10), paste("serious harm", 1:10)),
    label = rep(c("positive", "negative"), each = 10))

  # Split the gold set evenly into dev and test.
  g <- gold_set(gold_data, text = "text", labels = "label",
                split = c(dev = 0.5, test = 0.5))

  # Score one baseline protocol on the default "dev" split.
  cfg <- LLMR::llm_config("groq", "openai/gpt-oss-20b", temperature = 0)
  tune_protocol(list(protocol(cb, cfg, label = "baseline")), g)
} # }