Validate a locked protocol on the sealed holdout split — validate

Evaluates one locked protocol on the sealed holdout split. The lock identifies the evaluated instrument by its hash. The function applies the configured replicate count and modal-label rule and, when the gold set was built with seal_holdout = TRUE, appends the evaluation to the ledger. Each holdout-split evaluation then appears in LLMR::report().
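
As a rough illustration of what a modal-label rule does with replicates, the base R sketch below picks the most frequent label among repeated codings of one item. The helper name modal_label() and the tie rule (first label seen wins) are assumptions made for this example; the package's own rule is set by the protocol and is not described here.

```r
# Hypothetical helper, not part of the package: most frequent label among
# the replicate codings of a single item.
modal_label <- function(labels) {
  labels <- labels[!is.na(labels)]          # drop replicates that failed to parse
  if (length(labels) == 0L) return(NA_character_)
  # Count labels in order of first appearance, so ties go to the label seen
  # first (an assumed tie rule).
  counts <- table(factor(labels, levels = unique(labels)))
  names(counts)[which.max(counts)]
}

modal_label(c("positive", "negative", "positive"))
```

With three replicates, one dissenting coding does not change the item's final label, which is the point of combining a replicate count with a modal rule.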

Usage

validate_protocol(protocol, gold, split = NULL, .runner = NULL, ...)

Arguments

protocol: A locked protocol().
gold: A gold_set().
split: Which split to evaluate on. Defaults to the gold set's holdout split ("test" unless gold_set() was given another holdout name).
.runner: Offline runner seam: a function(experiments, ...) that receives a data frame with config and messages list-columns and returns those rows with at least a response_text column. When NULL (the default), LLMR::call_llm_par() is used.
...: Passed to the runner (e.g. tries, progress).
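
The .runner contract described above can be sketched with a stub that never calls a model. This is a minimal sketch, not the package's implementation: the name stub_runner, the canned "positive" response, and the contents of the toy config and messages columns are all assumptions made for illustration. Only the (experiments, ...) signature, the two list-columns, and the response_text column come from the argument description.

```r
# Hypothetical offline runner: returns a canned response for every row
# instead of calling a model.
stub_runner <- function(experiments, ...) {
  stopifnot(is.data.frame(experiments))
  # Keep the incoming rows (including config and messages) and add the one
  # column the contract requires.
  experiments$response_text <- rep("positive", nrow(experiments))
  experiments
}

# Toy input with the two list-columns a runner is documented to receive.
# The contents of each cell are placeholders, not the package's real format.
experiments <- data.frame(id = 1:2)
experiments$config   <- list(list(model = "stub"), list(model = "stub"))
experiments$messages <- list(list(user = "clear benefit 1"),
                             list(user = "serious harm 1"))

out <- stub_runner(experiments)
out$response_text
```

Passing such a function as validate_protocol(p, g, .runner = stub_runner) would keep the evaluation offline, assuming the protocol can parse a bare label as a response.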

Value

A protocol_validation: accuracy with bootstrap CI, macro-F1, parse failures, total tokens (when the runner reported them), per-category table, confusion matrix, the protocol hash, the gold set's holdout split name, and the ledger position of this evaluation.
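
For readers who want to see how the headline numbers relate to each other, the base R sketch below computes accuracy, a percentile bootstrap interval, macro-F1, and a confusion matrix from toy labels. The data, the percentile method, the 2000 resamples, and the convention of scoring an undefined F1 as 0 are assumptions for this example; the function's own bootstrap settings are not documented in this section.

```r
# Toy gold and predicted labels (illustrative data, not package output).
gold <- c("positive", "positive", "positive", "negative", "negative", "negative")
pred <- c("positive", "positive", "negative", "negative", "negative", "positive")
levels_all <- c("positive", "negative")

# Confusion matrix: rows are gold labels, columns are predictions.
cm <- table(gold = factor(gold, levels_all), pred = factor(pred, levels_all))

# Accuracy is the share of items on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-category F1, then the unweighted mean across categories (macro-F1).
precision <- diag(cm) / colSums(cm)
recall    <- diag(cm) / rowSums(cm)
f1        <- 2 * precision * recall / (precision + recall)
f1[is.nan(f1)] <- 0   # undefined F1 scored as 0 (assumed convention)
macro_f1  <- mean(f1)

# Percentile bootstrap CI for accuracy: resample items with replacement.
set.seed(1)
correct  <- gold == pred
boot_acc <- replicate(2000, mean(sample(correct, replace = TRUE)))
ci       <- quantile(boot_acc, c(0.025, 0.975))
```

Macro-F1 weights every category equally, so it can diverge from accuracy when categories are unbalanced in the holdout split.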

Examples

if (FALSE) { # \dontrun{
# Two-category codebook for tone
cb <- codebook("tone", "one sentence",
  list(cb_category("positive", "Approving."),
       cb_category("negative", "Critical.")))
# Twenty labelled texts, ten per category
gold_data <- data.frame(
  text  = c(paste("clear benefit", 1:10), paste("serious harm", 1:10)),
  label = rep(c("positive", "negative"), each = 10))
# Gold set with a holdout split named "test"
g <- gold_set(gold_data, text = "text", label = "label",
              split = c(test = 1))
# Lock the protocol so its hash identifies the evaluated instrument
p <- protocol_lock(protocol(cb, LLMR::llm_config("groq", "openai/gpt-oss-20b")))
# Evaluate on the default (holdout) split
validate_protocol(p, g)
gold_ledger(g)   # the evaluation is on the record
} # }