Runs the one honest evaluation of a protocol. It requires a locked protocol (so that the validated instrument is identified by its hash) and runs that protocol over the test split. When the gold set was built with seal_test = TRUE, it also appends the event to the ledger, so every test-split evaluation that has ever happened appears in coding_report().

Usage

validate_protocol(protocol, gold, split = "test", .runner = NULL, ...)

Arguments

protocol

A locked protocol().

gold

A gold_set().

split

The split of the gold set to evaluate. Default "test".

.runner

Internal seam for tests: a function (experiments, ...) returning the experiments with a response_text column. Default LLMR::call_llm_par().

...

Passed to the runner (e.g. tries, progress).
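
As an illustration of the .runner seam described above, the following is a minimal sketch of a stub runner for offline tests. The name stub_runner, the shape of the experiments table, and the idea that a bare label is an acceptable response_text are all assumptions made for this example; only the contract "take (experiments, ...) and return the experiments with a response_text column" comes from the argument description.

```r
# Hypothetical stub runner for offline tests; not part of the package.
# Assumption: `experiments` is a data frame with one row per item to code.
stub_runner <- function(experiments, ...) {
  # Answer every item with the same label instead of calling a model.
  experiments$response_text <- rep("positive", nrow(experiments))
  experiments
}

# Stand-in for the table the function would hand to the runner (assumed shape).
experiments <- data.frame(id = 1:3)
out <- stub_runner(experiments)

# The seam's contract: same rows back, plus a response_text column.
stopifnot("response_text" %in% names(out), nrow(out) == nrow(experiments))
```

Under these assumptions, such a function could be supplied as validate_protocol(p, g, .runner = stub_runner) so that a test exercises the scoring path without any network call.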

Value

A protocol_validation: accuracy with bootstrap CI, macro-F1, parse failures, total tokens (when the runner reported them), per-category table, confusion matrix, the protocol hash, and the ledger position of this evaluation.
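
To make the headline metrics concrete, here is a rough base-R sketch of accuracy, a percentile bootstrap interval for it, and macro-F1 on toy labels. This is a generic illustration, not the package's implementation: the bootstrap variant, the number of resamples, and the handling of undefined F1 values are assumptions, and truth and pred are made-up vectors.

```r
# Toy gold labels and predictions (made up for this sketch).
truth <- c("positive", "positive", "negative", "negative")
pred  <- c("positive", "negative", "negative", "negative")
lv    <- c("positive", "negative")

# Confusion matrix with gold labels in rows and predictions in columns.
cm <- table(truth = factor(truth, lv), pred = factor(pred, lv))

# Accuracy: share of items on the diagonal (3 of 4 here).
accuracy <- sum(diag(cm)) / sum(cm)

# Per-category precision, recall, and F1; macro-F1 is their unweighted mean.
precision <- diag(cm) / colSums(cm)
recall    <- diag(cm) / rowSums(cm)
f1        <- 2 * precision * recall / (precision + recall)
f1[is.na(f1)] <- 0  # assumption: score undefined F1 as 0
macro_f1  <- mean(f1)

# Percentile bootstrap for accuracy (assumed method and resample count).
set.seed(1)
correct <- truth == pred
boot    <- replicate(2000, mean(sample(correct, replace = TRUE)))
ci      <- quantile(boot, c(0.025, 0.975))
```

Macro-F1 weights every category equally, so a rare category that is coded poorly lowers it even when overall accuracy stays high.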

Examples

if (FALSE) { # \dontrun{
# A two-category codebook for tone.
cb <- codebook("tone", "one sentence",
  list(cb_category("positive", "Approving."),
       cb_category("negative", "Critical.")))
# Twenty labelled toy texts, ten per category.
gold_data <- data.frame(
  text  = c(paste("clear benefit", 1:10), paste("serious harm", 1:10)),
  label = rep(c("positive", "negative"), each = 10))
# Build the gold set with the test split as its only split.
g <- gold_set(gold_data, text = "text", labels = "label",
              split = c(test = 1))
# Lock the protocol, then evaluate it on the test split.
p <- protocol_lock(protocol(cb, LLMR::llm_config("groq", "openai/gpt-oss-20b")))
validate_protocol(p, g)
gold_ledger(g)   # the evaluation is on the record
} # }