The one honest evaluation. Requires a locked protocol (so the validated
instrument is hash-identified) and runs it over the test split. When
the gold set was built with seal_test = TRUE, it also appends the event
to the ledger, so every test-split evaluation that ever happened appears
in coding_report().
Arguments
- protocol
  A locked protocol().
- gold
  A gold_set().
- split
  Default "test".
- .runner
  Internal seam for tests: a function (experiments, ...) returning the
  experiments with a response_text column. Default LLMR::call_llm_par().
- ...
  Passed to the runner (e.g. tries, progress).
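The .runner contract described above can be illustrated with a stub. This is a minimal sketch, not the package's own test helper: it assumes experiments is a data frame, and the name stub_runner, the id column, and the canned answer are all hypothetical.

```r
# Hypothetical stub runner (illustration only): takes the experiments,
# ignores any extra arguments, and returns the same rows with a
# response_text column added, as the .runner contract requires.
stub_runner <- function(experiments, ...) {
  experiments$response_text <- rep("positive", nrow(experiments))
  experiments
}

# Assumed shape of the input: one row per call to be made.
experiments <- data.frame(id = 1:3)
out <- stub_runner(experiments, tries = 1, progress = FALSE)
```

A stub of this shape could be passed as .runner so that the evaluation logic is exercised without any model call.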
Value
A protocol_validation: accuracy with bootstrap CI, macro-F1,
parse failures, total tokens (when the runner reported them),
per-category table, confusion matrix, the protocol hash, and the ledger
position of this evaluation.
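To make the reported metrics concrete, here is a base-R sketch of accuracy with a bootstrap interval, a confusion matrix, and macro-F1. It is not the package's implementation: the percentile bootstrap, the 2000 resamples, and the toy labels are assumptions chosen for illustration.

```r
set.seed(1)

# Toy gold labels and predictions with two deliberate errors.
gold <- rep(c("positive", "negative"), each = 10)
pred <- gold
pred[c(3, 14)] <- c("negative", "positive")

# Accuracy: share of items whose predicted label matches the gold label.
accuracy <- mean(pred == gold)

# Percentile bootstrap CI (assumed method): resample items with
# replacement and recompute accuracy on each resample.
boot <- replicate(2000, {
  i <- sample(seq_along(gold), replace = TRUE)
  mean(pred[i] == gold[i])
})
ci <- unname(quantile(boot, c(0.025, 0.975)))

# Confusion matrix: rows are gold labels, columns are predictions.
lv <- sort(unique(gold))
cm <- table(gold = factor(gold, lv), pred = factor(pred, lv))

# Per-category F1, then the unweighted mean across categories (macro-F1).
# Note: a category that is never predicted would yield NaN here.
f1 <- sapply(lv, function(k) {
  tp <- cm[k, k]
  precision <- tp / sum(cm[, k])
  recall <- tp / sum(cm[k, ])
  2 * precision * recall / (precision + recall)
})
macro_f1 <- mean(f1)
```

Macro-F1 weights every category equally, so a rare category that is coded badly lowers it more than it lowers accuracy.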
Examples
if (FALSE) { # \dontrun{
cb <- codebook("tone", "one sentence",
  list(cb_category("positive", "Approving."),
       cb_category("negative", "Critical.")))
gold_data <- data.frame(
  text = c(paste("clear benefit", 1:10), paste("serious harm", 1:10)),
  label = rep(c("positive", "negative"), each = 10))
g <- gold_set(gold_data, text = "text", labels = "label",
  split = c(test = 1))
p <- protocol_lock(protocol(cb, LLMR::llm_config("groq", "openai/gpt-oss-20b")))
validate_protocol(p, g)
gold_ledger(g) # the evaluation is on the record
} # }