Calibrate silicon responses against a human benchmark

Compares the panel's response marginals, item by item, with human benchmark marginals you supply (from ANES, GSS, Pew, or your own fielded study). Calibration here is diagnostic: it reports each deviation from the benchmark as found and does not adjust the underlying estimates. The comparison is restricted to items the benchmark actually covers. When only some items have a benchmark, coverage is partial and the print banner says so: a benchmark touching one of five items yields PARTIALLY CALIBRATED (1/5). Nonresponse (parse failures, refusals) is recorded per item alongside the deviations, because shares computed only over valid responses make an instrument look better than it is when the model often refuses to answer.
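The comparison described above can be sketched in base R. This is an illustration of the idea, not the package's implementation: the toy data, the object names (silicon, tab, banner), and the choice to compute silicon shares over valid responses only are assumptions made for the example, and the banner string only imitates the format quoted above.

```r
# Toy silicon responses, one row per administered answer.
# NA marks nonresponse (a parse failure or refusal). All names are illustrative.
silicon <- data.frame(
  item_id  = c(rep("plan", 6), rep("trust", 4)),
  response = c("A", "A", "A", "B", NA, "B", "yes", "no", NA, "yes"),
  stringsAsFactors = FALSE
)

# Human benchmark covering only the "plan" item.
bench <- data.frame(
  item_id     = c("plan", "plan"),
  response    = c("A", "B"),
  share_human = c(0.5, 0.5),
  stringsAsFactors = FALSE
)

# Nonresponse rate per item, over all administered answers.
nonresponse <- tapply(is.na(silicon$response), silicon$item_id, mean)

# Silicon shares per item, computed over valid answers only (assumption).
valid  <- silicon[!is.na(silicon$response), ]
counts <- as.data.frame(
  table(item_id = valid$item_id, response = valid$response),
  stringsAsFactors = FALSE
)
counts$share_silicon <- counts$Freq / ave(counts$Freq, counts$item_id, FUN = sum)

# The inner join keeps only item/response pairs the benchmark covers.
tab <- merge(counts[, c("item_id", "response", "share_silicon")], bench,
             by = c("item_id", "response"))
tab$deviation <- tab$share_silicon - tab$share_human

# Coverage: how many administered items the benchmark touches.
items_covered <- length(unique(tab$item_id))
items_total   <- length(unique(silicon$item_id))
banner <- if (items_covered < items_total) {
  sprintf("PARTIALLY CALIBRATED (%d/%d)", items_covered, items_total)
} else {
  "CALIBRATED"  # hypothetical label for full coverage
}

print(tab)
print(nonresponse)
print(banner)
```

In this toy case the "trust" item drops out of the table because the benchmark does not cover it, yet its nonresponse rate is still reported, which is the point of recording nonresponse alongside the deviations.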

Usage

panel_calibrate(responses, benchmark, benchmark_name = "benchmark")

Arguments

responses: A panel_administer() result.
benchmark: A data frame with columns item_id, response, and share (human marginal proportions). Shares within an item should sum to 1; a sum that departs from 1 by more than rounding error draws a warning.
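Before calling panel_calibrate(), it can help to check the benchmark yourself. The sketch below sums shares within each item and flags items whose total is off. The tolerance of 0.01 is an assumption chosen for the example; the threshold the function actually uses for "beyond rounding" is not stated here.

```r
# Toy benchmark: "plan" sums to 1, "trust" sums to 0.97.
bench <- data.frame(
  item_id  = c("plan", "plan", "trust", "trust"),
  response = c("A", "B", "yes", "no"),
  share    = c(0.5, 0.5, 0.62, 0.35),
  stringsAsFactors = FALSE
)

# Sum of shares within each item, as a named numeric vector.
totals <- vapply(split(bench$share, bench$item_id), sum, numeric(1))

# Assumed tolerance for rounding error; not taken from the package.
tol <- 0.01
off <- names(totals)[abs(totals - 1) > tol]

if (length(off) > 0) {
  message("Shares do not sum to 1 for item(s): ", paste(off, collapse = ", "))
}
```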
benchmark_name: How the source should be cited in reports (e.g. "ANES 2024 pilot").

Value

responses with the calibration attribute set: $table (per covered item and response: share_silicon, share_human, deviation), $nonresponse (per item), $items_covered / $items_total, $mad, $max_dev.
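The two summary fields are not defined on this page. One plausible reading, shown below as an assumption, is that $mad is the mean absolute deviation across rows of the table and $max_dev is the largest absolute deviation. Base R's mad() computes a scaled median absolute deviation, which is a different quantity, so the sketch uses mean(abs(...)) explicitly. The table is toy data with the column names listed above.

```r
# Toy deviation table with the columns named in the Value section.
tab <- data.frame(
  item_id       = c("plan", "plan"),
  response      = c("A", "B"),
  share_silicon = c(0.6, 0.4),
  share_human   = c(0.5, 0.5),
  stringsAsFactors = FALSE
)
tab$deviation <- tab$share_silicon - tab$share_human

# Assumed definitions; the package may compute these differently.
mad_assumed     <- mean(abs(tab$deviation))  # mean absolute deviation
max_dev_assumed <- max(abs(tab$deviation))   # largest absolute deviation

print(c(mad = mad_assumed, max_dev = max_dev_assumed))
```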

Examples

if (FALSE) { # \dontrun{
set.seed(110)
# A 12-member panel from party margins, and a one-item instrument.
panel <- panel_from_margins(list(party = c(left = .5, right = .5)), n = 12)
instr <- panel_instrument(item_choice("plan", "Which plan do you prefer?",
                                      c("A", "B")))
cfg <- LLMR::llm_config("groq", "openai/gpt-oss-20b")
r <- panel_administer(panel, instr, cfg)
r   # UNCALIBRATED banner
# Human marginals for the "plan" item; shares sum to 1.
bench <- data.frame(item_id = "plan", response = c("A", "B"),
                    share = c(.5, .5))
panel_calibrate(r, bench, "toy human study")
} # }