LLM labels are error-prone measurements, and the error is rarely
symmetric, so the naive share of a category in the coded corpus is
biased. gold_correct() estimates corrected corpus-level category
prevalences by combining the full coded corpus with the gold set's test
split, and reports a standard error for each.
Arguments
- coded
A code_corpus() result.
- gold
A gold_set() whose test split is an audited subsample of the coded corpus (linked by a shared id when present, otherwise by a content hash of the text; see Details).
- conf
Confidence level for the normal-approximation intervals.
Value
A gold_correction object: a list with table (category,
share_naive, share_corrected, se, ci_lo, ci_hi), n_corpus,
n_parse_failures, n_audit, n_audit_parse_failures,
accuracy_audit, protocol_hash, protocol_label, conf, link_by
(how audit units were linked to the corpus, "id" or "text hash"),
and sealed (whether the gold set's test split was sealed), with a
print method.
Details
For category c, the estimator is the corpus mean of
1(llm label = c) plus the audit mean of
1(gold label = c) - 1(llm label = c) – the survey-sampling
difference estimator, which is also what prediction-powered inference
reduces to for proportions. Its estimated variance is
(1 - n/N) * S2_d / n, where d is the audit difference, S2_d the sample
variance of d over the audit pairs, n the number of audit pairs, and N the
number of parsed corpus labels; the corpus term carries no sampling error
because the corpus itself is the estimand's population.
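The estimator and its variance can be sketched in base R. This is an illustration on simulated labels, not the package's implementation: the helper difference_estimate() and all data objects are hypothetical, and the audit is assumed to be a simple random subsample of the corpus.

```r
# Hypothetical toy data; none of these objects come from the package.
set.seed(1)
cats <- c("positive", "negative")
N <- 1000                                   # parsed corpus labels
llm_corpus <- sample(cats, N, replace = TRUE, prob = c(0.6, 0.4))

# Audit: a simple random subsample of the corpus, with simulated gold labels
# in which the model over-assigns "positive" (an asymmetric error).
n <- 100
audit_idx <- sample.int(N, n)
llm_audit <- llm_corpus[audit_idx]
gold_audit <- llm_audit
flip <- llm_audit == "positive" & runif(n) < 0.2
gold_audit[flip] <- "negative"

# Difference estimator for one category, with a normal-approximation interval.
difference_estimate <- function(category, llm_corpus, llm_audit, gold_audit,
                                conf = 0.95) {
  N <- length(llm_corpus)
  n <- length(llm_audit)
  naive <- mean(llm_corpus == category)                      # corpus mean of 1(llm = c)
  d <- (gold_audit == category) - (llm_audit == category)    # audit difference
  corrected <- naive + mean(d)
  se <- sqrt((1 - n / N) * var(d) / n)                       # (1 - n/N) * S2_d / n
  z <- qnorm(1 - (1 - conf) / 2)
  data.frame(category = category, share_naive = naive,
             share_corrected = corrected, se = se,
             ci_lo = corrected - z * se, ci_hi = corrected + z * se)
}

est <- do.call(rbind, lapply(cats, difference_estimate,
                             llm_corpus = llm_corpus,
                             llm_audit = llm_audit,
                             gold_audit = gold_audit))
print(est)
```

Because every unit carries exactly one gold label and one model label, the audit differences sum to zero across categories, so the corrected shares still sum to one.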
Only the test split is used. The dev split tuned the protocol, so dev error rates are optimistic, and a correction built on them inherits the optimism.
The estimand conditions on parse success: corpus rows whose label is
NA are excluded from the shares and counted in n_parse_failures; matched
audit units whose corpus label is NA are likewise excluded and counted in
n_audit_parse_failures.
Two assumptions do the inferential work: the audited units are a random subsample of the corpus (or of the population the corpus represents), and the corpus labels and the audit-pair labels come from the same locked protocol – which the linkage guarantees here, because the audit pairs take their model labels from the coded corpus itself.
Linkage uses a shared id column when both gold_set() and
code_corpus() were given one; this is the only key that can tell apart
units with identical text. Absent an id, units are linked by a content hash
of the (whitespace-normalized) text, and audited units whose text is
duplicated in the corpus are refused rather than matched to an arbitrary
row – supply an id to handle genuine duplicates.
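The text-based linkage rule can be sketched as follows. This is an assumption-laden illustration: normalize_text() and link_by_text() are hypothetical helpers, the exact normalization the package applies is not documented here, and the normalized string stands in for the content hash.

```r
# Hypothetical normalization: collapse whitespace runs and trim the ends.
normalize_text <- function(x) trimws(gsub("\\s+", " ", x))

# Link audited units to corpus rows by normalized text, refusing any audited
# unit whose text occurs more than once in the corpus.
link_by_text <- function(audit_text, corpus_text) {
  a_key <- normalize_text(audit_text)
  c_key <- normalize_text(corpus_text)
  dup_keys <- unique(c_key[duplicated(c_key)])
  ambiguous <- a_key %in% dup_keys
  if (any(ambiguous)) {
    stop(sprintf("%d audited unit(s) have text duplicated in the corpus; supply an id.",
                 sum(ambiguous)))
  }
  match(a_key, c_key)   # row index in the corpus, NA when there is no match
}

corpus_text <- c("clear  benefit 1", "serious harm 1", "a hopeful note")
idx <- link_by_text(c("clear benefit 1", "a hopeful note"), corpus_text)

# A duplicated corpus text makes the audited unit ambiguous, so it is refused.
dup_corpus <- c(corpus_text, "serious harm 1")
res <- tryCatch(link_by_text("serious  harm 1", dup_corpus),
                error = function(e) conditionMessage(e))
```

Refusing rather than matching the first hit keeps the audit pair from silently taking its model label from the wrong row.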
Corrected shares are not clamped to [0, 1]; a value outside the unit
interval is a signal that the audit correction is noisy, and the print
method says so when it happens.
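A small constructed case shows how an unclamped share can leave the unit interval. The data are hypothetical and chosen to make the effect obvious: a category that is rare in the corpus and always mislabeled in a small audit.

```r
# Hypothetical labels: "rare" makes up 2% of the corpus according to the model.
llm_corpus <- rep(c("rare", "common"), times = c(2, 98))

# A 10-unit audit in which both "rare" model labels turn out to be wrong.
llm_audit  <- c("rare", "rare", rep("common", 8))
gold_audit <- rep("common", 10)

naive <- mean(llm_corpus == "rare")                       # 0.02
d <- (gold_audit == "rare") - (llm_audit == "rare")       # -1 for two units, else 0
corrected <- naive + mean(d)                              # 0.02 - 0.2
corrected
```

The negative value reflects a correction estimated from too few audit pairs for a rare category, not a meaningful prevalence.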
Using test-split truth is a look at the test split, so when the gold
set is sealed the event is appended to the ledger and appears in
coding_report() like any other evaluation.
References
Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic (2023). "Prediction-Powered Inference." Science 382(6671), 669-674.
Egami, Hinck, Stewart, and Wei (2023). "Using Imperfect Surrogates for Downstream Inference." Advances in Neural Information Processing Systems 36.
Cochran (1977). Sampling Techniques, 3rd edition, on the difference estimator.
Examples
if (FALSE) { # \dontrun{
# Two-category codebook for stance coding.
cb <- codebook("stance", "one text",
               list(cb_category("positive", "Approving."),
                    cb_category("negative", "Critical.")))

# Human-labeled units; the whole gold set goes into the test split.
gold_data <- data.frame(
  text  = c(paste("clear benefit", 1:10), paste("serious harm", 1:10)),
  label = rep(c("positive", "negative"), each = 10))
gold <- gold_set(gold_data, text = "text", labels = "label",
                 split = c(test = 1))

# The corpus contains the audited texts plus two unaudited ones.
corpus <- data.frame(text = c(gold_data$text,
                              "a hopeful note", "an alarming figure"))

# Code the corpus under a locked protocol, then correct the naive shares.
cfg <- LLMR::llm_config("groq", "openai/gpt-oss-20b", temperature = 0)
coded <- code_corpus(corpus, protocol_lock(protocol(cb, cfg)), "text")
gold_correct(coded, gold)
} # }
