
LLM labels are error-prone measurements, and the error is rarely symmetric, so the naive share of a category in the coded corpus is biased. gold_correct() estimates corrected corpus-level category prevalences by combining the full coded corpus with the gold set's test split, and reports a standard error for each.

Usage

gold_correct(coded, gold, conf = 0.95)

Arguments

coded

A code_corpus() result.

gold

A gold_set() whose test split is an audited subsample of the coded corpus (linked by a shared id when present, otherwise by a content hash of the text; see Details).

conf

Confidence level for the normal-approximation intervals.

Value

A gold_correction object: a list with table (category, share_naive, share_corrected, se, ci_lo, ci_hi), n_corpus, n_parse_failures, n_audit, n_audit_parse_failures, accuracy_audit, protocol_hash, protocol_label, conf, link_by (how audit units were linked to the corpus, "id" or "text hash"), and sealed (whether the gold set's test split was sealed), with a print method.

Details

For category c, the estimator is the corpus mean of 1(llm label = c) plus the audit mean of d = 1(gold label = c) - 1(llm label = c). This is the survey-sampling difference estimator, which is also what prediction-powered inference reduces to for proportions. Its estimated variance is (1 - n/N) * S2_d / n, where S2_d is the sample variance of the audit differences d, n the number of audit pairs, and N the number of parsed corpus labels; the corpus term carries no sampling error because the corpus itself is the estimand's population.
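The arithmetic above can be sketched in base R. This is a minimal illustration and not the package's implementation: diff_estimate() is a hypothetical helper, the toy labels are invented, and taking S2_d to be var() (with its n - 1 denominator) is an assumption.

```r
# Hypothetical helper (not part of the package): difference estimator for one
# category, following the formula in Details.
diff_estimate <- function(llm_corpus, llm_audit, gold_audit, category,
                          conf = 0.95) {
  parsed <- !is.na(llm_corpus)          # condition on parse success
  N <- sum(parsed)                      # number of parsed corpus labels
  ok <- !is.na(llm_audit)               # drop audit pairs with no model label
  llm_audit <- llm_audit[ok]
  gold_audit <- gold_audit[ok]
  n <- length(llm_audit)                # number of audit pairs

  share_naive <- mean(llm_corpus[parsed] == category)
  d <- (gold_audit == category) - (llm_audit == category)  # audit difference
  share_corrected <- share_naive + mean(d)

  # Assumption: S2_d is the usual sample variance (n - 1 denominator).
  se <- sqrt((1 - n / N) * var(d) / n)
  z <- qnorm(1 - (1 - conf) / 2)        # normal-approximation quantile
  c(share_naive = share_naive, share_corrected = share_corrected, se = se,
    ci_lo = share_corrected - z * se, ci_hi = share_corrected + z * se)
}

# Invented toy data: 100 coded units, 20 of them audited.
llm_corpus <- c(rep("positive", 60), rep("negative", 40))
llm_audit  <- c(rep("positive", 12), rep("negative", 8))
gold_audit <- c(rep("positive", 9),  rep("negative", 11))

est <- diff_estimate(llm_corpus, llm_audit, gold_audit, "positive")
print(round(est, 3))
```

In this toy audit the model over-labels "positive" in 3 of 20 pairs, so the naive share of 0.60 is pulled down by 0.15 to a corrected share of 0.45.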

Only the test split is used. The dev split tuned the protocol, so dev error rates are optimistic, and a correction built on them inherits the optimism.

The estimand conditions on parse success: corpus rows whose label is NA are excluded from the shares and counted in n_parse_failures, and matched audit units whose corpus label is NA are likewise excluded and counted in n_audit_parse_failures.

Two assumptions do the inferential work: the audited units are a random subsample of the corpus (or of the population the corpus represents), and the corpus labels and the audit-pair labels come from the same locked protocol – which the linkage guarantees here, because the audit pairs take their model labels from the coded corpus itself.

Linkage uses a shared id column when both gold_set() and code_corpus() were given one; this is the only key that can tell apart units with identical text. Absent an id, units are linked by a content hash of the (whitespace-normalized) text, and audited units whose text is duplicated in the corpus are refused rather than matched to an arbitrary row – supply an id to handle genuine duplicates.
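The text-based fallback can be pictured with a small base R sketch. It is an illustration under stated assumptions, not the package's code: normalize_text() and link_by_text() are hypothetical names, the exact whitespace normalization is a guess, and the normalized string stands in for the content hash.

```r
# Hypothetical normalization: trim the ends and collapse runs of whitespace.
normalize_text <- function(x) gsub("\\s+", " ", trimws(x))

# Hypothetical linkage: return the corpus row for each audited unit, refusing
# any audited unit whose normalized text occurs more than once in the corpus.
link_by_text <- function(audit_text, corpus_text) {
  a <- normalize_text(audit_text)
  k <- normalize_text(corpus_text)
  dup_keys <- unique(k[duplicated(k)])   # keys shared by several corpus rows
  ambiguous <- a %in% dup_keys
  if (any(ambiguous)) {
    stop(sum(ambiguous),
         " audited unit(s) match duplicated corpus text; supply an id")
  }
  match(a, k)                            # NA where a unit is not in the corpus
}

corpus_text <- c("clear benefit 1", "serious  harm 2", "a hopeful note")
idx <- link_by_text(c(" serious harm 2", "clear benefit 1"), corpus_text)

# A duplicated corpus text is refused instead of being matched arbitrarily.
dup_err <- tryCatch(
  link_by_text("same text", c("same text", "same  text")),
  error = function(e) conditionMessage(e)
)
```

Refusing ambiguous matches keeps each audit pair tied to exactly one corpus label, which is what the difference estimator needs.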

Corrected shares are not clamped to [0, 1]; a value outside the unit interval is a signal that the audit correction is noisy, and the print method says so when it happens.
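A small invented example shows how an unclamped share can leave the unit interval when the category is rare and the audit is small. The data and variable names are hypothetical and the arithmetic follows the estimator described in Details.

```r
# Invented toy data: the model labels 2 of 100 corpus units "positive".
llm_corpus <- c(rep("positive", 2), rep("negative", 98))

# A 10-unit audit happens to contain one model "positive" that gold rejects.
llm_audit  <- c("positive", rep("negative", 9))
gold_audit <- rep("negative", 10)

d <- (gold_audit == "positive") - (llm_audit == "positive")  # audit difference
share_naive     <- mean(llm_corpus == "positive")
share_corrected <- share_naive + mean(d)   # left unclamped

# Flag the noisy correction instead of silently truncating it.
outside <- share_corrected < 0 || share_corrected > 1
```

Here a single disagreement in ten audit pairs outweighs the naive share, so the corrected value falls below zero; a larger audit would be the usual remedy.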

Using test-split truth is a look at the test split, so when the gold set is sealed the event is appended to the ledger and appears in coding_report() like any other evaluation.

References

Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic (2023). "Prediction-Powered Inference." Science 382(6671), 669-674.

Egami, Hinck, Stewart, and Wei (2023). "Using Imperfect Surrogates for Downstream Inference." Advances in Neural Information Processing Systems 36.

Cochran (1977). Sampling Techniques, 3rd edition, on the difference estimator.

Examples

if (FALSE) { # \dontrun{
# Two-category stance codebook.
cb <- codebook("stance", "one text",
  list(cb_category("positive", "Approving."),
       cb_category("negative", "Critical.")))
# Twenty hand-labelled texts, ten per category.
gold_data <- data.frame(
  text  = c(paste("clear benefit", 1:10), paste("serious harm", 1:10)),
  label = rep(c("positive", "negative"), each = 10))
# Put every gold unit in the test split.
gold <- gold_set(gold_data, text = "text", labels = "label",
                 split = c(test = 1))
# The corpus contains the audited texts plus two unaudited ones.
corpus <- data.frame(text = c(gold_data$text,
                              "a hopeful note", "an alarming figure"))
cfg <- LLMR::llm_config("groq", "openai/gpt-oss-20b", temperature = 0)
# Code the corpus under a locked protocol, then correct the naive shares.
coded <- code_corpus(corpus, protocol_lock(protocol(cb, cfg)), "text")
gold_correct(coded, gold)
} # }