Build a gold-standard set with split provenance and a sealed test split

Usage

gold_set(
  data,
  text,
  labels,
  split = c(dev = 0.6, test = 0.4),
  stratify = TRUE,
  seal_test = TRUE,
  coders = NULL,
  id = NULL
)

Arguments

data

A data frame of human-labeled units. Gold labels must be complete: a missing label is not gold, and construction fails on NAs.
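A pre-check along these lines can locate incomplete rows before construction. This is a base-R sketch only; the data frame `units` and its column names are made up for illustration, and dropping rows is just one way to handle them (sending them back for adjudication is another).

```r
# Illustrative data: one unit has no adjudicated label.
units <- data.frame(
  text  = c("a", "b", "c"),
  label = c("x", NA, "y"),
  stringsAsFactors = FALSE
)

# Row positions whose label is missing.
missing_rows <- which(is.na(units$label))
missing_rows

# Keep only fully labeled units before building the gold set.
complete <- units[!is.na(units$label), , drop = FALSE]
nrow(complete)
```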

text

Name of the text column (character scalar).

labels

Name of the (adjudicated) label column.

split

Named numeric vector of proportions, e.g. c(dev = 0.6, test = 0.4); must sum to 1. Split sizes are allocated by the largest-remainder method, so they add up exactly to the number of units with no rounding loss. Assignment of units to splits is random within that allocation: call set.seed() beforehand for a repeatable draw, and keep the saved object either way, since the split is stored, not recomputed.
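The largest-remainder allocation described above can be sketched in base R as follows. The helper `largest_remainder()` is hypothetical and written for illustration; it is not the package's internal code, and the package may break ties differently.

```r
# Hypothetical helper: turn proportions into integer sizes that sum to n.
largest_remainder <- function(n, props) {
  stopifnot(isTRUE(all.equal(sum(props), 1)))
  quota <- n * props          # ideal, possibly fractional, sizes
  size  <- floor(quota)       # everyone gets the integer part first
  left  <- n - sum(size)      # units still unallocated
  if (left > 0) {
    # hand the remaining units to the largest fractional parts
    top <- order(quota - size, decreasing = TRUE)[seq_len(left)]
    size[top] <- size[top] + 1
  }
  size
}

sizes <- largest_remainder(7, c(dev = 0.6, test = 0.4))
sizes   # dev 4, test 3 (quotas 4.2 and 2.8)

# Random assignment within the allocation: shuffle the split names.
set.seed(1)
assignment <- sample(rep(names(sizes), sizes))
table(assignment)
```

Because the sizes are fixed before shuffling, the seed changes which units land in each split but never how many.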

stratify

If TRUE (default), the split is stratified by the label column, so dev and test carry the same class composition, up to rounding within each class. This is the standard methodological default for evaluation splits.
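One way to picture stratification is to run the allocation separately inside each label class and then recombine. The sketch below uses base R only; `assign_within()` is a hypothetical helper, and the actual implementation may differ in detail.

```r
# Hypothetical helper: allocate split sizes for n units of one class
# (largest remainder), then shuffle the split names.
assign_within <- function(n, props) {
  quota <- n * props
  size  <- floor(quota)
  extra <- order(quota - size, decreasing = TRUE)[seq_len(n - sum(size))]
  size[extra] <- size[extra] + 1
  sample(rep(names(props), size))
}

# Imbalanced illustrative labels: 30 of class "x", 10 of class "y".
labels <- rep(c("x", "y"), times = c(30, 10))
props  <- c(dev = 0.6, test = 0.4)

set.seed(110)
by_class   <- split(labels, labels)
drawn      <- lapply(by_class, function(g) assign_within(length(g), props))
assignment <- unsplit(drawn, labels)   # back in the original row order

counts <- table(label = labels, split = assignment)
counts
```

Each class is divided 60/40 on its own, so both splits end up three-quarters "x" and one-quarter "y", matching the full data.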

seal_test

If TRUE (default), the test split is sealed: every validate_protocol() run against it is recorded in the ledger and printed by coding_report(). The seal is visibility, not enforcement; save the gold set with the study and archive the LLM call log (see the archive workflow) when tamper evidence is needed.

coders

Optional character vector naming columns holding individual coder labels (pre-adjudication), used by coder_agreement().

id

Optional name of a column holding a stable unit identifier. When supplied, gold_correct() links audit units to the coded corpus by this id, which is the only way to disambiguate units that share identical text. The same column must be present in the corpus passed to code_corpus(). When omitted, linkage falls back to a content hash of the text, and duplicate texts among audited units are refused rather than matched arbitrarily.
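The reason duplicate texts are refused can be shown with a small base-R sketch. `link_key()` is a hypothetical helper, and the raw text stands in for the content hash here; the point is only that a key shared by two units cannot be matched unambiguously.

```r
# Hypothetical helper: choose the linkage key and refuse ambiguous ones.
link_key <- function(units, id = NULL, text = "text") {
  key <- if (is.null(id)) units[[text]] else units[[id]]
  dup <- unique(key[duplicated(key)])
  if (length(dup) > 0) {
    stop("ambiguous linkage, duplicated key(s): ", paste(dup, collapse = ", "))
  }
  key
}

# Illustrative audit units: two share identical text.
audit <- data.frame(
  uid  = c("u1", "u2", "u3"),
  text = c("no comment", "no comment", "see attached"),
  stringsAsFactors = FALSE
)

keys <- link_key(audit, id = "uid")   # unique ids, so linkage is unambiguous
keys

# Without an id, the repeated text makes the key ambiguous and linkage stops.
res <- tryCatch(link_key(audit), error = function(e) conditionMessage(e))
res
```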

Value

A gold_set: the data plus split assignment, seal status, and an evaluation ledger.

Examples

set.seed(110)   # the split assignment draws locally
g <- gold_set(
  data.frame(text  = paste0("unit", seq_len(40)),
             label = rep(c("x", "y"), each = 20)),
  text = "text", labels = "label", split = c(dev = 0.5, test = 0.5)
)
g
table(gold_split(g, "dev")$label)   # stratified: same class mix as test
gold_ledger(g)   # empty until something evaluates on the test split