This vignette is the package design document. The literature on silicon sampling is genuinely split. Persona conditioned models reproduce some human marginals strikingly well and fail on others silently, through overly uniform variances, sanitized refusals, order effects, and sensitivity to persona phrasing. Both findings are real. The design conclusion is not “never simulate” but “never simulate uncalibrated”, and the package encodes that conclusion in its objects rather than in a warning paragraph. Every result prints an UNCALIBRATED banner until a benchmark comparison fills the calibration slot, and partial benchmarks earn only a PARTIALLY CALIBRATED one.
The chunks below are live model calls when RUN_LIVE is
set to TRUE. They use real LLMR::llm_config()
objects on the LLMR provider layer and call execution functions without
a runner argument. With the switch left as FALSE, the
document builds without keys or model spend, while the prose explains
what the live run is meant to show.
What silicon panels are for
Instrument pretesting. Confusing items, broken branching, and attribute levels nobody distinguishes can be caught before any human respondent’s time is spent.
Design piloting. A conjoint or vignette design can be exercised across many personas so the analyst can inspect response dispersion and price the later human study.
Model measurement. How a model answers attitude items under varied personas is a finding about the model. Bias audits are research objects, not merely quality checks.
Estimating human quantities requires calibration against a human
benchmark before any human-population interpretation is warranted.
panel_calibrate() performs that comparison item by item;
the uncalibrated banner stays in place until the calibration is
supplied.
A walk through the machinery
panel_from_margins() draws a silicon panel from marginal
targets. The seed is set here because the panel draw is local random
generation that the vignette should reproduce.
set.seed(110)
panel = panel_from_margins(
  list(
    age = c("18 to 34" = .30, "35 to 64" = .45, "65 plus" = .25),
    party = c(left = .45, right = .45, independent = .10)
  ),
  n = 12,
  persona_template = "A {age} year old voter who leans {party}."
)
panel
instr = panel_instrument(list(
  item_likert("wk4", "A four day work week would benefit society."),
  item_choice(
    "fund",
    "Which should the city fund first?",
    c("public transit", "road repair")
  ),
  item_open("why", "In one sentence, why?")
))
instr
Margins are useful when targets are published as tables. When
microdata is available, panel_from_data() is the joint
distribution counterpart. It draws personas from observed rows and
therefore preserves relationships among attributes rather than sampling
each margin independently.
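The difference between margin sampling and row sampling can be illustrated in base R. This is a minimal sketch on made-up microdata, not the internals of panel_from_margins() or panel_from_data(); the data frame `micro` and the objects `indep` and `joint` are hypothetical names introduced only for this illustration.

```r
# Hypothetical microdata in which age and party are perfectly related.
set.seed(1)
micro <- data.frame(
  age   = rep(c("18 to 34", "65 plus"), each = 50),
  party = rep(c("left", "right"), each = 50)
)

n <- 200

# Independent draws from each margin: the age and party link is lost.
indep <- data.frame(
  age   = sample(micro$age, n, replace = TRUE),
  party = sample(micro$party, n, replace = TRUE)
)

# Row resampling: every persona is an observed combination.
joint <- micro[sample(nrow(micro), n, replace = TRUE), ]

table(indep$age, indep$party)  # all four cells are typically populated
table(joint$age, joint$party)  # only the two observed cells are populated
```

Both draws match the marginal shares in expectation. Only the row resample keeps young personas on the left and old personas on the right, which is the relationship the toy microdata contains.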
Administration randomizes item and option order per respondent and records what each one saw. With language models, option order is a treatment. The first configuration is the main pilot model. The second, drawn from a different model family, is useful when the question is whether a pattern is tied to one family or is visible across families.
cfg = LLMR::llm_config("groq", "openai/gpt-oss-20b", temperature = 0.8)
cfg_qwen = LLMR::llm_config("groq", "qwen/qwen3-32b", temperature = 0.8)
resp = panel_administer(panel, instr, cfg)
resp
panel_bias_audit(resp)
LLMR::diagnostics(resp)
resp_qwen = panel_administer(panel, instr, cfg_qwen)
panel_bias_audit(resp_qwen)
The first printed response object reports UNCALIBRATED. That is the intended default. The comparison model is not a human benchmark; it is a model measurement contrast.
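Why recorded order matters can be pictured with a small base R sketch. It assumes nothing about how panel_administer() stores its records; the `shown` data frame and its columns are hypothetical and exist only to illustrate the idea of drawing an order per respondent and keeping it.

```r
set.seed(2)
options <- c("public transit", "road repair")
respondents <- paste0("r", 1:6)

# Draw an independent option order per respondent and keep it in the record.
shown <- lapply(respondents, function(id) {
  data.frame(
    respondent = id,
    position   = seq_along(options),
    option     = sample(options),
    stringsAsFactors = FALSE
  )
})
shown <- do.call(rbind, shown)

# With the order recorded, position becomes an analyzable treatment:
# for example, how often each option was shown first.
table(shown$option[shown$position == 1])
```

Because each respondent's order is stored, a later analysis can compare choices by position instead of assuming that position had no effect.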
Calibration is comparison, not adjustment
Deviations are reported, not massaged away. The comparison is restricted to items the benchmark actually covers, and coverage is part of the verdict. The following toy benchmark is inline so the example is complete. It demonstrates the mechanics of calibration but is not evidence about a real population.
bench_fund = data.frame(
  item_id = rep("fund", 2),
  response = c("public transit", "road repair"),
  share = c(0.41, 0.59)
)
resp_partial = panel_calibrate(
  resp,
  bench_fund,
  benchmark_name = "toy city survey"
)
resp_partial
Because this benchmark touches only one of the closed items, the object earns only PARTIALLY CALIBRATED. Covering every closed item lets the object report the measured deviation for this toy benchmark in place of the uncalibrated banner.
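The logic of an item by item comparison with coverage can be sketched in base R. The deviation measure used here, total variation distance, is an assumption chosen for illustration and is not necessarily what panel_calibrate() reports; the simulated shares in `sim` are made up.

```r
# Hypothetical simulated shares for two closed items.
sim <- data.frame(
  item_id  = c("fund", "fund", "wk4", "wk4"),
  response = c("public transit", "road repair", "agree", "disagree"),
  share    = c(0.70, 0.30, 0.60, 0.40)
)
# A benchmark that covers only the "fund" item.
bench <- data.frame(
  item_id  = c("fund", "fund"),
  response = c("public transit", "road repair"),
  share    = c(0.41, 0.59)
)

# Compare only where the benchmark has something to say.
m <- merge(sim, bench, by = c("item_id", "response"),
           suffixes = c("_sim", "_bench"))
m$deviation <- m$share_sim - m$share_bench

# Total variation distance per covered item.
tvd <- tapply(abs(m$deviation), m$item_id, sum) / 2

# Coverage: share of simulated closed items the benchmark reaches.
coverage <- mean(unique(sim$item_id) %in% bench$item_id)
```

Here the merge silently drops the uncovered item, which is why coverage has to be computed and reported alongside the deviation and not left implicit.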
bench_all = rbind(
  bench_fund,
  data.frame(
    item_id = rep("wk4", 5),
    response = c(
      "strongly disagree",
      "disagree",
      "neutral",
      "agree",
      "strongly agree"
    ),
    share = c(.05, .20, .25, .35, .15)
  )
)
resp = panel_calibrate(
  resp,
  bench_all,
  benchmark_name = "toy city survey"
)
resp
panel_report(resp)
LLMR::report(resp)
attr(resp, "calibration")$nonresponse
Nonresponse is recorded per item. Shares computed only over valid answers can flatter an instrument the model often refuses, so the calibration object keeps the missingness visible.
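A small base R example shows how valid-only shares can mislead. The answer vector is hypothetical, and the sketch says nothing about the layout of the package's calibration object.

```r
# Hypothetical answers to one choice item; NA marks a refusal.
ans <- c(rep("public transit", 3), rep("road repair", 1), rep(NA, 6))

# Shares over valid answers only: looks like a clear 75/25 split.
valid_shares <- prop.table(table(ans))

# Shares with missingness kept visible: most respondents did not answer.
all_shares <- prop.table(table(ans, useNA = "ifany"))

nonresponse_rate <- mean(is.na(ans))
```

The valid-only table rests on four answers out of ten, a fact that disappears unless the nonresponse rate travels with the shares.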
Factorial stimuli
Factorial designs are local design objects. A vignette design expands a template over attribute levels.
vignette_design(
  "A {age} applicant with {experience} of experience applies.",
  list(
    age = c("younger", "older"),
    experience = c("5 years", "20 years")
  )
)
A conjoint design draws randomized tasks. The seed is set here because the design draw is local random generation.
set.seed(110)
design = conjoint_design(
  list(
    price = c("low", "high"),
    origin = c("domestic", "imported")
  ),
  n_tasks = 4
)
design
Profiles within a conjoint task are guaranteed distinct, with a
warning when the attribute space is too small to allow it.
conjoint_instrument() renders the design into one forced
choice item per task. After live administration, amce()
estimates average marginal component effects with respondent clustered
standard errors.
cj_instr = conjoint_instrument(design, "Which product would you buy?")
cj = panel_administer(panel, cj_instr, cfg)
amce(cj)
The AMCE table is a diagnostic of how this configured model responded to the randomized profiles. A price effect appears only if the live choices express such a preference; an origin effect appears only if the model distinguishes that attribute.
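For intuition about what an AMCE measures, the following base R sketch simulates a chooser with a known preference and recovers it as a difference in means. It is an illustration under simplifying assumptions, not the estimator inside amce(): the data are simulated, each row is treated as an independent profile, and the respondent clustered standard errors are omitted.

```r
set.seed(3)
n <- 400
# Hypothetical long-format conjoint data: one row per shown profile.
d <- data.frame(
  price  = sample(c("low", "high"), n, replace = TRUE),
  origin = sample(c("domestic", "imported"), n, replace = TRUE)
)
# Simulated chooser that prefers low prices and ignores origin.
d$chosen <- rbinom(n, 1, ifelse(d$price == "low", 0.7, 0.3))

# With independently randomized binary attributes, the AMCE of a level is
# the difference in mean choice probability against the baseline level.
amce_price <- mean(d$chosen[d$price == "low"]) -
  mean(d$chosen[d$price == "high"])
amce_origin <- mean(d$chosen[d$origin == "imported"]) -
  mean(d$chosen[d$origin == "domestic"])

# A linear probability model gives closely matching estimates.
fit <- lm(chosen ~ price + origin, data = d)
coef(fit)
```

The price estimate should land near the simulated 0.4 gap and the origin estimate near zero, which mirrors the reading given above: an effect shows up only where the choices actually express one.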
Pricing the human study
A silicon pilot prices the design before any human is recruited.
panel_power() reads dispersion from the pilot, such as the
standard deviation of a Likert score or the modal share of a choice, and
returns the per arm sample size a human study would need to detect a
specified effect. The number inherits the pilot’s calibration status.
With the toy benchmark above, it is a design calculation rather than
external validation.
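The kind of calculation involved can be sketched with the power functions in base R's stats package. This is an assumption about the general approach, not a description of panel_power()'s formula; the dispersion values `sd_wk4` and `p_fund` are hypothetical stand-ins for numbers a pilot would supply, and the 5% significance level and 80% power are conventional defaults chosen here.

```r
# Hypothetical pilot dispersion; in practice these come from the pilot.
sd_wk4 <- 1.1    # standard deviation of the Likert score
p_fund <- 0.55   # modal share of the choice item

# Per arm n to detect a 0.4 point shift on the Likert item.
n_likert <- power.t.test(delta = 0.4, sd = sd_wk4,
                         sig.level = 0.05, power = 0.80)$n

# Per arm n to detect a 0.15 shift in the modal share.
n_choice <- power.prop.test(p1 = p_fund, p2 = p_fund + 0.15,
                            sig.level = 0.05, power = 0.80)$n

# Round up: a study cannot recruit a fraction of a respondent.
ceiling(c(wk4 = n_likert, fund = n_choice))
```

The required sample grows with the square of the pilot's standard deviation, so an overly uniform model that understates dispersion would understate the price of the human study.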
panel_power(resp, effect = c(wk4 = 0.4, fund = 0.15))
Why open weights matter here specifically
A panel study makes one model call per persona and item, so the number of calls scales with panel size and instrument length, multiplied again by any order randomization and replicates. Open weight prices change what can be piloted. Volume stops being the primary constraint, and a panel pinned to an open checkpoint becomes a durable instrument. The same personas can answer the same items later, which no deprecated API model can promise.
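The scaling is simple arithmetic. In this sketch the panel size and item count match the toy example earlier in the vignette, while the number of order conditions and replicates are hypothetical values chosen only to show the multiplication.

```r
n_personas <- 12   # panel size used in the walk through
n_items    <- 3    # items in the toy instrument
n_orders   <- 2    # hypothetical: order conditions per item
n_reps     <- 5    # hypothetical: replicates per condition

# One call per persona and item, multiplied by orders and replicates.
n_calls <- n_personas * n_items * n_orders * n_reps
n_calls
```

Even this toy design needs 360 calls, and a realistic panel of a few hundred personas with a longer instrument multiplies that by orders of magnitude.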
Relations
LLMRpanel is the quantitative member of the family. LLMR supplies the provider abstraction used by the live calls above. LLMRcontent is the companion that audits persona phrasing and option order as treatments, codes open answers under a codebook and against gold text, and archives administrations. LLMRAgent addresses agent style experiments, while this package keeps attention on survey and experiment samples.
