FocusGroup Package Blueprint
This blueprint explains the architecture, logical flow, data model, and API surface of the FocusGroup R package. It is intended as a single reference for advanced users and contributors.
Overview
FocusGroup simulates and analyzes focus group discussions using LLM-backed agents. The core is built around R6 classes:
- FGAgent: individual participant/moderator with persona, communication style, and per-agent LLM config.
- FocusGroup: orchestrates the session across phases, manages conversation history, performs summaries and analyses, and renders visualizations.
- ConversationFlow: abstract turn-taking mechanism with concrete implementations: RoundRobinFlow, ProbabilisticFlow, DesireBasedFlow.
High-level wrappers (run_focus_group,
fg_quick) assemble agents, build a script, run the
simulation, and return structured outputs. Analysis helpers
(analyze_focus_group, fg_analyze_quick)
compute statistics, topics, TF-IDF, readability, and LLM-assisted
themes.
LLM integration is provided via the LLMR package. Prompts are modular
and customizable via get_default_prompt_templates().
Utilities handle placeholder substitution, prompt history formatting,
token estimation, and score parsing.
Architecture
Components and Responsibilities
- FGAgent (R6)
  - Encapsulates persona, style, demographics, survey responses, and an LLMR::llm_config.
  - Generates utterances from prompt templates.
  - Computes “desire to talk” scores for DesireBasedFlow.
  - Tracks per-agent token usage and utterance history.
- FocusGroup (R6)
  - Holds topic, purpose, named list of agents, moderator_id, turn_taking_flow, prompt templates, and a question script (phase guide).
  - Runs the simulation with a controlled loop, including moderator turns, participant-response rounds, interim summarization, and final summary.
  - Logs all messages with rich metadata to conversation_log.
  - Provides analysis methods and plotting utilities.
- ConversationFlow (R6 base) and subclasses
  - Defines interface: select_next_speaker(focus_group) and update_state_post_selection(...).
  - RoundRobinFlow: cycles through non-moderator participants.
  - ProbabilisticFlow: weighted random selection with recovery dynamics.
  - DesireBasedFlow: queries LLM(s) to score each participant’s desire to speak and selects the highest above a threshold, with fallbacks.
- Prompts and Utilities
  - get_default_prompt_templates(): returns named prompt templates for moderator phases, participant utterances, and helper analyses.
  - Placeholder utilities (replace_placeholders, replace_placeholders_known) populate dynamic context.
  - History utilities (format_conversation_history, make_prompt_history) generate concise context windows, optionally including an interim summary.
  - Token/score utilities (estimate_tokens, parse_score_0_10).
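As a rough illustration of what a score-parsing utility of this kind might do, the sketch below pulls the first number out of an LLM reply and clamps it to the 0-10 range. The name parse_score_sketch and its default fallback value are assumptions made for this example; the package's parse_score_0_10 may behave differently.

```r
# Hypothetical stand-in for a 0-10 score parser; not the package's implementation.
parse_score_sketch <- function(text, default = 0) {
  # Find the first integer or decimal number in the reply.
  m <- regmatches(text, regexpr("[0-9]+(\\.[0-9]+)?", text))
  if (length(m) == 0) return(default)   # no number found: use the fallback
  score <- as.numeric(m)
  min(max(score, 0), 10)                # clamp to the 0-10 range
}

parse_score_sketch("Score: 7")
parse_score_sketch("I'd say 12 out of 10")
parse_score_sketch("no idea")
```

Clamping keeps an out-of-range reply from dominating speaker selection, and the fallback keeps a malformed reply from stopping the run.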
Data Model
- Agents: named list of FGAgent (keys are agent IDs). One agent is the moderator (moderator_id).
- Conversation Log: FocusGroup$conversation_log is an ordered list of message records:
  - turn: integer
  - speaker_id: character (agent ID or “System”)
  - is_moderator: logical
  - text: character
  - timestamp: POSIXct
  - phase: character (e.g., opening, engagement_question, exploration_question, closing, setup). The roster is logged as a System message under the setup phase. The final summary is not logged here; it is stored in FocusGroup$final_summary.
  - response_id, finish_reason: character (from LLMR)
  - sent_tokens, rec_tokens, total_tokens: integers
  - duration_s: numeric
  - provider, model: character
- Question Script: list of phase steps, each as list(phase = <name>, text = <optional question>).
- Prompt Templates: a named list of strings keyed by prompt role/phase (see API reference).
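To make the shapes above concrete, here is a minimal base R sketch of a few log records, a flattened transcript, and a question script. The helper make_record, the participant ID P1, and all sample text are invented for illustration; only the field names come from the data model described above.

```r
# Build one illustrative log record with a subset of the documented fields.
make_record <- function(turn, speaker_id, text, phase, is_moderator = FALSE) {
  list(
    turn = as.integer(turn),
    speaker_id = speaker_id,
    is_moderator = is_moderator,
    text = text,
    timestamp = Sys.time(),
    phase = phase
  )
}

conv_log <- list(
  make_record(0, "System", "Roster: MOD, P1", "setup"),
  make_record(1, "MOD", "Welcome, everyone.", "opening", is_moderator = TRUE),
  make_record(2, "P1", "Happy to be here.", "opening")
)

# Flatten the list of records into a data frame transcript.
transcript <- do.call(rbind, lapply(conv_log, function(r) {
  data.frame(turn = r$turn, speaker_id = r$speaker_id,
             is_moderator = r$is_moderator, phase = r$phase,
             text = r$text, stringsAsFactors = FALSE)
}))

# A question script is a list of phase steps; text is optional.
script <- list(
  list(phase = "opening"),
  list(phase = "engagement_question", text = "What matters most to you here?"),
  list(phase = "closing")
)
```

Keeping the log as a list of records makes appending cheap during the run, while the data frame form is more convenient for analysis.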
Simulation Flow
- Initialization
  - Create agents (participants + moderator) with personas.
  - Select a conversation flow (turn-taking strategy).
  - Prepare a question script across phases (or use defaults).
  - Prepare prompts (defaults merged with any user overrides).
- Run Loop (FocusGroup$run_simulation)
  - Log roster as a System message to provide context for all agents.
  - For each turn until the script ends or the num_turns maximum is reached:
    - Determine current phase and question text (if applicable).
    - Moderator speaks using a phase-specific prompt, with placeholders populated from topic, purpose, recent history, participation stats, and current question. The moderator utterance is generated by the moderator agent via FGAgent$generate_utterance and logged.
    - If the phase expects participant responses (e.g., icebreaker/engagement/exploration/generic), use turn_taking_flow$select_next_speaker(self) to choose participants. Up to 3 participants may respond in sequence before returning control to the moderator, unless the flow selects the moderator (then the participant round ends).
    - After each utterance, update per-group token totals from metadata.
    - Context Management: If recent token usage or message count passes thresholds, create an interim summary (FocusGroup$summarize with level 3) and store it in current_conversation_summary so subsequent prompts use a windowed history plus summary.
  - When the phase is closing, end the simulation and generate a final (level 1) summary. The final summary is stored in FocusGroup$final_summary; it is not appended to conversation_log.
- Outputs
  - conversation_log is the canonical transcript source.
  - Wrappers optionally return a data frame transcript, summary text, basic stats, and more.
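The participant-response round described above can be sketched in a few lines of base R. This is a simplified model under stated assumptions: select_fn stands in for turn_taking_flow$select_next_speaker() and returns an agent ID rather than an FGAgent, and the names run_participant_round and make_toy_selector are hypothetical.

```r
# One participant-response round: at most `max_responses` speakers, ending early
# if the selector returns NULL or hands control back to the moderator.
run_participant_round <- function(select_fn, moderator_id, max_responses = 3) {
  speakers <- character(0)
  for (i in seq_len(max_responses)) {
    next_id <- select_fn()
    if (is.null(next_id) || identical(next_id, moderator_id)) break
    # In the real loop the selected agent would generate and log an utterance here.
    speakers <- c(speakers, next_id)
  }
  speakers
}

# Toy selector that walks through a fixed queue of IDs, keeping its own position.
make_toy_selector <- function(queue) {
  pos <- 0
  function() {
    pos <<- pos + 1
    if (pos > length(queue)) NULL else queue[pos]
  }
}

round_speakers <- run_participant_round(
  make_toy_selector(c("P1", "P3", "MOD", "P2")),
  moderator_id = "MOD"
)
```

In this toy run the round ends after P1 and P3 because the selector then picks the moderator.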
Turn-Taking Strategies
- RoundRobinFlow
  - Maintains a simple index over participant_ids and returns the next participant.
- ProbabilisticFlow
  - Maintains propensities and base_propensities for all agents.
  - After a speaker is selected, that speaker’s propensity resets toward 0 (participants) or halves (moderator), while others recover toward their base at a configured rate.
  - Selection draws proportionally to current propensities, with simple guardrails against immediate self-succession by a participant.
- DesireBasedFlow
  - Computes per-participant desire scores (0-10) via LLM prompts tailored with persona, current question, last speaker, and recent history. Scoring is broadcast within groups of agents that share one model config: broadcasting requires a single config, so agents with differing providers/models/params are grouped separately and scored per group. The common case (all agents share a config) is one broadcast. The path that ran is recorded in last_scoring_mode.
  - Enforces a minimum score threshold; if none clears it, falls back to the maximum-scored participant to keep discussion moving.
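The grouping step for desire scoring can be pictured as partitioning agents by a config signature. The sketch below uses plain lists in place of LLMR::llm_config objects, and the provider and model names are placeholders. The mapping from group count to a last_scoring_mode value is an assumption inferred from the mode names, not confirmed behavior.

```r
# Hypothetical per-agent configs, reduced to plain lists for illustration.
configs <- list(
  P1 = list(provider = "provider_a", model = "model_a", temperature = 0.7),
  P2 = list(provider = "provider_a", model = "model_a", temperature = 0.7),
  P3 = list(provider = "provider_b", model = "model_b", temperature = 0.7)
)

# Build a signature per config so identical configs collapse to one key.
signature_of <- function(cfg) {
  cfg <- cfg[order(names(cfg))]
  paste(names(cfg), unlist(cfg), sep = "=", collapse = "|")
}
signatures <- vapply(configs, signature_of, character(1))

# Agents sharing a signature could be scored together in one broadcast call.
groups <- split(names(configs), signatures)

# Assumed mapping: one group means a single shared broadcast, several mean grouped.
scoring_mode <- if (length(groups) == 1) {
  "broadcast_shared_config"
} else {
  "broadcast_grouped_config"
}
```

Here P1 and P2 land in one group and P3 in another, so two broadcast calls would be needed instead of three per-agent calls.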
Prompt System
- Moderator prompts are provided for phases: opening, icebreaker_question, engagement_question, exploration_question, probing_focused, summarizing, transition, manage_participation, ending_question, closing, plus a generic fallback.
- Participant prompts include utterance generation and desire-to-talk scoring.
- Helper prompts include question suggestion, persona generation, and thematic analysis.
- Placeholders support topic, purpose, persona, conversation history, current question, participation stats (dominant/quiet speakers), and more. Unknown placeholders are preserved by replace_placeholders_known to avoid accidental removal of tokens used by downstream steps.
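The idea of preserving unknown placeholders can be shown with a small base R sketch. It assumes a {{name}} placeholder syntax purely for illustration, since the package's actual syntax is not specified here, and fill_known is a hypothetical name rather than the package function.

```r
# Assumes placeholders look like {{name}}; the package's actual syntax may differ.
fill_known <- function(template, values) {
  for (key in names(values)) {
    token <- paste0("{{", key, "}}")
    # fixed = TRUE matches the token literally rather than as a regex.
    template <- gsub(token, as.character(values[[key]]), template, fixed = TRUE)
  }
  template   # placeholders without a matching value are left untouched
}

tmpl <- "Topic: {{topic}}. Question: {{current_question}} Summary: {{summary}}"
filled <- fill_known(tmpl, list(topic = "remote work",
                                current_question = "What changed?"))
```

Because {{summary}} has no supplied value, it survives in the output and a later step can still fill it in.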
LLM Integration and Cost Controls
- Each FGAgent has a dedicated LLMR::llm_config (provider, model, temperature, max_tokens). Group-level admin tasks (summaries, thematic analysis) use FocusGroup$llm_config_admin.
- Token accounting: per-agent tokens are recorded by FGAgent; group totals are tracked in FocusGroup.
- Context control: history windows are reduced and summarized when thresholds are crossed.
- Defaults constrain max tokens for moderator/participant/desire queries. RoundRobinFlow is cheaper than DesireBasedFlow, which makes additional LLM calls to score each participant’s desire to speak.
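The context-control idea (a windowed history plus an interim summary once thresholds are crossed) might look roughly like the following. The threshold values, the window size, and the function names needs_summary and build_context are all illustrative assumptions, not package defaults.

```r
# Decide whether an interim summary is due; thresholds here are made up.
needs_summary <- function(recent_tokens, n_messages,
                          token_threshold = 3000, message_threshold = 20) {
  recent_tokens > token_threshold || n_messages > message_threshold
}

# Build prompt context from an optional summary plus the last `window` messages.
build_context <- function(messages, summary = NULL, window = 5) {
  recent <- utils::tail(messages, window)
  parts <- c(
    if (!is.null(summary)) paste("Summary so far:", summary),
    recent
  )
  paste(parts, collapse = "\n")
}

messages <- sprintf("P%d: message %d", rep(1:3, length.out = 25), 1:25)
ctx <- if (needs_summary(recent_tokens = 1200, n_messages = length(messages))) {
  build_context(messages, summary = "Participants broadly agree so far.")
} else {
  build_context(messages, window = length(messages))
}
```

With 25 messages the message threshold is crossed, so the prompt context shrinks to one summary line plus the five most recent messages.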
API Reference
This section lists classes and functions, their inputs/outputs, and
behavior. “Exported” denotes public API (present in
NAMESPACE). Internals are stable for package use but may
change more frequently.
R6 Classes (Exported)
FGAgent
- Fields: id, persona_description, communication_style_instruction, model_config, role, demographics (list), survey_responses (list), is_moderator, history (list), tokens_sent_agent, tokens_received_agent.
- initialize(id, agent_details, model_config, is_moderator = FALSE)
  - agent_details: list with optional demographics, survey_responses, direct_persona_description, communication_style.
  - Validates and constructs persona and style (moderator has sensible defaults).
- generate_utterance(topic, conversation_history_string, utterance_prompt_template, max_tokens_utterance = 150, current_moderator_question = "N/A", conversation_summary_so_far = "N/A", current_phase = "discussion") -> list(text, meta)
  - Calls LLM, cleans self-referential openings, retries if incomplete, updates tokens and history.
- get_need_to_talk(topic, conversation_history_string, desire_prompt_template, max_tokens_desire = 20, current_moderator_question = "N/A", last_speaker_id = "N/A", last_utterance_text = "N/A") -> numeric score 0–10
Private helper: construct_persona_elements(details, is_moderator_flag) -> list(description, style_instruction)
FocusGroup
- Fields: topic, purpose, agents (named list of FGAgent), moderator_id, conversation_log (list), turn_taking_flow, prompt_templates (list), question_script (list), current_phase_index, current_question_text, current_conversation_summary, llm_config_admin, max_tokens_utterance, max_tokens_moderator, max_tokens_desire, total_tokens_sent, total_tokens_received.
- initialize(topic, purpose, agents, moderator_id, turn_taking_flow, question_script = list(), prompt_templates = list(), llm_config_admin = NULL, max_tokens_config = list())
  - Merges defaults with overrides, builds minimal script if not provided, sets admin LLM config and token caps.
- run_simulation(num_turns = NULL, verbose = FALSE) -> conversation_log (invisibly)
  - Executes the full loop, manages interim summaries and final summary.
- advance_turn(current_turn_number, verbose = FALSE) -> logical continue
  - Executes one moderator step plus a small participant response round (up to 3) if phase expects responses.
- summarize(llm_config = NULL, summary_level = 1, max_tokens = NULL, internal_call = FALSE, transcript_override = NULL) -> character
  - Levels: 1 (prose overview), 2 (detailed bullets), 3 (short bullets).
- analyze(turns = NULL, speaker_ids = NULL) -> list(speaker_stats tibble, full_transcript character)
- analyze_topics(num_topics = 5, min_doc_length = 20, top_n_terms = 10, turns = NULL, speaker_ids = NULL, ...) -> list or NULL
- analyze_tfidf(top_n_terms = 10, turns = NULL, speaker_ids = NULL, ...) -> tibble(speaker_id, term, tf_idf)
- analyze_readability(measures = "Flesch", turns = NULL, speaker_ids = NULL) -> tibble(measures…, speaker_id)
- analyze_themes(llm_config = NULL, turns = NULL, speaker_ids = NULL) -> character (themes summary; raw response in attribute)
- analyze_statistics(turns = NULL, speaker_ids = NULL) -> list(word_count_anova, phase_participation tibble, turns_words_correlation test)
- analyze_participation_balance(turns = NULL, speaker_ids = NULL) -> list(participation_stats tibble, balance_metrics list)
- analyze_response_patterns(turns = NULL, speaker_ids = NULL) -> list(response_metrics tibble, interaction_metrics tibble)
- analyze_question_patterns(turns = NULL, speaker_ids = NULL) -> list(question_patterns tibble, question_distribution tibble)
- analyze_key_phrases(min_freq = 2, turns = NULL, speaker_ids = NULL) -> list(bigrams df, trigrams df, totals)
- plot_participation_timeline() -> ggplot
- plot_word_count_distribution() -> ggplot
- plot_participation_by_agent() -> ggplot
- plot_turn_length_timeline() -> ggplot
Private helpers: get_next_phase_or_question(), log_message(...), get_filtered_log(...), get_recent_transcript_for_summary(n).
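For orientation, the kind of per-speaker statistics and assembled transcript that analyze() is described as returning can be approximated in base R. The toy transcript and column names below are assumptions for this sketch; the actual method returns a tibble whose columns may differ.

```r
# Toy transcript; in practice this would be derived from conversation_log.
transcript <- data.frame(
  speaker_id = c("MOD", "P1", "P2", "P1", "MOD", "P2"),
  text = c("Welcome to the session.", "Thanks, glad to join.",
           "Hello.", "I think prices matter most.",
           "Why is that?", "Because budgets are tight this year."),
  stringsAsFactors = FALSE
)

# Count whitespace-separated words per utterance.
transcript$words <- lengths(strsplit(trimws(transcript$text), "\\s+"))

# Per-speaker word totals, turn counts, and share of all words.
speaker_stats <- aggregate(words ~ speaker_id, data = transcript, FUN = sum)
turn_counts <- table(transcript$speaker_id)
speaker_stats$turns <- as.integer(turn_counts[speaker_stats$speaker_id])
speaker_stats$share <- speaker_stats$words / sum(speaker_stats$words)

# Single-string transcript, one "speaker: text" line per utterance.
full_transcript <- paste(transcript$speaker_id, transcript$text,
                         sep = ": ", collapse = "\n")
```

A share column like this is one simple input for participation-balance checks such as spotting dominant or quiet speakers.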
ConversationFlow (base)
- Fields: agents, agent_ids, participant_ids, moderator_id, last_speaker_id.
- initialize(agents, moderator_id)
- select_next_speaker(focus_group) -> FGAgent or NULL (abstract)
- update_state_post_selection(speaker_id, focus_group)
RoundRobinFlow
- initialize(agents, moderator_id)
- select_next_speaker(focus_group) -> FGAgent (participant)
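The round-robin behavior amounts to an index that advances and wraps. The closure below is a minimal sketch of that idea; the package implements it as an R6 class, and make_round_robin is a hypothetical name.

```r
# Closure-based sketch of round-robin selection over participant IDs.
make_round_robin <- function(participant_ids) {
  idx <- 0
  function() {
    idx <<- idx %% length(participant_ids) + 1   # advance and wrap around
    participant_ids[idx]
  }
}

next_speaker <- make_round_robin(c("P1", "P2", "P3"))
order_seen <- vapply(1:5, function(i) next_speaker(), character(1))
```

Five calls yield P1, P2, P3, P1, P2. No LLM call is involved in selection, which is why this flow is the cheapest option.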
ProbabilisticFlow
- Fields: propensities, base_propensities, recovery_increment.
- initialize(agents, moderator_id, initial_propensities = NULL, recovery_increment = 0.1)
- select_next_speaker(focus_group) -> FGAgent
- update_state_post_selection(speaker_id, focus_group)
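One plausible reading of the propensity dynamics is sketched below. It assumes the participant speaker resets exactly to 0 and that recovery is additive and capped at the base value; both are assumptions, since the description only says "toward 0" and "at a configured rate". The function names are hypothetical.

```r
# Update rule: participant speaker drops to 0, moderator halves,
# everyone else recovers toward (and never beyond) their base value.
update_propensities <- function(prop, base, speaker_id, moderator_id,
                                recovery_increment = 0.1) {
  others <- setdiff(names(prop), speaker_id)
  prop[others] <- pmin(base[others], prop[others] + recovery_increment)
  prop[speaker_id] <- if (speaker_id == moderator_id) prop[speaker_id] / 2 else 0
  prop
}

# Weighted draw proportional to current propensities.
draw_speaker <- function(prop) {
  if (all(prop <= 0)) return(sample(names(prop), 1))   # guard: nothing has weight
  sample(names(prop), size = 1, prob = prop)
}

base <- c(MOD = 0.5, P1 = 1, P2 = 1, P3 = 1)
prop <- base
set.seed(1)
speaker <- draw_speaker(prop)
prop <- update_propensities(prop, base, speaker, moderator_id = "MOD")
```

Under these assumptions a participant who has just spoken has zero weight on the next draw, which gives the guard against immediate self-succession as a side effect.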
DesireBasedFlow
- Fields: last_desire_scores (named numeric), last_scoring_mode (character: "broadcast_shared_config", "broadcast_grouped_config", or "per_agent"), min_desire_threshold.
- initialize(agents, moderator_id, min_desire_threshold = 3)
- select_next_speaker(focus_group) -> FGAgent (participant); falls back to the maximum-scored participant when none clears the threshold, returning NULL only when there are no participants.
- get_last_desire_scores() -> named numeric
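The selection rule, including the fallback and the NULL case, can be written out as a short base R sketch. It works on a named numeric vector of scores and returns an ID rather than an FGAgent. Treating the threshold as inclusive and breaking ties by first position are assumptions, and select_by_desire is a hypothetical name.

```r
# Pick the highest scorer at or above the threshold; if nobody qualifies,
# fall back to the highest scorer overall; return NULL with no participants.
select_by_desire <- function(scores, min_desire_threshold = 3) {
  if (length(scores) == 0) return(NULL)
  eligible <- scores[scores >= min_desire_threshold]
  pool <- if (length(eligible) > 0) eligible else scores
  names(pool)[which.max(pool)]
}

select_by_desire(c(P1 = 2, P2 = 6, P3 = 4))   # "P2"
select_by_desire(c(P1 = 1, P2 = 2, P3 = 0))   # "P2", via the fallback
select_by_desire(numeric(0))                  # NULL
```

The fallback means the threshold never stalls the discussion; it only distinguishes a willing speaker from a drafted one.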
Factories and Wrappers (Exported)
create_conversation_flow(mode, agents, moderator_id, flow_params = list()) -> ConversationFlow
- mode: "round_robin" | "probabilistic" | "desire_based".
- flow_params: initial_propensities, recovery_increment, min_desire_threshold as applicable.
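A factory of this shape is typically a dispatch on mode with per-mode defaults. The sketch below returns plain lists instead of R6 flow objects so it runs standalone. The names create_flow_sketch and or_default are hypothetical, and the default values are copied from the initialize signatures listed above.

```r
# Return `x` unless it is NULL, in which case return the default.
or_default <- function(x, default) if (is.null(x)) default else x

# Dispatch on mode; unmatched modes fail loudly via the final stop().
create_flow_sketch <- function(mode, participant_ids, flow_params = list()) {
  switch(
    mode,
    round_robin = list(type = "round_robin", ids = participant_ids),
    probabilistic = list(
      type = "probabilistic", ids = participant_ids,
      recovery_increment = or_default(flow_params$recovery_increment, 0.1)
    ),
    desire_based = list(
      type = "desire_based", ids = participant_ids,
      min_desire_threshold = or_default(flow_params$min_desire_threshold, 3)
    ),
    stop("Unknown flow mode: ", mode)
  )
}

flow <- create_flow_sketch("desire_based", c("P1", "P2"),
                           flow_params = list(min_desire_threshold = 5))
```

Adding a new flow under this pattern means adding one named branch to the switch.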
run_focus_group(topic, participants = 6, turns_per_phase = c(Opening = 2, Icebreaker = 3, Engagement = 8, Exploration = 10, Closing = 2), demographics = NULL, survey_responses = NULL, conversation_flow = "desire_based", llm_config = NULL, seed = NULL, verbose = TRUE) -> list
- Returns:
  - focus_group: FocusGroup object
  - conversation: data.frame transcript extracted from conversation_log
  - summary: character (overall summary)
  - basic_stats: from FocusGroup$analyze()
  - participants: list with id, moderator flag, persona, and input data
fg_quick(topic, participants = 6, flow = c("desire_based", "round_robin", "probabilistic"), model_config = NULL, seed = NULL, mode = c("quick", "pro"), verbose = TRUE) -> list
- Returns: transcript tibble, summary character, participants list, totals (tokens and turns), config_meta, and the focus_group.
fg_analyze_quick(res) -> list
- Input: fg_quick() result or a FocusGroup object.
- Returns: basic_stats and short_summary (summary level 3).
analyze_focus_group(focus_group_result, num_topics = 5, include_plots = TRUE, sentiment_method = "afinn") -> list
- Input: a FocusGroup object or the result of run_focus_group().
- Returns: list { basic_stats, topics, sentiment = NULL, tfidf, readability, themes, plots (optional) }.
Agent Creation and Survey Integration
create_diverse_agents(n_participants, demographics = NULL, survey_responses = NULL, llm_config = NULL) -> list of FGAgent
- Creates n_participants participants plus one moderator (MOD).
- If demographics/survey_responses are missing, uses generators below.
- Each participant’s persona is built via generate_persona().
create_agents_from_survey(n_participants, survey_path, demographic_vars = NULL, survey_vars = NULL, llm_config = NULL) -> list of FGAgent
- Loads haven-labeled survey files (.dta, .sav, .sas7bdat) and converts value labels to characters.
- Auto-detects common demographics and survey variables (with ANES 2024-specific mappings).
- Samples rows to create agent inputs and delegates to create_diverse_agents().
generate_diverse_demographics(n) -> data.frame (internal)
- Columns: age, gender, education, income, location.
Prompts and Formatting
get_default_prompt_templates() -> named list (exported)
- Keys (non-exhaustive):
  - Participant: participant_utterance_subtle_persona, participant_desire_to_talk_nuanced.
  - Moderator: moderator_opening, moderator_icebreaker_question, moderator_engagement_question, moderator_exploration_question, moderator_probing_focused, moderator_summarizing, moderator_transition, moderator_manage_participation, moderator_ending_question, moderator_closing, moderator_generic_utterance.
  - Helpers: suggest_questions_prompt, generate_persona_prompt, thematic_analysis_prompt, sentiment_analysis_prompt.
format_demographics(demographics) -> character (exported)
- Filters empty values and returns “key: value; key: value; …” or an informative fallback string.
format_survey_responses(survey_responses) -> character (exported)
- Produces a readable multi-line block from a named list of question-answer pairs.
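The behavior described for format_demographics (drop empty values, join the rest as "key: value" pairs) can be approximated as follows. This is a hypothetical re-implementation for illustration; the fallback wording in particular is invented and will not match the package's string.

```r
# Hypothetical sketch of demographic formatting; not the exported function.
format_demographics_sketch <- function(demographics) {
  values <- vapply(demographics, function(v) {
    # Treat NULL, zero-length, and NA entries as empty.
    if (is.null(v) || length(v) == 0 || is.na(v[1])) "" else trimws(as.character(v[1]))
  }, character(1))
  values <- values[nzchar(values)]            # drop empty entries
  if (length(values) == 0) return("No demographic information provided.")
  paste(names(values), values, sep = ": ", collapse = "; ")
}

format_demographics_sketch(list(age = 34, gender = "", education = "BA", income = NA))
```

Filtering before joining keeps blank survey fields from showing up in persona prompts as dangling keys.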
Placeholder Utilities
- replace_placeholders (exported): populates dynamic context in prompt templates.
- replace_placeholders_known (internal): populates dynamic context while preserving unknown placeholders, so tokens used by downstream steps are not removed.
Visualization
All plotting methods live on FocusGroup and return ggplot objects:
- plot_participation_timeline()
- plot_word_count_distribution()
- plot_participation_by_agent()
- plot_turn_length_timeline()
Conversation Log Schema (Detailed)
Each log entry is created via a private logger and includes:
- turn: integer (auto-assigned sequentially; system messages may use the current turn context)
- speaker_id: character (agent ID or “System”)
- is_moderator: logical (inferred for moderator ID if not provided)
- text: character (utterance or system content)
- timestamp: POSIXct (time of logging)
- phase: character (script phase or a special marker such as setup; the final summary is kept in FocusGroup$final_summary rather than logged)
- response_id: character (LLMR response ID when available)
- finish_reason: character (e.g., stop, length)
- sent_tokens, rec_tokens, total_tokens: integers (LLMR token accounting when available)
- duration_s: numeric (provider-reported duration)
- provider, model: character
Error Handling and Edge Cases
- Strict input validation on constructors and wrappers (e.g., non-empty topic, presence of moderator, correct flow types).
- Missing prompts fall back to a generic moderator template.
- Empty logs yield informative messages and guarded behaviors in analysis and plotting.
- Topic modeling, TF-IDF, and readability analyses return NULL or empty tibbles with a warning when data is insufficient.
Extensibility Guidelines
- New Conversation Flow
  - Subclass ConversationFlow, implement select_next_speaker() and (optionally) override update_state_post_selection().
  - Add a case to create_conversation_flow().
- Custom Prompts
  - Retrieve defaults with get_default_prompt_templates(), modify any entries, and pass them to FocusGroup$new(..., prompt_templates = ...).
- New Analyses
  - Add methods on FocusGroup using private$get_filtered_log() to reuse consistent filtering and transcript assembly.
- Alternative LLM Providers
  - Configure per-agent LLMR::llm_config accordingly. Group-level tasks use llm_config_admin if set; else fall back to the moderator’s config.
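The custom-prompt workflow boils down to merging a named list of overrides into a named list of defaults. The sketch below uses a small invented list in place of get_default_prompt_templates() so it runs without the package. The template texts and the get_template helper are hypothetical, and whether the package itself merges with modifyList() is not confirmed.

```r
# Stand-in for get_default_prompt_templates(); the template text is invented.
defaults <- list(
  moderator_opening = "Welcome the group and introduce the topic.",
  moderator_closing = "Thank participants and wrap up.",
  moderator_generic_utterance = "Keep the discussion moving."
)

overrides <- list(moderator_opening = "Open briefly and state the ground rules.")

# modifyList() keeps every default and replaces only the overridden keys.
templates <- utils::modifyList(defaults, overrides)

# A missing phase prompt falls back to the generic moderator template.
get_template <- function(templates, key) {
  if (!is.null(templates[[key]])) templates[[key]] else templates[["moderator_generic_utterance"]]
}

get_template(templates, "moderator_transition")
```

Starting from the full default set and overriding a few keys avoids accidentally dropping templates that other phases still need.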
Directory Structure
- R/FGAgent.R: agent class
- R/FocusGroup.R: group orchestration, analysis, plots
- R/ConversationFlow.R: base and flow implementations; create_conversation_flow()
- R/convenience_wrappers.R: wrappers, analysis entry points, data generators, survey integration
- R/prompts.R: default prompt templates
- R/utils.R: formatting, placeholders, prompt history and token utilities
- R/zzz.R: package docs and .onLoad options
Exported Symbols (Public API)
- Classes: FGAgent, FocusGroup, ConversationFlow, RoundRobinFlow, ProbabilisticFlow, DesireBasedFlow
- Functions: run_focus_group, fg_quick, fg_analyze_quick, analyze_focus_group, create_diverse_agents, create_agents_from_survey, create_conversation_flow, get_default_prompt_templates, format_demographics, format_survey_responses, format_conversation_history, replace_placeholders
Internals documented but not exported:
generate_diverse_demographics,
generate_survey_responses, generate_persona,
estimate_tokens, parse_score_0_10,
make_prompt_history,
replace_placeholders_known.
Known Limitations and Notes
- Desire-based flow can be costlier (more LLM calls) than round-robin; choose based on budget/needs.
- Summarization uses LLM and contributes to token usage; thresholds are conservative defaults.
- Topic modeling requires sufficient term diversity; small or very short transcripts may not yield meaningful results.
- Sentiment analysis helpers are present in prompts, but sentiment computation is not currently part of the exported analysis pipeline.
Quick Usage Patterns
- Rapid run:
  fg_quick(topic = "...", participants = 4)
- Full run and analyze:
  res <- run_focus_group(topic = "...", participants = 6)
  analysis <- analyze_focus_group(res, num_topics = 5, include_plots = TRUE)
Change Impact Matrix (High-Level)
- Prompt changes: affect agent outputs; no structural changes.
- Flow changes: affect speaker selection; ensure selection invariants (participants vs. moderator) align with phase logic.
- Analysis additions: safe; prefer using private$get_filtered_log() and return tibbles/lists consistently.
- LLM config changes: affect cost/latency; ensure token caps are respected.
This blueprint should be kept in sync with Roxygen documentation and
NAMESPACE. Update when classes, flows, prompts, or analysis
capabilities change.
