FocusGroup Package Blueprint

This blueprint explains the architecture, logical flow, data model, and API surface of the FocusGroup R package. It is intended as a single comprehensive reference for advanced users and contributors.

Overview

FocusGroup simulates and analyzes focus group discussions using LLM-backed agents. The core is built around R6 classes:

  • FGAgent: individual participant/moderator with persona, communication style, and per-agent LLM config.
  • FocusGroup: orchestrates the session across phases, manages conversation history, performs summaries and analyses, and renders visualizations.
  • ConversationFlow: abstract turn-taking mechanism with concrete implementations: RoundRobinFlow, ProbabilisticFlow, DesireBasedFlow.

High-level wrappers (run_focus_group, fg_quick) assemble agents, build a script, run the simulation, and return structured outputs. Analysis helpers (analyze_focus_group, fg_analyze_quick) compute statistics, topics, TF-IDF, readability, and LLM-assisted themes.

LLM integration is provided via the LLMR package. Prompts are modular and customizable via get_default_prompt_templates(). Utilities handle placeholder substitution, prompt history formatting, token estimation, and score parsing.

Architecture

Components and Responsibilities

  • FGAgent (R6)
    • Encapsulates persona, style, demographics, survey responses, and an LLMR::llm_config.
    • Generates utterances from prompt templates.
    • Computes “desire to talk” scores for DesireBasedFlow.
    • Tracks per-agent token usage and utterance history.
  • FocusGroup (R6)
    • Holds topic, purpose, named list of agents, moderator_id, turn_taking_flow, prompt templates, and a question script (phase guide).
    • Runs the simulation with a controlled loop, including moderator turns, participant-response rounds, interim summarization, and final summary.
    • Logs all messages with rich metadata to conversation_log.
    • Provides analysis methods and plotting utilities.
  • ConversationFlow (R6 base) and subclasses
    • Defines interface: select_next_speaker(focus_group) and update_state_post_selection(...).
    • RoundRobinFlow: cycles through non-moderator participants.
    • ProbabilisticFlow: weighted random selection with recovery dynamics.
    • DesireBasedFlow: queries LLM(s) to score each participant’s desire to speak and selects the highest above a threshold, with fallbacks.
  • Prompts and Utilities
    • get_default_prompt_templates(): returns named prompt templates for moderator phases, participant utterances, and helper analyses.
    • Placeholder utilities (replace_placeholders, replace_placeholders_known) populate dynamic context.
    • History utilities (format_conversation_history, make_prompt_history) generate concise context windows, optionally including an interim summary.
    • Token/score utilities (estimate_tokens, parse_score_0_10).

Data Model

  • Agents: Named list of FGAgent (keys are agent IDs). One agent is the moderator (moderator_id).
  • Conversation Log: FocusGroup$conversation_log is an ordered list of message records:
    • turn: integer
    • speaker_id: character (agent ID or “System”)
    • is_moderator: logical
    • text: character
    • timestamp: POSIXct
    • phase: character (e.g., opening, engagement_question, exploration_question, closing, setup). The roster is logged as a System message under the setup phase. The final summary is not logged here; it is stored in FocusGroup$final_summary.
    • response_id, finish_reason: character (from LLMR)
    • sent_tokens, rec_tokens, total_tokens: integers
    • duration_s: numeric
    • provider, model: character
  • Question Script: list of phase steps, each as list(phase = <name>, text = <optional question>).
  • Prompt Templates: a named list of strings keyed by prompt role/phase (see API reference).
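
To make the record shapes concrete, here is a minimal sketch of one log entry and a short question script built as plain R lists. The field names follow the list above; every value (IDs, text, token counts, provider and model strings) is invented for illustration, and the package's private logger may build entries differently.

```r
# One conversation-log record; all values are illustrative placeholders.
entry <- list(
  turn = 1L,
  speaker_id = "MOD",
  is_moderator = TRUE,
  text = "Welcome, everyone.",
  timestamp = Sys.time(),
  phase = "opening",
  response_id = NA_character_,
  finish_reason = "stop",
  sent_tokens = 120L,
  rec_tokens = 35L,
  total_tokens = 155L,
  duration_s = 0.8,
  provider = "example_provider",  # hypothetical value
  model = "example_model"         # hypothetical value
)

# A question script: each step names a phase and optionally carries a question.
question_script <- list(
  list(phase = "opening"),
  list(phase = "engagement_question", text = "What comes to mind first?"),
  list(phase = "closing")
)

# Flatten a log (here a single entry) into a data frame transcript.
log <- list(entry)
transcript <- do.call(rbind, lapply(log, function(e) {
  data.frame(turn = e$turn, speaker_id = e$speaker_id, phase = e$phase,
             text = e$text, stringsAsFactors = FALSE)
}))
```

The flattening step mirrors what the wrappers are described as doing when they return a data frame transcript, but it is only one plausible way to do it.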

Simulation Flow

  1. Initialization
    • Create agents (participants + moderator) with personas.
    • Select a conversation flow (turn-taking strategy).
    • Prepare a question script across phases (or use defaults).
    • Prepare prompts (defaults merged with any user overrides).
  2. Run Loop (FocusGroup$run_simulation)
    • Log roster as a System message to provide context for all agents.
    • For each turn, until the script ends or num_turns is reached:
      1. Determine current phase and question text (if applicable).
      2. Moderator speaks using phase-specific prompt, with placeholders populated from topic, purpose, recent history, participation stats, and current question. The moderator utterance is generated by the moderator agent via FGAgent$generate_utterance and logged.
      3. If phase expects participant responses (e.g., icebreaker/engagement/exploration/generic), use turn_taking_flow$select_next_speaker(self) to choose participants. Up to 3 participants may respond in sequence before returning control to the moderator, unless the flow selects the moderator (then the participant round ends).
      4. After each utterance, update per-group token totals from metadata.
      5. Context Management: If recent token usage or message count passes thresholds, create an interim summary (FocusGroup$summarize with level 3) and store it in current_conversation_summary so subsequent prompts use a windowed history plus summary.
    • When phase is closing, end simulation and generate a final (level 1) summary. The final summary is stored in FocusGroup$final_summary; it is not appended to conversation_log.
  3. Outputs
    • conversation_log is the canonical transcript source.
    • Wrappers optionally return a data frame transcript, summary text, basic stats, and more.
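
The participant-response round in step 3 can be sketched in base R as follows. This is not the package's code: make_stub_selector() and participant_round() are hypothetical helpers, and the stub stands in for turn_taking_flow$select_next_speaker(self) by returning speaker IDs from a fixed queue.

```r
# Stub selector standing in for the flow's speaker selection (hypothetical).
# It returns IDs from a fixed queue, then NULL once the queue is exhausted.
make_stub_selector <- function(queue) {
  i <- 0L
  function() {
    i <<- i + 1L
    if (i > length(queue)) NULL else queue[[i]]
  }
}

# One participant round: at most `max_responses` speakers, ending early
# if the selector returns NULL or picks the moderator.
participant_round <- function(select_next, moderator_id, max_responses = 3L) {
  speakers <- character(0)
  while (length(speakers) < max_responses) {
    id <- select_next()
    if (is.null(id) || identical(id, moderator_id)) break
    speakers <- c(speakers, id)
  }
  speakers
}

# Four candidates queued: the round stops after three responses.
round1 <- participant_round(make_stub_selector(c("P1", "P2", "P3", "P4")), "MOD")
# The moderator is selected second: the round ends after one response.
round2 <- participant_round(make_stub_selector(c("P1", "MOD", "P2")), "MOD")
```

The cap of three responses comes from the description above; treating a NULL selection as the end of the round is an assumption of this sketch.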

Turn-Taking Strategies

  • RoundRobinFlow
    • Maintains a simple index over participant_ids and returns the next participant.
  • ProbabilisticFlow
    • Maintains propensities and base_propensities for all agents.
    • After a speaker is selected, that speaker’s propensity resets toward 0 (participants) or halves (moderator), while others recover toward their base at a configured rate.
    • Selection draws proportionally to current propensities, with simple guardrails against immediate self-succession by a participant.
  • DesireBasedFlow
    • Computes per-participant desire scores (0-10) via LLM prompts tailored with persona, current question, last speaker, and recent history. Scoring is broadcast within groups of agents that share one model config: broadcasting requires a single config, so agents with differing providers/models/params are grouped separately and scored per group. The common case (all agents share a config) is one broadcast. The path that ran is recorded in last_scoring_mode.
    • Enforces a minimum score threshold; if none clears it, falls back to the maximum-scored participant to keep discussion moving.
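
The ProbabilisticFlow dynamics described above can be sketched in base R. The exact update rule is not specified here, so this sketch assumes that a participant who speaks drops to 0, that the moderator halves, and that everyone else recovers additively up to their base value. update_propensities() and draw_speaker() are hypothetical names, not package functions.

```r
# Update propensities after `speaker_id` has spoken (assumed rule, see lead-in).
update_propensities <- function(p, base, speaker_id, moderator_id,
                                recovery_increment = 0.1) {
  others <- setdiff(names(p), speaker_id)
  # Non-speakers recover additively, capped at their base propensity.
  p[others] <- pmin(base[others], p[others] + recovery_increment)
  # The speaker is damped: the moderator halves, a participant drops to 0.
  p[speaker_id] <- if (identical(speaker_id, moderator_id)) p[[speaker_id]] / 2 else 0
  p
}

# Draw the next speaker proportionally to the current propensities.
draw_speaker <- function(p, last_speaker_id, moderator_id) {
  w <- p
  # Guardrail: a participant may not immediately follow themselves.
  if (!is.null(last_speaker_id) && !identical(last_speaker_id, moderator_id)) {
    w[last_speaker_id] <- 0
  }
  if (sum(w) <= 0) w[] <- 1  # degenerate case: fall back to uniform weights
  sample(names(w), size = 1, prob = w)
}

base <- c(MOD = 1, P1 = 1, P2 = 1)
p <- c(MOD = 1, P1 = 0.5, P2 = 1)
p_after <- update_propensities(p, base, "P2", "MOD")  # P2 has just spoken
set.seed(1)
nxt <- draw_speaker(p_after, last_speaker_id = "P2", moderator_id = "MOD")
```

After the update, P2 sits at 0 and P1 has recovered from 0.5 to 0.6, so the next draw can only pick MOD or P1.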

Prompt System

  • Moderator prompts are provided for phases: opening, icebreaker_question, engagement_question, exploration_question, probing_focused, summarizing, transition, manage_participation, ending_question, closing, plus a generic fallback.
  • Participant prompts include utterance generation and desire-to-talk scoring.
  • Helper prompts include question suggestion, persona generation, and thematic analysis.
  • Placeholders support topic, purpose, persona, conversation history, current question, participation stats (dominant/quiet speakers), and more. Unknown placeholders are preserved by replace_placeholders_known to avoid accidental removal of tokens used by downstream steps.

LLM Integration and Cost Controls

  • Each FGAgent has a dedicated LLMR::llm_config (provider, model, temperature, max_tokens). Group-level admin tasks (summaries, thematic analysis) use FocusGroup$llm_config_admin.
  • Token accounting: per-agent tokens are recorded by FGAgent; group totals tracked in FocusGroup.
  • Context control: history windows are reduced and summarized when thresholds are crossed.
  • Defaults cap max tokens for moderator, participant, and desire queries. RoundRobinFlow is cheaper than DesireBasedFlow because it makes no desire-scoring LLM calls.
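
The context-control check can be illustrated with a small sketch. The actual thresholds are not stated in this document, so the values 3000 tokens and 20 messages below are hypothetical, as is the function name needs_interim_summary().

```r
# Decide whether recent activity warrants an interim summary.
# Both thresholds are hypothetical defaults chosen for illustration only.
needs_interim_summary <- function(recent_entries,
                                  token_threshold = 3000,
                                  message_threshold = 20) {
  tokens <- sum(vapply(recent_entries, function(e) {
    # Token metadata may be missing (e.g., System messages); count it as 0.
    if (is.null(e$total_tokens) || is.na(e$total_tokens)) 0 else as.numeric(e$total_tokens)
  }, numeric(1)))
  tokens >= token_threshold || length(recent_entries) >= message_threshold
}

heavy <- replicate(5, list(total_tokens = 700L), simplify = FALSE)  # 3500 tokens
light <- list(list(total_tokens = 100L), list(total_tokens = NA_integer_))
heavy_flag <- needs_interim_summary(heavy)
light_flag <- needs_interim_summary(light)
```

When the check fires, the run loop is described as calling FocusGroup$summarize at level 3 and storing the result in current_conversation_summary.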

Reproducibility and Options

  • run_focus_group and fg_quick accept a seed argument; create_agents_from_survey and the utilities read getOption("focusgroup.seed") when it is set (the option is registered in .onLoad).
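
A minimal sketch of how a seed might be resolved and applied. The precedence shown (explicit argument first, then the option) is an assumption, and resolve_seed() is a hypothetical helper rather than a package function.

```r
# Resolve a seed: an explicit argument wins, then the option, else NULL.
# This precedence is assumed for illustration.
resolve_seed <- function(seed = NULL) {
  if (!is.null(seed)) return(seed)
  getOption("focusgroup.seed", default = NULL)
}

old <- options(focusgroup.seed = 42L)  # set the option, keep the old value
seed <- resolve_seed()
set.seed(seed)
first_draw <- sample(10, 3)
set.seed(seed)
second_draw <- sample(10, 3)           # same seed, same draw
options(old)                           # restore the previous option value
```

Seeding fixes R-side randomness such as sampling; it does not by itself make LLM responses deterministic.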

API Reference

This section lists classes and functions, their inputs/outputs, and behavior. “Exported” denotes public API (present in NAMESPACE). Internals are stable for package use but may change more frequently.

R6 Classes (Exported)

FGAgent

  • Fields: id, persona_description, communication_style_instruction, model_config, role, demographics (list), survey_responses (list), is_moderator, history (list), tokens_sent_agent, tokens_received_agent.
  • initialize(id, agent_details, model_config, is_moderator = FALSE)
    • agent_details: list with optional demographics, survey_responses, direct_persona_description, communication_style.
    • Validates and constructs persona and style (moderator has sensible defaults).
  • generate_utterance(topic, conversation_history_string, utterance_prompt_template, max_tokens_utterance = 150, current_moderator_question = "N/A", conversation_summary_so_far = "N/A", current_phase = "discussion") -> list(text, meta)
    • Calls LLM, cleans self-referential openings, retries if incomplete, updates tokens and history.
  • get_need_to_talk(topic, conversation_history_string, desire_prompt_template, max_tokens_desire = 20, current_moderator_question = "N/A", last_speaker_id = "N/A", last_utterance_text = "N/A") -> numeric score 0–10

Private helper: construct_persona_elements(details, is_moderator_flag) -> list(description, style_instruction)

FocusGroup

  • Fields: topic, purpose, agents (named list of FGAgent), moderator_id, conversation_log (list), turn_taking_flow, prompt_templates (list), question_script (list), current_phase_index, current_question_text, current_conversation_summary, final_summary, llm_config_admin, max_tokens_utterance, max_tokens_moderator, max_tokens_desire, total_tokens_sent, total_tokens_received.
  • initialize(topic, purpose, agents, moderator_id, turn_taking_flow, question_script = list(), prompt_templates = list(), llm_config_admin = NULL, max_tokens_config = list())
    • Merges defaults with overrides, builds minimal script if not provided, sets admin LLM config and token caps.
  • run_simulation(num_turns = NULL, verbose = FALSE) -> conversation_log (invisibly)
    • Executes the full loop, manages interim summaries and final summary.
  • advance_turn(current_turn_number, verbose = FALSE) -> logical continue
    • Executes one moderator step plus a small participant response round (up to 3) if phase expects responses.
  • summarize(llm_config = NULL, summary_level = 1, max_tokens = NULL, internal_call = FALSE, transcript_override = NULL) -> character
    • Levels: 1 (prose overview), 2 (detailed bullets), 3 (short bullets).
  • analyze(turns = NULL, speaker_ids = NULL) -> list(speaker_stats tibble, full_transcript character)
  • analyze_topics(num_topics = 5, min_doc_length = 20, top_n_terms = 10, turns = NULL, speaker_ids = NULL, ...) -> list or NULL
  • analyze_tfidf(top_n_terms = 10, turns = NULL, speaker_ids = NULL, ...) -> tibble(speaker_id, term, tf_idf)
  • analyze_readability(measures = "Flesch", turns = NULL, speaker_ids = NULL) -> tibble(measures…, speaker_id)
  • analyze_themes(llm_config = NULL, turns = NULL, speaker_ids = NULL) -> character (themes summary; raw response in attribute)
  • analyze_statistics(turns = NULL, speaker_ids = NULL) -> list(word_count_anova, phase_participation tibble, turns_words_correlation test)
  • analyze_participation_balance(turns = NULL, speaker_ids = NULL) -> list(participation_stats tibble, balance_metrics list)
  • analyze_response_patterns(turns = NULL, speaker_ids = NULL) -> list(response_metrics tibble, interaction_metrics tibble)
  • analyze_question_patterns(turns = NULL, speaker_ids = NULL) -> list(question_patterns tibble, question_distribution tibble)
  • analyze_key_phrases(min_freq = 2, turns = NULL, speaker_ids = NULL) -> list(bigrams df, trigrams df, totals)
  • plot_participation_timeline() -> ggplot
  • plot_word_count_distribution() -> ggplot
  • plot_participation_by_agent() -> ggplot
  • plot_turn_length_timeline() -> ggplot

Private helpers: get_next_phase_or_question(), log_message(...), get_filtered_log(...), get_recent_transcript_for_summary(n).

ConversationFlow (base)

  • Fields: agents, agent_ids, participant_ids, moderator_id, last_speaker_id.
  • initialize(agents, moderator_id)
  • select_next_speaker(focus_group) -> FGAgent or NULL (abstract)
  • update_state_post_selection(speaker_id, focus_group)

RoundRobinFlow

  • initialize(agents, moderator_id)
  • select_next_speaker(focus_group) -> FGAgent (participant)
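
Round-robin selection amounts to a wrapping index over participant_ids. The closure below is a minimal base-R sketch of that idea; make_round_robin() is a hypothetical name, and the real class returns FGAgent objects rather than IDs.

```r
# Build a selector that cycles through participant IDs in order.
make_round_robin <- function(participant_ids) {
  idx <- 0L
  function() {
    idx <<- idx %% length(participant_ids) + 1L  # wrap back to 1 after the last ID
    participant_ids[[idx]]
  }
}

next_speaker <- make_round_robin(c("P1", "P2", "P3"))
picked <- vapply(1:5, function(i) next_speaker(), character(1))
# picked is "P1" "P2" "P3" "P1" "P2"
```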

ProbabilisticFlow

  • Fields: propensities, base_propensities, recovery_increment.
  • initialize(agents, moderator_id, initial_propensities = NULL, recovery_increment = 0.1)
  • select_next_speaker(focus_group) -> FGAgent
  • update_state_post_selection(speaker_id, focus_group)

DesireBasedFlow

  • Fields: last_desire_scores (named numeric), last_scoring_mode (character: "broadcast_shared_config", "broadcast_grouped_config", or "per_agent"), min_desire_threshold.
  • initialize(agents, moderator_id, min_desire_threshold = 3)
  • select_next_speaker(focus_group) -> FGAgent (participant); falls back to the maximum-scored participant when none clears the threshold, returning NULL only when there are no participants.
  • get_last_desire_scores() -> named numeric
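
The selection rule, once scores are available, can be sketched without any LLM calls. select_by_desire() is a hypothetical helper; whether the real comparison is strict or inclusive, and how ties are broken, are assumptions here (inclusive, first maximum wins).

```r
# Pick a speaker from named desire scores (0-10).
# Returns NULL when there are no participants.
select_by_desire <- function(scores, min_desire_threshold = 3) {
  if (length(scores) == 0) return(NULL)
  top <- names(scores)[[which.max(scores)]]  # first maximum wins ties
  list(
    speaker_id = top,
    # TRUE when no one cleared the threshold and the pick is a fallback.
    used_fallback = max(scores) < min_desire_threshold
  )
}

eager <- select_by_desire(c(P1 = 2, P2 = 7, P3 = 5))
quiet <- select_by_desire(c(P1 = 1, P2 = 2))
nobody <- select_by_desire(numeric(0))
```

Under the rule as described, the highest scorer is chosen either way, so in this sketch the threshold only determines whether the pick counts as a fallback. The package may apply further conditions that are not documented here.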

Factories and Wrappers (Exported)

create_conversation_flow(mode, agents, moderator_id, flow_params = list()) -> ConversationFlow

  • mode: "round_robin" | "probabilistic" | "desire_based".
  • flow_params: initial_propensities, recovery_increment, min_desire_threshold as applicable.

run_focus_group(topic, participants = 6, turns_per_phase = c(Opening = 2, Icebreaker = 3, Engagement = 8, Exploration = 10, Closing = 2), demographics = NULL, survey_responses = NULL, conversation_flow = "desire_based", llm_config = NULL, seed = NULL, verbose = TRUE) -> list

  • Returns:
    • focus_group: FocusGroup object
    • conversation: data.frame transcript extracted from conversation_log
    • summary: character (overall summary)
    • basic_stats: from FocusGroup$analyze()
    • participants: list with id, moderator flag, persona, and input data

fg_quick(topic, participants = 6, flow = c("desire_based", "round_robin", "probabilistic"), model_config = NULL, seed = NULL, mode = c("quick", "pro"), verbose = TRUE) -> list

  • Returns: transcript tibble, summary character, participants list, totals (tokens and turns), config_meta, and the focus_group.

fg_analyze_quick(res) -> list

  • Input: fg_quick() result or a FocusGroup object.
  • Returns: basic_stats and short_summary (summary level 3).

analyze_focus_group(focus_group_result, num_topics = 5, include_plots = TRUE, sentiment_method = "afinn") -> list

  • Input: a FocusGroup object or the result of run_focus_group().
  • Returns: list { basic_stats, topics, sentiment=NULL, tfidf, readability, themes, plots? }.

Agent Creation and Survey Integration

create_diverse_agents(n_participants, demographics = NULL, survey_responses = NULL, llm_config = NULL) -> list of FGAgent

  • Creates n_participants participants plus one moderator (MOD).
  • If demographics/survey_responses are missing, uses generators below.
  • Each participant’s persona is built via generate_persona().

create_agents_from_survey(n_participants, survey_path, demographic_vars = NULL, survey_vars = NULL, llm_config = NULL) -> list of FGAgent

  • Loads haven-labeled survey files (.dta, .sav, .sas7bdat) and converts value labels to characters.
  • Auto-detects common demographics and survey variables (with ANES 2024-specific mappings).
  • Samples rows to create agent inputs and delegates to create_diverse_agents().

generate_diverse_demographics(n) -> data.frame (internal)

  • Columns: age, gender, education, income, location.

generate_survey_responses(n) -> data.frame (internal)

  • Example Likert-like variables: tech_usage_comfort, social_media_frequency, privacy_concern_level, environmental_concern.

generate_persona(demographics, survey_responses = NULL) -> character (internal)

  • Builds a succinct persona paragraph from demographics and optional survey attributes; includes cautious handling of values and simple heuristics for traits.

Prompts and Formatting

get_default_prompt_templates() -> named list (exported)

  • Keys (non-exhaustive):
    • Participant: participant_utterance_subtle_persona, participant_desire_to_talk_nuanced.
    • Moderator: moderator_opening, moderator_icebreaker_question, moderator_engagement_question, moderator_exploration_question, moderator_probing_focused, moderator_summarizing, moderator_transition, moderator_manage_participation, moderator_ending_question, moderator_closing, moderator_generic_utterance.
    • Helpers: suggest_questions_prompt, generate_persona_prompt, thematic_analysis_prompt, sentiment_analysis_prompt.

format_demographics(demographics) -> character (exported)

  • Filters empty values and returns “key: value; key: value; …” or an informative fallback string.
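
A base-R sketch of the described behavior: drop empty values, then join the rest as "key: value" pairs. The emptiness rules and the fallback string are assumptions, and format_demographics_sketch() is a stand-in rather than the exported function.

```r
# Format a named list of demographics as "key: value; key: value".
format_demographics_sketch <- function(demographics) {
  is_empty <- vapply(demographics, function(v) {
    is.null(v) || length(v) == 0 || all(is.na(v)) || !any(nzchar(as.character(v)))
  }, logical(1))
  d <- demographics[!is_empty]
  # Hypothetical fallback text; the package's wording may differ.
  if (length(d) == 0) return("No demographic information provided.")
  values <- vapply(d, function(v) as.character(v)[1], character(1))
  paste(paste0(names(d), ": ", values), collapse = "; ")
}

demo_text <- format_demographics_sketch(
  list(age = 34, gender = "", location = "Ohio", income = NA)
)
empty_text <- format_demographics_sketch(list(gender = ""))
```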

format_survey_responses(survey_responses) -> character (exported)

  • Produces a readable multi-line block from a named list of question-answer pairs.

format_conversation_history(conversation_log, n_recent = 7, include_summary = NULL) -> character (exported)

  • Formats recent turns (optionally prefixed with a summary) for prompts.
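
The windowing idea can be sketched as follows. The output layout ("speaker: text" lines and a "Summary so far:" prefix) is an assumption made for illustration; the exported function may format its output differently.

```r
# Keep the last `n_recent` log entries and optionally prepend a summary.
format_history_sketch <- function(conversation_log, n_recent = 7,
                                  include_summary = NULL) {
  recent <- utils::tail(conversation_log, n_recent)
  lines <- vapply(recent, function(e) paste0(e$speaker_id, ": ", e$text),
                  character(1))
  if (!is.null(include_summary)) {
    lines <- c(paste0("Summary so far: ", include_summary), lines)
  }
  paste(lines, collapse = "\n")
}

log <- list(
  list(speaker_id = "MOD", text = "What do you think?"),
  list(speaker_id = "P1", text = "I like it."),
  list(speaker_id = "P2", text = "I am not sure.")
)
history <- format_history_sketch(log, n_recent = 2,
                                 include_summary = "Moderator opened the topic.")
```

With n_recent = 2 the moderator's opening line falls outside the window, which is the case the summary prefix is meant to cover.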

make_prompt_history(log, n_recent = 5, include_summary = NULL) -> character (internal)

  • Thin wrapper over format_conversation_history used throughout the codebase.

Placeholder Utilities

replace_placeholders(template_string, values_list) -> character (exported)

  • Replaces all {{key}} placeholders found in the template with given values; unknown placeholders are removed.

replace_placeholders_known(template_string, values_list) -> character (internal)

  • Only replaces placeholders present in values_list, preserving other {{...}} tokens for later resolution.
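
The difference between the two utilities can be shown with a short base-R sketch. The two functions below are stand-ins written for this example, not the package implementations; they assume placeholders use the {{key}} form and contain no nested braces.

```r
# Replace only the placeholders whose keys appear in `values`.
replace_known_sketch <- function(template, values) {
  for (key in names(values)) {
    template <- gsub(paste0("{{", key, "}}"), as.character(values[[key]]),
                     template, fixed = TRUE)
  }
  template
}

# Replace known placeholders, then strip any that remain.
replace_all_sketch <- function(template, values) {
  out <- replace_known_sketch(template, values)
  gsub("\\{\\{[^{}]*\\}\\}", "", out, perl = TRUE)
}

tpl <- "Topic: {{topic}}. Question: {{current_question}}"
vals <- list(topic = "remote work")
kept <- replace_known_sketch(tpl, vals)     # unknown placeholder preserved
stripped <- replace_all_sketch(tpl, vals)   # unknown placeholder removed
```

Preserving unknown placeholders is useful when a template is filled in stages, for example topic and purpose first and the current question later.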

Token and Scoring Utilities (Internal)

estimate_tokens(text) -> integer

  • Rough token estimate: character length / 4.

parse_score_0_10(text) -> integer in [0,10]

  • Extracts a single integer desire score from free text.
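
Both utilities are simple enough to sketch in base R. The rounding rule (ceiling), the choice of the first integer found, clamping to [0, 10], and returning NA when no number is present are all assumptions of this sketch; the internal functions may behave differently.

```r
# Rough token estimate: one token per four characters, rounded up (assumed).
estimate_tokens_sketch <- function(text) {
  as.integer(ceiling(nchar(text) / 4))
}

# Extract the first integer in free text and clamp it to [0, 10].
parse_score_sketch <- function(text) {
  m <- regmatches(text, regexpr("[0-9]+", text))
  if (length(m) == 0) return(NA_integer_)  # no number found (assumed behavior)
  as.integer(min(10L, max(0L, as.integer(m))))
}

tokens <- estimate_tokens_sketch("abcdefgh")    # 8 characters
score <- parse_score_sketch("Score: 8/10")      # first integer is 8
clamped <- parse_score_sketch("I would say 12") # clamped to 10
missing <- parse_score_sketch("no opinion")
```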

Visualization

All plotting methods live on FocusGroup and return ggplot objects:

  • plot_participation_timeline()
  • plot_word_count_distribution()
  • plot_participation_by_agent()
  • plot_turn_length_timeline()

Conversation Log Schema (Detailed)

Each log entry is created via a private logger and includes:

  • turn: integer (auto-assigned sequentially; system messages may use the current turn context)
  • speaker_id: character (agent ID or “System”)
  • is_moderator: logical (inferred for moderator ID if not provided)
  • text: character (utterance or system content)
  • timestamp: POSIXct (time of logging)
  • phase: character (script phase or a special marker such as setup; the final summary is stored in FocusGroup$final_summary rather than logged)
  • response_id: character (LLMR response ID when available)
  • finish_reason: character (e.g., stop, length)
  • sent_tokens, rec_tokens, total_tokens: integers (LLMR token accounting when available)
  • duration_s: numeric (provider-reported duration)
  • provider, model: character

Error Handling and Edge Cases

  • Strict input validation on constructors and wrappers (e.g., non-empty topic, presence of moderator, correct flow types).
  • Missing prompts fall back to a generic moderator template.
  • Empty logs yield informative messages and guarded behaviors in analysis and plotting.
  • Topic modeling, TF-IDF, readability gracefully return NULL/empty tibbles with warnings if data is insufficient.

Extensibility Guidelines

  • New Conversation Flow
    • Subclass ConversationFlow, implement select_next_speaker() and (optionally) override update_state_post_selection().
    • Add a case to create_conversation_flow().
  • Custom Prompts
    • Start from get_default_prompt_templates(), change the entries you need, and pass them as prompt_templates when constructing the FocusGroup; overrides are merged with the defaults.
  • New Analyses
    • Add methods on FocusGroup using private$get_filtered_log() to reuse consistent filtering and transcript assembly.
  • Alternative LLM Providers
    • Configure each agent's LLMR::llm_config accordingly. Group-level tasks use llm_config_admin if set; otherwise they fall back to the moderator’s config.

Directory Structure

  • R/FGAgent.R: agent class
  • R/FocusGroup.R: group orchestration, analysis, plots
  • R/ConversationFlow.R: base and flow implementations; create_conversation_flow()
  • R/convenience_wrappers.R: wrappers, analysis entry points, data generators, survey integration
  • R/prompts.R: default prompt templates
  • R/utils.R: formatting, placeholders, prompt history and token utilities
  • R/zzz.R: package docs and .onLoad options

Exported Symbols (Public API)

  • Classes: FGAgent, FocusGroup, ConversationFlow, RoundRobinFlow, ProbabilisticFlow, DesireBasedFlow
  • Functions: run_focus_group, fg_quick, fg_analyze_quick, analyze_focus_group, create_diverse_agents, create_agents_from_survey, create_conversation_flow, get_default_prompt_templates, format_demographics, format_survey_responses, format_conversation_history, replace_placeholders

Internals documented but not exported: generate_diverse_demographics, generate_survey_responses, generate_persona, estimate_tokens, parse_score_0_10, make_prompt_history, replace_placeholders_known.

Known Limitations and Notes

  • Desire-based flow can be costlier (more LLM calls) than round-robin; choose based on budget/needs.
  • Summarization uses LLM and contributes to token usage; thresholds are conservative defaults.
  • Topic modeling requires sufficient term diversity; small or very short transcripts may not yield meaningful results.
  • Sentiment analysis helpers are present in prompts, but sentiment computation is not currently part of the exported analysis pipeline.

Quick Usage Patterns

  • Rapid run:
    • fg_quick(topic = "...", participants = 4)
  • Full run and analyze:
    • res <- run_focus_group(topic = "...", participants = 6)
    • analysis <- analyze_focus_group(res, num_topics = 5, include_plots = TRUE)

Change Impact Matrix (High-Level)

  • Prompt changes: affect agent outputs; no structural changes.
  • Flow changes: affect speaker selection; ensure selection invariants (participants vs. moderator) align with phase logic.
  • Analysis additions: safe; prefer using private$get_filtered_log() and return tibbles/lists consistently.
  • LLM config changes: affect cost/latency; ensure token caps are respected.

This blueprint should be kept in sync with Roxygen documentation and NAMESPACE. Update when classes, flows, prompts, or analysis capabilities change.