ESRA 2025 Preliminary Program
All time references are in CEST
LLM-generated responses (synthetic respondents) and social science research
Session Organisers
Dr Yongwei Yang (Google), Mr Joseph M. Paxton (Google), Dr Mario Callegaro (Callegaro Research Methods Consulting)
Time: Wednesday 16 July, 11:00 - 12:30
Room: Ruppert 002
Generative AI developments have led to great interest in replacing human survey responses with LLM-generated ones. Between October 2022 and July 2024, we have seen at least 60 papers on this topic, often posted on preprint platforms (arXiv, SSRN, etc.). The enthusiasm stems from studies suggesting LLM responses (synthetic responses) can resemble human responses in public opinion and organizational surveys, psychological experiments, and consumer studies. The excitement is amplified by the perceived potential for faster and cheaper data collection. However, concerns have also been raised:
(1) LLM-generated data often exhibit smaller variance than human responses.
(2) They may not represent the full spectrum of human thoughts.
(3) They may reflect stereotypes entrenched in training data.
(4) They might not recover multivariate relationships or human mental processes.
This session will advance discussions about the utility and appropriateness of LLM-generated responses in research. We invite BOTH empirical and didactic works on:
(1) Supporting and refuting arguments on the use of LLM-generated responses
(2) Good and bad use cases of LLM-generated data
(3) Methodological challenges and solutions with LLM-vs-Human comparative studies, including research design, software and model choices, data generation process, data analysis and inferences, transparency and reproducibility
We welcome diverging or even provocative viewpoints as well as those that connect with the proliferation of other data collection practices (e.g., opt-in panels, paradata, social media data). At the same time, we stress the critical value of research rigor and expect these viewpoints to be supported by sound theory and evidence. Specifically:
(1) With didactic work, we expect depth in theoretical arguments and thoroughness in literature review.
(2) With empirical work, we expect thoughtful design and clarity about implementation (e.g., models, hyperparameters). Where applicable, we expect designs to include a temporal component that addresses changes in relevant LLMs.
Keywords: synthetic response, synthetic data, generative AI, large language model, survey data collection
Papers
Do LLMs simulate human attitudes about technology products?
Dr Joseph Paxton (Google) - Presenting Author
Dr Yongwei Yang (Google)
How well do Large Language Models simulate human attitudes about technology products? We address this question by comparing results from a general population survey to language model responses matched (via prompting) on age, gender, country of residence, and self-reported frequency of product use. Preliminary results suggest that language model responses diverge from human responses—often dramatically—across a range of trust-related user attitude measures (information quality, privacy, etc.), within the context of Google Search. These divergent results are robust across model families (Gemini, GPT) and across the major model updates made over the last several months. Based on these results—in combination with more fundamental theoretical concerns—we argue that language model responses should not be used to replace or augment human survey responses at this point in time.
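The matched-prompting setup described in this abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the persona fields, survey item, and model choice are placeholder assumptions, and the OpenAI chat API is used only as one example of a model endpoint.

# Minimal sketch of demographic-matched prompting (illustrative only; not the
# authors' pipeline). Persona fields, item wording, and model are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

persona = {
    "age": 34,
    "gender": "female",
    "country": "Germany",
    "product_use": "uses the product several times a day",
}

system_prompt = (
    "Answer as a survey respondent with this profile: "
    f"age {persona['age']}, {persona['gender']}, living in {persona['country']}, "
    f"who {persona['product_use']}. Reply with a single number."
)

question = (
    "How much do you trust the quality of the information this product provides? "
    "Answer on a scale from 1 (not at all) to 5 (completely)."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # any chat model; this choice is an assumption
    temperature=1.0,       # sampling settings strongly affect synthetic variance
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)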
Are Large Language Models Chameleons? An Attempt to Simulate Social Surveys
Mr Mingmeng Geng (SISSA) - Presenting Author
Dr Sihong He (UT Arlington)
Professor Roberto Trotta (International School for Advanced Studies (SISSA))
Can large language models (LLMs) simulate social surveys? To answer this question, we conducted millions of simulations in which LLMs were asked to answer subjective questions. Multiple LLMs were utilized in this project, including GPT-3.5, GPT-4o, LLaMA-2, LLaMA-3, Mistral, and DeepSeek-V2. A comparison of different LLM responses with European Social Survey (ESS) data suggests that the effect of prompts on bias and variability is fundamental, highlighting major cultural, age, and gender biases. For example, simulations for Bulgarian respondents perform worse than those for respondents from other countries. We further discuss statistical methods for measuring the difference between LLM answers and survey data and propose a novel measure inspired by Jaccard similarity, since LLM-generated responses are likely to have a smaller variance.
Although LLMs show the potential to simulate social surveys or to replace human participants in limited settings, more advanced LLMs do not necessarily produce simulation results that are more similar to survey data. As with many LLM applications, prompt details, such as the order of response options, matter for our simulations. The instability associated with prompts is a limitation of simulating social surveys with LLMs. Prompts that include more personal information are likely to yield better simulation results. Our experiments reveal that it is important to analyze the robustness and variability of prompts before using LLMs to simulate social surveys, as their imitation abilities are approximate at best.
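As a plain illustration of the set-overlap logic behind a Jaccard-inspired comparison (the authors' actual measure is not specified in the abstract), the following sketch compares the set of answer categories an LLM actually uses with the set observed in survey data for one item; all data are invented.

# Illustrative only: plain Jaccard similarity between the answer categories an
# LLM uses and those observed in survey data. The authors' proposed measure is
# only *inspired by* Jaccard similarity; its exact form is not shown here.

def jaccard(a, b):
    """Jaccard similarity between two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Hypothetical answers to an 11-point ESS-style item (0-10 scale).
survey_answers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # humans use the full scale
llm_answers = [5, 6, 7, 7, 6, 5, 7, 6]                # LLM collapses to the mid-range

print(jaccard(llm_answers, survey_answers))  # low value flags reduced variability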
Synthetic Respondents, Real Bias? Investigating AI-Generated Survey Responses
Ms Charlotte Pauline Müller (Lund University) - Presenting Author
Dr Bella Struminskaya (Utrecht University)
Dr Peter Lugtig (Utrecht University)
Simulating survey respondents has lately been seen as a promising data collection approach in academia and market research. Previous research has shown that LLMs are likely to reproduce human biases and stereotypes present in their training data. We therefore investigate the potential benefits and challenges of creating synthetic response datasets, following two major aims: 1. investigate whether AI tools can replace real survey respondents, and if so, for which questions and topics, and 2. explore whether intentional prompts reveal underlying biases in AI predictions.
We compare existing survey data from the German General Social Survey (Allbus) 2021 to AI-generated synthetic data produced with the OpenAI model GPT-4. We took a random sample of 100 respondents from the Allbus dataset and created a so-called AI-Agent for each. Each AI-Agent was calibrated with general instructions and individual background information. We chose to predict items in numerical, binary, and open-text/string formats, each of which has the potential to provoke certain biases, e.g., social desirability and gender and age stereotypes. Furthermore, each item was tested across different contextual factors, such as AI model calibration and language settings. We found a marked lack of accuracy in the simulation of survey data for both numerical (r = -0.07, p = 0.6) and binary outcomes (χ²(1) = 0.61, p = 0.43, V = 0.1), while the explanatory power of the background variables for the predicted outcome was high for both the former (R² = 0.4) and the latter (R² = 0.25). We found no difference in prediction accuracy between different input languages and AI model calibrations. When predicting open-text answers, individual information was generally well considered by the AI. However, several potential biases became apparent, e.g., age, gender, and regional biases.
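A sketch of the kind of accuracy checks reported above, assuming a paired file of human and synthetic answers per respondent; the data, column layout, and model calibration details are invented and do not reproduce the authors' analysis.

# Illustrative accuracy checks for human vs. synthetic answers (not the authors'
# analysis code). Assumes paired observations per respondent; all data invented.
import numpy as np
from scipy.stats import pearsonr, chi2_contingency

rng = np.random.default_rng(0)

# Numerical item: e.g., a 1-7 attitude scale answered by humans and by GPT-4 agents.
human_num = rng.integers(1, 8, size=100)
llm_num = rng.integers(1, 8, size=100)
r, p = pearsonr(human_num, llm_num)
print(f"numerical item: r = {r:.2f}, p = {p:.2f}")

# Binary item: cross-tabulate human vs. synthetic answers, then chi-square and Cramer's V.
human_bin = rng.integers(0, 2, size=100)
llm_bin = rng.integers(0, 2, size=100)
table = np.zeros((2, 2), dtype=int)
for h, s in zip(human_bin, llm_bin):
    table[h, s] += 1
chi2, p, dof, _ = chi2_contingency(table, correction=False)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"binary item: chi2({dof}) = {chi2:.2f}, p = {p:.2f}, V = {cramers_v:.2f}")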
Evaluating Methods for Producing LLM Answers to Closed-Ended Survey Questions
Mr Georg Ahnert (University of Mannheim) - Presenting Author
A growing body of research connects survey instruments and large language models (LLMs), either to simulate human survey responses (Argyle et al., 2023; inter alia) or to measure attitudes, opinions, and values embedded in LLMs. Like humans, LLMs are not designed to respond with closed-ended response options but to produce open-ended text instead. Researchers therefore have to add additional instructions to the prompt or extract answers from an LLM's first-token probabilities, both of which have been shown to return survey answers that indicate vastly different tendencies than the LLM's open-ended responses. A standard method for producing LLM answers to survey questions on attitudes, opinions, and values has yet to be identified.
We aim to benchmark established answer production methods and to investigate novel methods based on recent advances in general LLM text generation, e.g., chain-of-thought prompting or structured outputs. We propose two benchmarking approaches: First, we establish ground-truth differences in LLMs by gradually fine-tuning them on text data labeled with expressed emotions or self-reported personality traits. For each answer production method, we measure how well the LLM's survey response tracks the increasing fine-tuning. Second, we deploy a ground-truth-less evaluation of LLMs using answer-pooling and multitrait-multimethod approaches to quantify the variance that is inherent to each answer production method.
Our early results indicate differences in the robustness of answer production methods to continued fine-tuning on labeled data. We also find that methods which aggregate the probabilities of all tokens in an answer option improve upon first-token-based methods, but pose additional challenges in handling conditional token probabilities. With our benchmark, we want to better understand the benefits and drawbacks of the many methods for producing LLM answers to survey questions.
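The contrast between first-token scoring and aggregating over all tokens of an answer option can be sketched as follows. This is a generic illustration under assumed choices (an arbitrary open model, a made-up item, and made-up options), not the benchmark's implementation.

# Generic sketch of two answer-production methods (not the benchmark code):
# (a) first-token probabilities, (b) summed log-probabilities over all tokens
# of each answer option, conditioned on the prompt. Model choice is arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Q: In general, how much do you trust other people?\nA:"
options = [" Not at all", " Somewhat", " Completely"]

with torch.no_grad():
    # (a) First-token method: compare probabilities of each option's first token.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    next_logprobs = torch.log_softmax(model(prompt_ids).logits[0, -1], dim=-1)
    first_token_scores = {
        opt: next_logprobs[tok(opt, add_special_tokens=False).input_ids[0]].item()
        for opt in options
    }

    # (b) Aggregated method: sum conditional log-probabilities of every token
    # in the option, given the prompt and the option's preceding tokens.
    agg_scores = {}
    for opt in options:
        opt_ids = tok(opt, add_special_tokens=False).input_ids
        ids = torch.cat([prompt_ids, torch.tensor([opt_ids])], dim=1)
        logprobs = torch.log_softmax(model(ids).logits[0], dim=-1)
        agg_scores[opt] = sum(
            logprobs[len(prompt_ids[0]) + i - 1, t].item()
            for i, t in enumerate(opt_ids)
        )

print("first-token pick:", max(first_token_scores, key=first_token_scores.get))
print("aggregated pick: ", max(agg_scores, key=agg_scores.get))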