
ESRA 2023 Glance Program


All time references are in CEST

LLM-generated responses (synthetic respondents) and social science research

Session Organisers: Dr Yongwei Yang (Google)
Mr Joseph M. Paxton (Google)
Dr Mario Callegaro (Callegaro Research Methods Consulting)
Time: Tuesday 18 July, 09:00 - 10:30
Room

Generative AI developments have led to great interest in replacing human survey responses with LLM-generated ones. Between October 2022 and July 2024, we have seen at least 60 papers on this topic, often posted on preprint platforms (arXiv, SSRN, etc.). The enthusiasm stems from studies suggesting LLM responses (synthetic responses) can resemble human responses in public opinion and organization surveys, psychological experiments, and consumer studies. The excitement is amplified by the perceived potential for faster and cheaper data collection. However, concerns have also been raised:
(1) LLM-generated data often produce smaller variance.
(2) They may not represent the full spectrum of human thoughts.
(3) They may reflect stereotypes entrenched in training data.
(4) They might not recover multivariate relationships or human mental processes.

This session will advance discussions about the utility and appropriateness of LLM-generated responses in research. We invite BOTH empirical and didactic works about:
(1) Supporting and refuting arguments on the use of LLM-generated responses
(2) Good and bad use cases of LLM-generated data
(3) Methodological challenges and solutions with LLM-vs-Human comparative studies, including research design, software and model choices, data generation process, data analysis and inferences, transparency and reproducibility

We welcome diverging or even provocative viewpoints as well as those that connect with the proliferation of other data collection practices (e.g., opt-in panels, paradata, social media data). At the same time, we stress the critical value of research rigor and expect these viewpoints to be supported by sound theory and evidence. Specifically:
(1) With didactic work, we expect depth in theoretical arguments and thoroughness in literature review.
(2) With empirical work, we expect thoughtful design and clarity about implementation (e.g., models, hyperparameters). Where applicable, we expect designs to include a temporal component that addresses changes in relevant LLMs.

Keywords: synthetic response, synthetic data, generative AI, large language model, survey data collection

Papers

Do LLMs simulate human attitudes about technology products?

Dr Joseph Paxton (Google) - Presenting Author
Dr Yongwei Yang (Google)

How well do Large Language Models simulate human attitudes about technology products? We address this question by comparing results from a general population survey to language model responses matched (via prompting) on age, gender, country of residence, and self-reported frequency of product use. Preliminary results suggest that language model responses diverge from human responses—often dramatically—across a range of trust-related user attitude measures (information quality, privacy, etc.), within the context of Google Search. These divergent results are robust across model families (Gemini, GPT) and across major updates to the models made over the last several months. Based on these results—in combination with more fundamental theoretical concerns—we argue that language model responses should not be used to replace or augment human survey responses at this point in time.
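For illustration only, and not the authors' instrument or code: a minimal sketch of the persona-matched prompting described above, assuming the OpenAI Python client; the model name, persona fields, and question wording are placeholders.

    # Minimal sketch of persona-matched prompting (illustrative; question wording,
    # model name, and persona values are assumptions, not the study's materials).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    persona = {"age": 34, "gender": "woman", "country": "Germany",
               "use_frequency": "several times a day"}

    system_msg = (
        f"You are a {persona['age']}-year-old {persona['gender']} living in "
        f"{persona['country']} who uses Google Search {persona['use_frequency']}. "
        "Answer survey questions as this person would."
    )
    question = ("How much do you trust the information you find with Google Search? "
                "Reply with a single number from 1 (not at all) to 5 (completely).")

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; the study compares GPT and Gemini families
        messages=[{"role": "system", "content": system_msg},
                  {"role": "user", "content": question}],
    )
    print(response.choices[0].message.content)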


LLM-driven Bots in Web Surveys: Predicting Robotic Language in Open Narrative Answers

Mr Joshua Claassen (DZHW, Leibniz University Hannover) - Presenting Author
Professor Jan Karem Höhne (DZHW, Leibniz University Hannover)
Dr Ruben L. Bach (University of Mannheim)
Dr Anna-Carolina Haensch (Ludwig Maximilians University Munich)

Web survey data is key for social and political decision-making, including official statistics. However, respondents are frequently recruited through online access panels or social media platforms, making it difficult to verify that answers come from humans. As a consequence, bots – programs that autonomously interact with systems – may shift web survey outcomes and, in turn, social and political decisions. Bot and human answers often differ regarding word choice and lexical structure. This may allow researchers to identify bots by predicting robotic language in open narrative answers. In this study, we therefore investigate the following research question: Can we predict robotic language in open narrative answers? We conducted a web survey on equal gender partnerships, including three open narrative questions. We recruited 1,512 respondents through Facebook ads. We also programmed two LLM-driven bots that each ran through our web survey 100 times: the first bot is linked to the LLM Gemini Pro, and the second bot additionally includes a memory feature and adopts personas, such as age and gender. Using a transformer model (BERT), we attempt to predict robotic language in the open narrative answers. Each open narrative answer is labeled based on whether it was generated by our bots (robotic language = “yes”) or by the respondents recruited through Facebook ads (robotic language = “unclear”). Using this dichotomous label as ground truth, we will train a series of prediction models relying on the BERT language model. We will present various performance metrics to evaluate how accurately we can predict robotic language, and thereby identify bots in our web survey. Our study contributes to the ongoing discussion on bot activities in web surveys. Specifically, it will extend the methodological toolkit of social research when it comes to identifying bots in web surveys.
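As a rough illustration of the prediction step (not the authors' code): the sketch below fine-tunes a BERT classifier to separate bot-generated from human answers; the checkpoint and the two toy answers are assumptions.

    # Illustrative sketch: fine-tune a BERT classifier to flag robotic language.
    # The checkpoint and the toy answers are placeholders, not study materials.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    texts = ["We share household chores and decisions equally.",           # human (placeholder)
             "As an AI language model, I believe equality is important."]  # bot (placeholder)
    labels = [0, 1]  # 0 = recruited respondent, 1 = LLM-driven bot

    checkpoint = "bert-base-german-cased"  # assumption: German-language survey answers
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    ds = Dataset.from_dict({"text": texts, "label": labels})
    ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length", max_length=128),
                batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bot_detector", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=ds,
    )
    trainer.train()  # in practice: train/validation split plus metrics such as precision/recall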


Are Large Language Models Chameleons? An Attempt to Simulate Social Surveys

Mr Mingmeng Geng (SISSA) - Presenting Author
Dr Sihong He (UT Arlington)
Professor Roberto Trotta (International School for Advanced Studies (SISSA))

Can large language models (LLMs) simulate social surveys? To answer this question, we conducted millions of simulations in which LLMs were asked to answer subjective questions. Multiple LLMs were utilized in this project, including GPT-3.5, GPT-4o, LLaMA-2, LLaMA-3, Mistral, and DeepSeek-V2. A comparison of different LLM responses with the European Social Survey (ESS) data suggests that the effect of prompts on bias and variability is fundamental, highlighting major cultural, age, and gender biases. For example, simulations for Bulgarian respondents performed worse than those for respondents from other countries. We further discussed statistical methods for measuring the difference between LLM answers and survey data and proposed a novel measure inspired by Jaccard similarity, as LLM-generated responses are likely to have a smaller variance.

Although LLMs show the potential to perform simulations of social surveys or to replace human participants in limited settings, more advanced LLMs do not necessarily produce simulation results that are more similar to survey data. As with many LLM applications, prompt details, such as the order of response options, matter for our simulations. The instability associated with prompts is a limitation of simulating social surveys with LLMs. Prompts that include more personal information are likely to yield better simulation results. Our experiments reveal that it is important to analyze the robustness and variability of prompts before using LLMs to simulate social surveys, as their imitation abilities are approximate at best.
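As a point of reference only: the paper's measure is a novel variant that is not reproduced here, but the underlying Jaccard idea (comparing the set of answer categories an LLM produces with the set observed in the ESS for the same item) can be sketched as follows.

    # Plain Jaccard similarity between the answer categories an LLM produces and
    # those observed in ESS data for the same item (illustrative only; the paper
    # proposes a novel variant whose exact definition is not given here).
    def jaccard(llm_answers, ess_answers):
        a, b = set(llm_answers), set(ess_answers)
        return len(a & b) / len(a | b) if (a | b) else 1.0

    # LLM simulations often collapse onto a few categories (smaller variance),
    # while human answers tend to cover more of the scale.
    print(jaccard([3, 3, 4, 3], [1, 2, 3, 4, 5]))  # -> 0.4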


Using Large Language Models to Predict Subjective Life Expectancy in the Context of Cross-Cultural Surveys: Insights into Cognitive Biases and Response Quality

Mr Mao Li (University of Michigan)
Miss Xinyi Chen (University of Michigan) - Presenting Author
Mr Zeyu Lou (University of Michigan)
Mr Kaidar Nurumov (University of Michigan)
Miss Stephanie Morales (University of Michigan)
Professor Sunghee Lee (University of Michigan)

The subjective life expectancy (SLE) question, which asks respondents to rate their probability of living beyond certain ages on a 0 to 100 scale, provides valuable data for predicting mortality-related behaviors. However, interpreting why respondents provide specific answers on SLE remains challenging. Probing questions, such as "Why do you choose ABC response?", are commonly used to capture respondents' reasoning processes. To deepen our understanding of these responses, this study leverages Large Language Models (LLMs) to explore two key objectives: first, to evaluate answer quality by categorizing probing responses as supportive, contradictory, or unrelated to the SLE rating, aiming to uncover potential cognitive biases that may lead respondents to over- or underestimate their lifespan; and second, to predict SLE ratings based solely on probing responses and assess the accuracy of these predictions.
This study contributes to the broader discourse on the utility of LLM-generated responses (synthetic respondents) in social science research. Specifically, we evaluate the potential of LLMs to approximate human reasoning by comparing their performance in predicting SLE ratings with actual human responses from a web panel survey (N=1,793) and the Survey of Consumer Attitudes (N=506) across three languages (English, Spanish, German). The model’s predictions differed from respondents’ ratings by an average of 15%, suggesting reasonable alignment. However, misalignment between LLM-synthetic responses and SLE ratings illuminated heuristic or emotional influences on lifespan estimation, offering insights into the limits of synthetic responses in replicating complex human cognition. Further, socio-demographic patterns within these misaligned cases provide a window into how different groups approach probing questions, illustrating the potential and limitations of synthetic respondents in reflecting human diversity. This research underscores the importance of methodological rigor in designing and interpreting LLM-human comparative studies.


Synthetic Respondents, Real Bias? Investigating AI-Generated Survey Responses

Ms Charlotte Pauline Müller (Lund University) - Presenting Author
Dr Bella Struminskaya (Utrecht University)
Dr Peter Lugtig (Utrecht University)

The idea of simulating survey respondents has lately been seen as a promising data collection tool in academia and market research. Previous research has shown that LLMs are likely to reproduce human biases and stereotypes present in their training data. Against this background, we further investigate the potential benefits and challenges of creating synthetic response datasets by pursuing two major aims: 1. investigate whether AI tools can replace real survey respondents, and if so, for which questions and topics; and 2. explore whether intentional prompts reveal underlying biases in AI predictions.
We compare existing survey data from the German General Social Survey (Allbus) 2021 to AI-generated synthetic data produced with the OpenAI model GPT-4. We took a random sample of 100 respondents from the Allbus dataset and created a so-called AI agent for each, calibrated on general instructions and individual background information. We chose to predict items in numerical, binary, and open-text/string formats, each carrying the potential to provoke certain biases, e.g., social desirability, gender, and age stereotypes. Furthermore, each item was tested across different contextual factors, such as AI model calibration and language settings. We found a deep lack of accuracy in the simulation of survey data for both numerical (r = -0.07, p = 0.6) and binary outcomes (χ² (1) = 0.61, p = 0.43, V = 0.1), while the explanatory power of the background variables for the predicted outcome was high for both the former (0.4) and the latter (0.25). We found no difference in prediction accuracy between different input languages and AI model calibrations. When predicting open-text answers, the AI generally took individual information into account well. However, several potential biases became apparent, e.g., age, gender, and regional biases.
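For the reported accuracy checks, here is a minimal sketch of how such comparisons are commonly computed (Pearson correlation for numerical items, chi-square and Cramér's V for binary items); the data are placeholders and this is not the authors' analysis code.

    # Illustrative accuracy checks for synthetic vs. observed answers (placeholder data).
    import numpy as np
    from scipy.stats import chi2_contingency, pearsonr

    # Numerical item: correlation between observed and GPT-4-predicted values.
    observed = np.array([2, 5, 3, 4, 1, 5, 2, 3])
    predicted = np.array([3, 3, 4, 3, 3, 4, 3, 3])
    r, p = pearsonr(observed, predicted)

    # Binary item: chi-square test and Cramér's V on the observed-vs-predicted crosstab.
    table = np.array([[20, 30],   # observed = 0: predicted 0 / 1
                      [25, 25]])  # observed = 1: predicted 0 / 1
    chi2, p_chi, dof, _ = chi2_contingency(table)
    n = table.sum()
    cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

    print(f"r = {r:.2f} (p = {p:.2f}); chi2({dof}) = {chi2:.2f}, V = {cramers_v:.2f}")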


Evaluating Methods for Producing LLM Answers to Closed-Ended Survey Questions

Mr Georg Ahnert (University of Mannheim) - Presenting Author

A growing body of research connects survey instruments and large language models (LLMs), either to simulate human survey responses (Argyle et al., 2023; inter alia), or to measure attitudes, opinions, and values embedded in LLMs. Like humans, LLMs are not designed to respond with closed-ended response options, but to produce open-ended text instead. Researchers therefore have to add additional instructions to the prompt or extract answers from an LLM's first-token probabilities, both of which have been shown to return survey answers that indicate vastly different tendencies than an LLM's open-ended responses. A standard method for producing answers from LLMs to survey questions on attitudes, opinions, and values has yet to be identified.

We aim to benchmark established answer production methods and to investigate novel methods based on recent advances in general LLM text generation, e.g., chain-of-thought prompting or structured outputs. We propose two benchmarking approaches: First, we establish ground-truth differences in LLMs by gradually fine-tuning them on text data labeled with expressed emotions or self-reported personality traits. For each answer production method, we measure how well the LLM's survey response tracks the increasing fine-tuning. Second, we deploy a ground-truth-less evaluation of LLMs using answer-pooling and multitrait-multimethod approaches to quantify the variance that is inherent to each answer production method.

Our early results indicate differences in the robustness of answer production methods to continued fine-tuning on labeled data. We also find that methods that aggregate the probabilities of all tokens in an answer option improve upon first-token-based methods, but pose additional challenges in handling conditional token probabilities. With our benchmark, we want to better understand the benefits and drawbacks of the many methods for producing LLM answers to survey questions.
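To make the contrast concrete, here is a hedged sketch (assuming a HuggingFace causal LM such as gpt2 and an invented survey item) of first-token scoring versus scoring that aggregates the conditional log-probabilities of every token in an answer option.

    # Illustrative contrast between first-token and full-option scoring
    # (assumptions: a HuggingFace causal LM and an invented survey item).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt = "How interested are you in politics? Answer:"
    options = [" Very interested", " Somewhat interested", " Not at all interested"]

    def option_logprobs(prompt, option):
        """Per-token conditional log-probabilities of the option given the prompt."""
        prompt_ids = tok(prompt, return_tensors="pt").input_ids
        option_ids = tok(option, return_tensors="pt").input_ids
        full = torch.cat([prompt_ids, option_ids], dim=1)
        with torch.no_grad():
            logprobs = torch.log_softmax(model(full).logits, dim=-1)
        start = prompt_ids.shape[1]
        # Token at position start + i is predicted by the logits at position start + i - 1.
        return [logprobs[0, start + i - 1, option_ids[0, i]].item()
                for i in range(option_ids.shape[1])]

    per_token = {o: option_logprobs(prompt, o) for o in options}
    first_token = {o: lp[0] for o, lp in per_token.items()}    # first-token method
    full_option = {o: sum(lp) for o, lp in per_token.items()}  # aggregated method
    # Note: summing favors shorter options; length normalization is one common fix.
    print(max(first_token, key=first_token.get), max(full_option, key=full_option.get))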