
ESRA 2023 Glance Program


All time references are in CEST

Synthetic Data Generation and Imputation with LLMs

Session Organisers: Dr Anna-Carolina Haensch (LMU Munich), Professor Frauke Kreuter (LMU Munich)
Time: Tuesday 18 July, 09:00 - 10:30
Room:

The use of large language models (LLMs) for generating synthetic data and performing data imputation has become increasingly prominent. The resulting data has been used for a variety of applications, from training machine learning models to filling gaps in incomplete datasets. Generating synthetic data with LLMs usually involves simple prompting techniques, often using so-called personas, but newer approaches allow for more sophisticated methods, such as fine-tuning LLMs for specific tasks like synthesis and imputation. This session aims to bring together researchers and practitioners from fields such as data science, NLP, and computer science to explore these advancements. We will discuss how to evaluate the quality of synthetic data and examine the effectiveness of various methods for generating and using it. We encourage submissions covering topics such as:

- Evaluation techniques and frameworks for synthetic data quality
- Advances in imputation using LLMs
- Fine-tuning LLMs for specific data generation tasks
- Case studies demonstrating the application of LLM synthetic data in research or industry, especially for hard-to-reach populations
- Methods for generating synthetic data with large language models
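
As a rough illustration of the persona-based prompting mentioned in the session description, the sketch below builds a prompt from a respondent profile and a survey item. The profile fields, prompt wording, and the `query_llm` placeholder are illustrative assumptions, not the method of any particular paper in this session.

```python
# Minimal sketch of persona-based prompting for synthetic survey responses.
# query_llm is a hypothetical stand-in for whatever LLM API or local model a
# researcher actually uses; the persona fields and question are illustrative.

def build_persona_prompt(persona: dict, question: str, options: list[str]) -> str:
    """Turn a respondent profile plus a survey item into a single prompt."""
    profile = ", ".join(f"{k}: {v}" for k, v in persona.items())
    return (
        f"You are a survey respondent with the following profile: {profile}.\n"
        f"Question: {question}\n"
        f"Answer with exactly one of: {', '.join(options)}."
    )

def query_llm(prompt: str) -> str:
    """Placeholder: replace with a real API call or local model inference."""
    raise NotImplementedError

if __name__ == "__main__":
    persona = {
        "age": 42,
        "education": "vocational degree",
        "region": "rural area",
        "political interest": "moderate",
    }
    prompt = build_persona_prompt(
        persona,
        question="How satisfied are you with the way democracy works in your country?",
        options=["very satisfied", "fairly satisfied",
                 "not very satisfied", "not at all satisfied"],
    )
    print(prompt)                  # inspect the generated prompt
    # answer = query_llm(prompt)   # one synthetic response per persona
```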

Keywords: LLMs, imputation, synthetic data

Papers

United in Diversity? Contextual Biases in LLM-Based Predictions of the 2024 European Parliament Elections

Miss Leah von der Heyde (LMU Munich, Munich Center for Machine Learning) - Presenting Author
Dr Anna-Carolina Haensch (LMU Munich, University of Maryland)
Dr Alexander Wenz (University of Mannheim, Mannheim Centre for European Social Research)
Mr Bolei Ma (LMU Munich, Munich Center for Machine Learning)

It has been proposed that “synthetic samples” based on large language models (LLMs) could serve as efficient alternatives to surveys of humans, given that LLM outputs are based on training data that includes information on human attitudes and behaviour. However, LLM-synthetic samples might exhibit bias, for example due to training data and fine-tuning processes being unrepresentative of diverse contexts. Such biases risk reinforcing existing biases in research, policymaking, and society. Therefore, researchers need to investigate whether and under which conditions LLM-generated synthetic samples can be used for public opinion prediction. In this study, we examine to what extent LLM-based predictions of individual public opinion exhibit context-dependent biases by predicting the results of the 2024 European Parliament elections. Prompting three LLMs with individual-level background information on 26,000 eligible European voters, we ask the LLMs to predict each person’s voting behaviour. Comparing these predictions to the actual results, we show that LLM-based predictions of future voting behaviour largely fail, that their accuracy is unequally distributed across national and linguistic contexts, and that they require detailed attitudinal information. The findings emphasise the limited applicability of LLM-synthetic samples to public opinion prediction. In investigating their contextual biases, this research contributes to the understanding and mitigation of inequalities in the development of LLMs and their applications in computational social science.
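
To make the evaluation design concrete, a hedged sketch of the kind of country-level comparison described above is shown below; the column names, parties, and accuracy metric are illustrative assumptions, not the authors' code or data.

```python
# Sketch: compare LLM-predicted vote choices with reported/actual choices,
# split by national context to surface unevenly distributed accuracy.
# The toy data and column names are purely illustrative.

import pandas as pd

df = pd.DataFrame({
    "country":   ["DE", "DE", "FR", "FR", "PL", "PL"],
    "true_vote": ["CDU", "SPD", "RN", "LFI", "PiS", "KO"],
    "pred_vote": ["CDU", "CDU", "RN", "RN",  "KO",  "KO"],
})

# Individual-level prediction accuracy per national context.
accuracy_by_country = (
    (df["true_vote"] == df["pred_vote"])
    .groupby(df["country"])
    .mean()
    .rename("accuracy")
)
print(accuracy_by_country)
```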


Imputing Missing Survey Data with Retrieval-Augmented Large Language Models

Mr Tobias Holtdirk (GESIS – Leibniz Institute for the Social Sciences) - Presenting Author
Mr Georg Ahnert (University of Mannheim)
Dr Anna-Carolina Haensch (LMU Munich)

Many recent studies have explored using large language models (LLMs) to synthetically generate survey data through “silicon sampling” (Argyle et al., 2023; inter alia). However, in many situations survey responses are not missing entirely: answers do not have to be predicted “zero-shot”, and answers from other participants as well as the information provided by the individual can be used to impute missing data. Statistical methods take this auxiliary information into account, but they are less effective when data is missing not at random, and they cannot directly handle open-ended responses.

We propose to impute missing values in survey data by prompting LLMs in an extended, “few-shot” setting, where the LLM can learn from auxiliary examples before predicting a target individual’s survey response. Using a retrieval-augmented generation (RAG) approach, we select these auxiliary examples based on their text-embedding similarity to the target individual. We test various selection strategies, including selecting the most or least similar examples, as well as stratified or random selection. We benchmark our approach on multiple datasets, including data previously used in “silicon sampling” studies, for instance the ANES 2016.
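
A minimal sketch of this retrieval-then-prompt idea follows, assuming a hypothetical `embed()` helper that maps a respondent's profile text to a vector (for example, from a sentence-embedding model). It illustrates the general approach of selecting similar respondents as in-context examples, not the authors' implementation or planned package.

```python
# Sketch of retrieval-augmented few-shot imputation: retrieve the donors most
# similar to the target respondent (by embedding cosine similarity) and use
# their observed answers as in-context examples for an LLM.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: replace with a real text-embedding model."""
    raise NotImplementedError

def most_similar(target_text: str, donor_texts: list[str], k: int = 5) -> np.ndarray:
    """Indices of the k donors whose embeddings are closest to the target (cosine)."""
    t = embed(target_text)
    d = np.stack([embed(x) for x in donor_texts])
    sims = d @ t / (np.linalg.norm(d, axis=1) * np.linalg.norm(t))
    return np.argsort(sims)[::-1][:k]

def build_fewshot_prompt(target_text: str, donors: list[tuple[str, str]], question: str) -> str:
    """donors: (profile_text, observed_answer) pairs used as in-context examples."""
    shots = "\n\n".join(
        f"Respondent: {profile}\nQ: {question}\nA: {answer}"
        for profile, answer in donors
    )
    return f"{shots}\n\nRespondent: {target_text}\nQ: {question}\nA:"

# Schematic usage: retrieve donors with observed answers, build the prompt,
# and let an LLM complete the missing answer for the target respondent.
# idx = most_similar(target_profile, [p for p, _ in donor_pool], k=5)
# prompt = build_fewshot_prompt(target_profile, [donor_pool[i] for i in idx], question)
# imputed_answer = query_llm(prompt)   # hypothetical LLM call
```

The other selection strategies mentioned in the abstract (least similar, stratified, or random selection) would only change how the donor indices are chosen in this sketch.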

Our early results indicate that our approach consistently outperforms kNN classifiers (retrieval-only) and zero-shot LLMs (generation-only), even when data is missing completely at random. We also outperform logistic regressions when data is missing not at random. In our “few-shot” setting, LLMs small enough to run on consumer hardware (Llama 3.2 3B) can match the performance of much larger models (Llama 3.3 70B).

Retrieving examples for “few-shot” prompting is a promising extension to previous “silicon sampling” methods. We aim to create a Python package that allows researchers to easily use our RAG system on consumer hardware with their own survey data.