All time references are in CEST
Harnessing AI and machine learning techniques for survey data collection
Session Organisers | Professor Gabriele Durrant (University of Southampton); Professor David Bann (University College London); Dr Liam Wright (University College London)
Time | Tuesday 18 July, 09:00 - 10:30
Room |
Recent innovations in Artificial Intelligence and Machine Learning (hereafter ‘AI’) are projected to transform society and influence how we conduct research and, more specifically, how we perform survey data collection and analyses. This session will focus on recent developments in the use of AI for survey data collection. The use of such techniques may have significant impacts on the quality of the resulting data and the speed of production and may open up new survey design opportunities.
For this session, topic areas of interest include, but are not limited to: the use of Large Language Models (LLMs) for survey research, data collection and data usage; the use of LLMs for questionnaire design and optimisation, and for tailoring questions to respondents based on profiles or previous answers; the use of AI for survey cost efficiencies, error detection, evaluation and testing, variable coding, analysis of open-ended text data, cognitive testing in surveys, and improvement of measurement quality; the role of LLMs as (qualitative) interviewers; and the use of AI-driven chatbots. Quality implications and challenges arising from the use of AI will also be discussed.
At present, AI techniques to handle and analyse existing and new forms of data are being developed at speed, but skills development from such research is limited, not easily accessible, and not always targeted at those most in need. The session will discuss these opportunities and challenges and will feature the latest developments from a research, training and capacity building project in the UK, run as part of the National Centre for Research Methods (NCRM). The project will drive forward research under three AI-related themes and will deliver an innovative Training and Capacity Building (TCB) programme to ensure survey researchers stay at the forefront of digital skills concerning AI.
Keywords: AI, machine learning, large language models, survey data collection, questionnaire design
Dr Georg Wittenburg (Inspirient) - Presenting Author
Dr Josef Hartmann (Verian)
We explore the application of Generative AI models, particularly Large Language Models (LLMs), for survey research, emphasizing their potential to quantify qualitative information from open-ended responses. In the field of Generative AI, textual data, including Internet content, can be conceptualized as a vast set of implicit question-answer pairs. These pairs, when utilized as training data, are compressed within an LLM’s weight parameters, enabling the model to probabilistically return context-appropriate responses to a given prompt. Repeated prompting, especially when constrained to numeric answers (“On a scale from 1 to 10, tell me…?”), yields a sample that can be statistically analysed to infer information within the model’s embedded data.
Two examples are provided to illustrate this method. First, we prompt an LLM to assess hypothetical income levels associated with various job titles, generating statistically interpretable estimates of income distribution. Second, we prompt the model to estimate the likelihood of voting for a political candidate based on verbatim feedback, providing an output scale from 1 to 10 indicating a propensity to vote. This repeated sampling approach allows quantitative insights to be derived from qualitative information through multiple iterations, transforming open-text responses into data suitable for statistical analysis, while at the same time compensating for the well-documented propensity of LLMs to “hallucinate” answers in the absence of actual information.
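The repeated-prompting idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ask_model`, `fake_model`, `parse_score`, `sample_scores` and `summarise` are hypothetical, the feedback text is invented for illustration, and `fake_model` is a random stand-in where a real LLM call would go. The sketch only shows the general pattern of asking the same numerically constrained question many times, discarding unusable replies, and summarising the resulting sample.

```python
import random
import re
import statistics
from typing import Callable, Dict, List, Optional


def parse_score(reply: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Return the first integer in the reply if it lies in [low, high], else None.

    Deliberately simplistic: a reply that repeats the scale ("from 1 to 10 ...")
    would be misread, so a real prompt should ask for a single number only.
    """
    match = re.search(r"-?\d+", reply)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


def sample_scores(ask_model: Callable[[str], str], prompt: str,
                  n_samples: int = 50) -> List[int]:
    """Send the same prompt n_samples times and keep only the valid scores."""
    scores = []
    for _ in range(n_samples):
        score = parse_score(ask_model(prompt))
        if score is not None:
            scores.append(score)
    return scores


def summarise(scores: List[int]) -> Dict[str, float]:
    """Summarise the sample: size, mean, standard deviation, standard error."""
    if not scores:
        raise ValueError("no valid scores to summarise")
    sd = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "sd": sd,
        "se": sd / len(scores) ** 0.5,
    }


# Hypothetical stand-in for an LLM: returns canned replies at random.
_rng = random.Random(0)


def fake_model(prompt: str) -> str:
    return _rng.choice(["7", "I would say 8.", "6", "Hard to tell.", "9"])


if __name__ == "__main__":
    # Invented example feedback, for illustration only.
    prompt = (
        "On a scale from 1 to 10, how likely is the author of the following "
        "feedback to vote for the candidate? Answer with a single number.\n"
        "Feedback: 'She listened to our concerns about local transport.'"
    )
    print(summarise(sample_scores(fake_model, prompt, n_samples=100)))
```

Replacing `fake_model` with a function that queries an actual model would leave the rest of the sketch unchanged. The share of discarded replies is worth tracking alongside the summary statistics, since it indicates how often the model failed to give a usable numeric answer.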
The method is valuable due to its flexibility in setting evaluative criteria (e.g., income, political alignment) and input data types (e.g., job roles). Furthermore, this approach enables low-cost, coding-free statistical analysis of open-text data. Applied to real-world samples, this method could supplement survey data, thus addressing item-nonresponse issues. This research presents a novel approach to integrating LLMs into survey research, expanding the potential for statistical analysis of open-ended qualitative data.