All time references are in CEST
Quality Assurance in Survey Data: Frameworks, Tools, and Quality Indicators

Session Organisers | Mr Thomas Knopf (GESIS - Leibniz Institute for the Social Sciences), Mrs Fabienne Krämer (GESIS - Leibniz Institute for the Social Sciences), Dr Jessica Daikeler (GESIS - Leibniz Institute for the Social Sciences)
Time | Tuesday 18 July, 09:00 - 10:30 |
Room |
Survey data, collected through various modes such as online surveys, face-to-face interviews, and telephone surveys, is subject to a wide range of potential errors that can compromise data integrity. Addressing these challenges requires robust frameworks, advanced tools, and reliable quality indicators to manage, validate, and enhance survey data quality.
This session will focus on the key aspects of quality assurance in survey data collection and analysis, with a particular emphasis on the development and application of data quality indicators. We invite contributions that suggest and showcase quality indicators designed to maintain the integrity and usability of survey data. Key topics will include:
1. Frameworks for Quality Assurance: An overview of frameworks developed to assess the quality of survey data.
2. Tools and Platforms for Data Validation: A discussion on tools and technologies aimed at validating or improving the quality of survey data as well as platforms tailored to combine tools, such as the KODAQS toolbox.
3. Data Quality Indicators: We seek contributions that demonstrate effective use of quality indicators like response bias indicators or data consistency checks in real-world case studies, showcasing how they address and enhance data quality.
4. Didactics of Data Quality Issues: Approaches to teaching and promoting data quality assurance for survey data. This section will explore educational strategies to equip researchers and practitioners with the necessary skills to effectively tackle data quality issues.
Keywords: survey quality tools, data quality, frameworks, quality indicators, training, didactics
Dr Fiona Draxler (University of Mannheim) - Presenting Author
Data quality frameworks and reporting guidelines support researchers in identifying potential data quality concerns with their research. However, it is unclear to what extent publications actually report on data quality limitations and which aspects are rarely mentioned.
We analyze the “Limitations” sections and limitation paragraphs in “Discussion” sections of substantive survey-based research published in selected journals including Public Opinion Quarterly, the American Sociological Review, and the Annual Review of Sociology. We extract data-quality-related keywords from these paragraphs/sections and cluster them by theme, in alignment with the components of the Total Survey Error framework.
Based on this, we discuss which data quality limitations are commonly and which are rarely mentioned, and what the possible reasons for these differences may be. Through comparisons with reporting guidelines such as those of the AAPOR Transparency Initiative, we highlight areas where researchers might require additional support in choosing suitable indicators and reporting quality limitations. We also analyze areas where current guidelines might be adapted to better represent researchers’ needs in reporting. Thus, we contribute to the transparent and well-structured communication of data quality as a crucial step for validating research.
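To illustrate the keyword-tagging step described above, the following minimal Python sketch matches limitation passages against a hand-built dictionary mapping Total Survey Error components to example keywords; the dictionary entries and function names are hypothetical and do not represent the authors' actual coding scheme.

    import re
    from collections import Counter

    # Hypothetical mapping of TSE components to example keywords.
    TSE_KEYWORDS = {
        "sampling error": ["sample size", "sampling", "small sample"],
        "nonresponse error": ["nonresponse", "response rate", "attrition", "dropout"],
        "measurement error": ["social desirability", "question wording", "recall"],
        "coverage error": ["coverage", "sampling frame", "online panel"],
    }

    def tag_limitations(text: str) -> Counter:
        """Count keyword hits per TSE component in a 'Limitations' passage."""
        text = text.lower()
        counts = Counter()
        for component, keywords in TSE_KEYWORDS.items():
            hits = sum(len(re.findall(re.escape(kw), text)) for kw in keywords)
            if hits:
                counts[component] = hits
        return counts

    example = ("A limitation is the low response rate and possible "
               "social desirability bias in self-reports.")
    print(tag_limitations(example))
    # Counter({'nonresponse error': 1, 'measurement error': 1})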
Mrs Cristina Tudose (Ipsos KnowledgePanel Europe) - Presenting Author
Dr Joke Depraetere (Ipsos KnowledgePanel Europe)
Dr Femke Dekeulenaer (Ipsos KnowledgePanel Europe)
This study addresses the critical issue of data quality variations between opt-in and probability-based online panels, a topic of increasing relevance in online survey research. It analyzes such variations using four sample sources in each of Sweden, France, and the Netherlands: a probability-based sample from Ipsos’ KnowledgePanel Europe and three opt-in samples.
The study evaluates how effectively demographic structures are met and compares the prevalence of low-quality responses across the sample sources. Low-quality responses can compromise data reliability and lead to flawed conclusions. Both standard quality metrics (e.g., speeding, straight-lining) and, as an additional quality tool, specific checks for questionnaire inconsistencies are used to evaluate response quality. The analysis also incorporates an investigation of online survey behavior and answer patterns in open-ended questions, offering insights into response quality across sample sources. By exploring the interplay between survey frequency and survey professionalization, the study sheds light on potential biases and their influence on data quality, further enriching the understanding of the factors contributing to data quality variations.
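As an illustration of such checks, the Python sketch below flags speeding, straight-lining, and a simple questionnaire inconsistency; the column names, grid items, and thresholds are hypothetical and are not the metrics used in this study.

    import pandas as pd

    def flag_low_quality(df: pd.DataFrame, grid_items: list,
                         survey_year: int = 2025, min_seconds: float = 120.0) -> pd.DataFrame:
        out = df.copy()
        # Speeding: total completion time below a minimum plausible duration.
        out["speeder"] = out["duration_seconds"] < min_seconds
        # Straight-lining: no variation across the items of a grid question.
        out["straightliner"] = out[grid_items].nunique(axis=1) == 1
        # Inconsistency check: reported age conflicts with reported birth year.
        out["inconsistent"] = (survey_year - out["birth_year"] - out["age"]).abs() > 1
        out["low_quality"] = out[["speeder", "straightliner", "inconsistent"]].any(axis=1)
        return out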
This research provides insight into the comparative quality of opt-in and probability-based samples and the variations of quality within opt-in samples, informing researchers and practitioners on appropriate survey methodologies for European contexts. The study's findings contribute to understanding the strengths and limitations of different online sampling approaches, ultimately enhancing data quality and research reliability.
Dr Ranjit K. Singh (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
It is challenging to assess the quality of single questions for latent constructs, such as values, opinions, or interests. Psychometrics instead uses multiple indicators, which allow us to judge measurement quality with factor analyses and measures of internal consistency. However, for single questions we cannot rely on the same methods.
I will argue that we can sidestep some of these limitations by comparing different single questions for the same construct. I transfer a construct validation approach proposed by Westen and Rosenthal to the realm of single-item measures in social science surveys. Building on research on ex-post harmonization, I show that we can learn much about either question by comparing how both correlate with a shared set of covariates. The method yields a metric of how similar the constructs measured by the two questions are and a coefficient that quantifies the relative reliabilities of the two questions. Note that we do not need answers to both questions from the same set of respondents; we merely need two independent random samples from the same population, one for each question, a structure akin to split-ballot experiments.
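One plausible way to operationalize this comparison is sketched below in Python; it is an illustration of the general idea only, not necessarily the exact Westen-and-Rosenthal-based estimator used in the talk, and the question and covariate names are hypothetical.

    import numpy as np
    import pandas as pd

    def correlation_profile(sample: pd.DataFrame, question: str, covariates: list) -> np.ndarray:
        # Correlate the single question with each shared covariate in its own sample.
        return np.array([sample[question].corr(sample[c]) for c in covariates])

    def compare_questions(sample_a: pd.DataFrame, sample_b: pd.DataFrame, covariates: list):
        r_a = correlation_profile(sample_a, "question_a", covariates)
        r_b = correlation_profile(sample_b, "question_b", covariates)
        # Construct similarity: agreement between the two correlation profiles.
        similarity = np.corrcoef(r_a, r_b)[0, 1]
        # Relative reliability: a less reliable question is attenuated and thus
        # correlates more weakly with the covariates overall.
        relative_reliability = np.mean(np.abs(r_a)) / np.mean(np.abs(r_b))
        return similarity, relative_reliability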
The talk extends research presented at the last ESRA conference by exploring the two metrics with a set of simulations and a larger survey data set, and by applying them to data quality assessment (instead of data integration). Aside from demonstrating the validity of both metrics, I aim to show pragmatic use cases, for example, demonstrating empirically that we can use quite different questions (e.g., interest in political TV shows) as serviceable proxies for some constructs (e.g., general political interest). I also show that if we have quantified the reliability of one question, we can leverage that information to predict the quality of another question on the same construct.
Mr Luka Štrlekar (University of Ljubljana, Faculty of Social Sciences) - Presenting Author
Dr Vasja Vehovar (University of Ljubljana, Faculty of Social Sciences)
Web surveys can capture digital traces, known as paradata, which record respondents' activities while completing the questionnaire and provide insights into respondents' behavior. In practice, the most widely used type of paradata is response times (RTs) – the time required to complete a question, page, or entire survey. Respondents with short RTs are especially often studied in relation to response quality, and survey duration is considered an important quality assurance indicator.
However, to accurately evaluate the relation between RTs and response quality, RTs must be properly analyzed. Although web survey tools typically allow for technically accurate measurement of RTs (at the page level), the main dilemma is whether they can be unreservedly used in further research. This issue is inadequately addressed in the literature, where simple surveys and engaged respondents are usually assumed.
Namely, RTs and the related response speeds should reflect only respondents' cognitive processes and questionnaire characteristics, excluding confounding factors that could affect comparability: (1) pauses and multitasking behavior and (2) backtracking mean that recorded RTs are misestimated; (3) answering open-ended questions artificially creates the appearance of slower response speed; while (4) not answering questions and (5) not being exposed to questions due to branching give the appearance of faster response speed.
By removing the confounding effects of these factors (e.g., subtracting the pausing duration), we develop a concept of “actual response speed”, which refers to the speed when respondents are engaged in the response process as their primary cognitive activity and are exposed to standardized cognitive tasks (questions). This ensures comparable survey conditions for all respondents and enables the proper use of adjusted RTs in further analyses, particularly for the calculation of the actual (i.e., “true”) response time, which serves as an important quality assurance indicator. We also develop standardized solutions for treating confounding factors using R.
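The following Python sketch illustrates the adjustment logic for a single respondent (the authors' standardized solutions are implemented in R, and the paradata column names used here are hypothetical).

    import pandas as pd

    def actual_response_time(pages: pd.DataFrame) -> float:
        """Adjusted total response time (seconds) from page-level paradata with
        columns raw_seconds, pause_seconds, is_open_ended, answered, exposed."""
        p = pages.copy()
        # (5) Drop pages the respondent never saw because of branching.
        p = p[p["exposed"]]
        # (3) Drop open-ended pages, whose typing time is not comparable.
        p = p[~p["is_open_ended"]]
        # (4) Drop pages left unanswered (no response process took place).
        p = p[p["answered"]]
        # (1) Subtract pauses and multitasking episodes detected in the paradata.
        adjusted = (p["raw_seconds"] - p["pause_seconds"]).clip(lower=0)
        # (2) Backtracking (repeated visits to the same page) would additionally
        # need to be merged or excluded; omitted here for brevity.
        return float(adjusted.sum())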
Ms Jo Webb (UK Data Service, UK Data Archive, University of Essex)
Mrs Cristina Magder (UK Data Service, UK Data Archive, University of Essex)
Dr Sharon Bolton (UK Data Service, UK Data Archive, University of Essex)
Ms Liz Smy (UK Data Service, UK Data Archive, University of Essex)
Dr Hina Zahid (UK Data Service, UK Data Archive, University of Essex)
Mrs Beate Lichtwardt (UK Data Service, UK Data Archive, University of Essex) - Presenting Author
Mrs Gail Howell (UK Data Service, UK Data Archive, University of Essex)
Dr Finn Dymond-Green (UK Data Service, JISC)
This presentation explores strategies to equip data managers and practitioners with the skills needed to address data quality challenges effectively. Drawing on the “Skills Development for Managing Longitudinal Data for Sharing” project, commissioned by the ESRC and MRC as part of initial Population Research UK (PRUK) efforts, we present insights and resources developed to enhance data management and sharing practices within the Longitudinal Population Studies (LPS) community. This initiative responds to critical challenges identified by UK Research and Innovation (UKRI) in maximising the use of LPS data across social, economic, and biomedical sciences.
Our presentation highlights the interactive training workshops designed and delivered to over 300 data managers and professionals. These workshops emphasised foundational to advanced skills, such as synthetic data creation and harmonisation tools, and incorporated continuous evaluation to adapt to the community's evolving needs. The freely available, open-licensed training materials developed during this project provide a practical resource to support the LPS community and broader data professionals in improving data quality assurance and sharing, aligning with PRUK’s vision for impactful data use.
Ms Sophia Piesch (University of Mannheim) - Presenting Author
Professor Florian Keusch (University of Mannheim)
In the literature, various indicators such as straightlining and acquiescence have been proposed to detect non-optimal response behavior and assess the quality of survey data. One of the key challenges in effectively using and interpreting these indicators is the inconsistency in terminology, theoretical concepts, and methods for constructing them. While survey methodology literature predominantly examines these indicators within the framework of satisficing theory (Krosnick, 1991, 1999), psychological research draws on the concept of response styles (Paulhus, 1991). These concepts partly overlap in their theoretical assumptions; however, they are defined differently and assume different underlying cognitive processes. Yet, researchers frequently use them interchangeably and imprecisely, both within and across disciplines. This general lack of clarity hinders the comparability of studies and the utility of these indicators for assessing the quality of responses.
To address this issue, we conduct a comprehensive, multidisciplinary systematic review of empirical studies published since 2010 on response quality indicators. We provide a structured overview of how researchers measure and conceptualize different response quality indicators and document how these indicators vary as a function of personal, situational, and instrument-related characteristics. Our review will allow us to address key questions about the theoretical relationships between these indicators: Can findings related to one indicator be generalized to others? Do the different indicators reflect the same underlying construct, or are they influenced by different factors? A particular strength of our review is that we search widely for evidence across different strands of literature, rather than focusing on either satisficing theory or response styles, which have been the subject of previous reviews (Roberts et al., 2019; Van Vaerenbergh & Thomas, 2013). By doing so, we aim to leverage insights from different research areas to gain a better conceptual understanding of these indicators and to inform researchers on how to properly measure and apply them.
Mrs Corinna König (Institute for Employment Research) - Presenting Author
Professor Joe Sakshaug (Institute for Employment Research)
Due to declining response rates and higher survey costs, establishment surveys are (or have been) transitioning from traditional interviewer modes to online and mixed-mode data collection. The IAB Establishment Panel of the Institute for Employment Research (IAB), which was primarily a face-to-face survey, also experimented with an online starting mode followed by face-to-face as part of a sequential mixed-mode design, while the control group retained the traditional face-to-face design. Previous analyses have shown that the mixed-mode design maintains response rates at lower costs compared to the face-to-face design, but the question remains to what extent introducing the web mode affects data quality. We address this research question through several analyses: first, by comparing 20 survey responses from the single- and mixed-mode experimental groups to corresponding administrative data from employer-level social security notifications, using the administrative data to assess the accuracy of survey responses in both mode designs and to evaluate measurement equivalence; second, by comparing social desirability in answers between the single online mode and the face-to-face mode; third, by reporting on differences in triggering follow-up questions when answering filter questions; and fourth, by looking at item nonresponse in the different mode groups. To account for selection and nonresponse bias, selection weights are used throughout the analysis. First results show that measurement error bias in online interviews is sometimes larger than in face-to-face interviews, but compared to the mixed-mode design the differences are no longer significant. Looking at sensitive questions on social desirability, respondents generally do not answer less socially desirably in the online mode, with only a few exceptions. Thus, the study provides comprehensive insights into data quality for mixed-mode data collection in establishment surveys.
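As a hypothetical illustration of one such accuracy comparison, the Python sketch below computes the selection-weighted share of establishments whose survey report matches the linked administrative value within a relative tolerance, separately by mode design; all column names and the tolerance are assumptions, not the study's actual measures.

    import numpy as np
    import pandas as pd

    def weighted_agreement(df: pd.DataFrame, tol: float = 0.05) -> pd.Series:
        # Relative deviation between the survey report and the administrative value.
        rel_dev = (df["survey_value"] - df["admin_value"]).abs() / df["admin_value"].abs()
        df = df.assign(agree=(rel_dev <= tol).astype(float))
        # Selection-weighted agreement rate per mode design.
        return df.groupby("mode_design").apply(
            lambda g: np.average(g["agree"], weights=g["selection_weight"])
        )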
Dr Benjamin Phillips (Social Research Centre) - Presenting Author
Dr Dina Neiger (Social Research Centre)
Mr Kipling Zubevich (Social Research Centre)
Fraud is rife in samples recruited via open links (cf. Bonett et al. 2024; Johnson et al. 2024; Keeter et al. 2024; Pinzón et al. n.d.; White-Cascarilla and Brodhead n.d.). However, even online surveys protected by unique URLs or login pages are at risk from fraudsters who use mass attacks to guess login information and obtain incentives. Defeating such attacks requires countermeasures such as requiring respondents to pass CAPTCHAs, using digital fingerprinting software to identify duplicate and fraudulent responses, and paying incentives offline. These countermeasures can negatively impact the respondent experience, at a time when response rates are already at perilously low levels and serious nonresponse error is common, and they give rise to false positives for fraud. Ideally, the severity of countermeasures to online survey fraud should be proportional to the risk, accounting for the likelihood and severity of the impact of fraud.
We present a risk-based approach to determining an appropriate level of mitigation of fraud risk, balancing the strength of mitigations against the degree of risk. On the risk side, the approach accounts for factors including whether PII will be piped in (e.g., from prior waves or the sampling frame), the nature of the sampling frame and means of recruitment, the incentives used, the number of invitations, and the uses to which the research will be put (e.g., as input to public policy decisions). Mitigations considered include CAPTCHAs, digital fingerprinting and other fraud detection software, two-factor authentication, user ID complexity, means of incentive payment, and planned QC activities. The two are balanced using a set of importance weights, which have been refined through use of the tool.
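A toy Python sketch of this balancing logic is shown below; the factors, weights, and interpretation are hypothetical and are not the importance weights used in the authors' tool.

    # Hypothetical importance weights for risk factors and mitigations.
    RISK_WEIGHTS = {"pii_piped_in": 3.0, "open_recruitment": 2.5,
                    "high_incentive": 2.0, "large_invitation_volume": 1.5,
                    "policy_relevant_use": 2.0}
    MITIGATION_WEIGHTS = {"captcha": 1.0, "fingerprinting": 2.0,
                          "two_factor_auth": 2.5, "complex_user_ids": 1.5,
                          "offline_incentives": 2.0, "manual_qc": 1.0}

    def residual_risk(risk_factors: set, mitigations: set) -> float:
        """Weighted risk minus weighted mitigation; positive values suggest
        that stronger countermeasures are warranted."""
        risk = sum(RISK_WEIGHTS[f] for f in risk_factors)
        protection = sum(MITIGATION_WEIGHTS[m] for m in mitigations)
        return risk - protection

    print(residual_risk({"open_recruitment", "high_incentive"},
                        {"captcha", "fingerprinting"}))  # 1.5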
This paper contributes to the ongoing development of approaches to address fraud in online surveys.
Dr Masha Krupenkin (University of Maryland) - Presenting Author
Dr Andrew Gordon (Prolific)
Dr David Rothschild (Microsoft Research)
With the recent proliferation of automated text analysis methods, scholars are increasingly turning to open-ended survey responses as a measure of public opinion. However, the quality of open-ended responses on online surveys can be extremely variable. This paper presents several methods to assess and improve the quality of open-ended survey responses.
We develop and test four measures to determine the quality of open-ended responses. In Measure 1 (baseline), we use answer wordcount as a baseline measure of answer quality. In Measure 2 (human-coded), we employ workers on Prolific to rate open-ended survey responses. In Measure 3 (GPT), we train a GPT-4 model to rate OE responses using the training set of human ratings generated for Measure 2. In Measure 4 (Hybrid), we use human annotators to assess and adjust the LLM ratings generated by GPT-4.
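The word-count baseline (Measure 1) and a simple check of how closely an automated score tracks the human ratings can be sketched in Python as follows; the column names are hypothetical and the snippet is illustrative only.

    import pandas as pd

    def wordcount_score(responses: pd.Series) -> pd.Series:
        """Measure 1: number of words in each open-ended response."""
        return responses.fillna("").str.split().str.len()

    def agreement_with_humans(df: pd.DataFrame) -> float:
        """Rank correlation between an automated quality score and human ratings."""
        return df["auto_score"].corr(df["human_rating"], method="spearman")

    df = pd.DataFrame({
        "open_ended": ["Prices went up a lot this year.", "idk", ""],
        "human_rating": [4, 1, 1],
    })
    df["auto_score"] = wordcount_score(df["open_ended"])
    print(agreement_with_humans(df))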
We examine variability in OE data quality based on several sets of respondent characteristics. The first, platform, is a proxy for general audience quality. Our second set of respondent characteristics examines prior platform experience, including the number of studies taken, approval rate, and more. Finally, we examine how respondent age and education shape OE answer quality.
We also test the impact of survey infrastructure on OE data quality. We test the effects of desktop vs mobile infrastructure, as well as text vs voice answer options. We also examine the interaction between infrastructure and audience characteristics. Does more accessible OE infrastructure lead to greater increases in OE answer quality for specific audiences?
Finally, we examine the sensitivity of different summary models to OE response quality. We assess the following summary models: Topic Models, BERT, GPT-4. In addition to testing these models on real OE responses, we test these models using a synthetic dataset generated by GPT-4 that
Professor Sven Stadtmüller (University of Applied Sciences Göttingen) - Presenting Author
Professor Henning Silber (University of Michigan)
Ever more surveys are being conducted, and surveys are regularly used in political and economic decision making. Similarly, members of the public use survey results reported in the mass media to inform themselves about public opinion on political and economic issues. However, the increase in surveys is accompanied by a decline in survey quality and quality control. Even high-end news outlets often fail to distinguish between surveys that are conducted according to scientific standards and those that are not.
Recent research from Germany and the United States suggests that people's knowledge about survey quality is sparse. At best, members of the general public use heuristics regarding the sample size (the more, the better) and the sample composition (representativity equals high quality). However, important quality indicators such as the sampling method (i.e., random sample) and the response rate are rarely used. Additionally, little is known about how the general public evaluates survey quality and the trustworthiness of survey results when different quality information suggests different levels of survey quality. For example, if an individual receives a result from a survey which relies on a convenience sample (low quality) but has an impressive sample size (high quality), the recipient might either rely on the sample size heuristic or instead base his or her level of trust on the low-quality sampling method.
Against this background, this research uses an experimental design fielded in the German probability-based GESIS Panel to test a knowledge intervention. Specifically, the intervention briefly explains the relative importance of the sampling method and the sample size for the quality of a survey. Afterwards, we test (1) whether the intervention helped respondents to adequately discriminate “good” from “bad” surveys and (2) which individual-level factors (e.g., survey attitudes) moderate the effect of the knowledge intervention.