All time references are in CEST
Assessing the Quality of Survey Data
Session Organiser | Professor Jörg Blasius (University of Bonn) |
Time | Tuesday 18 July, 09:00 - 10:30 |
Room |
This session will provide a series of original investigations on data quality in both national and international contexts. The starting premise is that all survey data contain a mixture of substantive and methodologically-induced variation. Most current work focuses primarily on random measurement error, which is usually treated as normally distributed. However, there are many kinds of systematic measurement error, or more precisely, many different sources of methodologically-induced variation, and all of them may have a strong influence on the “substantive” solutions. These sources include response sets and response styles, misunderstandings of questions, translation and coding errors, uneven standards among the research institutes involved in data collection (especially in cross-national research), item and unit nonresponse, as well as faked interviews. We consider data to be of high quality when the methodologically-induced variation is low, i.e. when differences in responses can be interpreted on the basis of theoretical assumptions in the given area of research. The aim of the session is to discuss different sources of methodologically-induced variation in survey research, how to detect them, and the effects they have on substantive findings.
Keywords: Quality of data, task simplification, response styles, satisficing
Professor Tobias Gummer (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Dr Tanja Kunz (GESIS - Leibniz Institute for the Social Sciences)
Mr Oscar Martinez (GESIS - Leibniz Institute for the Social Sciences)
Mr Matthias Roth (GESIS - Leibniz Institute for the Social Sciences)
Political knowledge questions are frequently used in political science surveys, yet their validity is increasingly challenged in web survey contexts where respondents can look up answers. When looking up answers, measures of declarative memory are confounded with procedural memory. Previous studies have focused on detecting and deterring such lookup behavior, often assuming respondents use search engines or databases like Wikipedia. However, the rise of generative AI applications, such as ChatGPT and Microsoft Copilot, introduces new dimensions to the information search process. These tools, designed to generate human-like text, offer an alternative means for retrieving information, potentially influencing how political knowledge scores reflect procedural memory.
This study explores the evolving landscape of lookup behavior by examining respondents’ use of generative AI to find answers to political knowledge questions. Through a two-study design, we first identify respondents’ distinct strategies for using these tools (Study I); here, we conducted a web survey among respondents recruited from an online access panel. We then evaluate the textual information generated by AI applications based on these strategies, analyzing its complexity and content (Study II); here, we collected texts generated by 4 generative AI applications, using the 5 prompting strategies identified in Study I, for 12 political knowledge questions from renowned election studies. Our findings highlight key differences in the search process and in the utility of the information retrieved via generative AI for answering political knowledge questions. We aim to answer two research questions:
RQ1. How do respondents use generative AI applications to answer political knowledge questions?
RQ2. How useful is the textual information obtained when using generative AI to answer political knowledge questions, and how does it differ depending on prompting strategies?
Mrs Leah Bloy (Hebrew University Business School) - Presenting Author
Dr Yehezkel Resheff (Hebrew University Business School)
Professor Avraham N. Kluger (The Hebrew University)
Dr Nechumi Malovicki-Yaffe (Tel Aviv University)
Invalid responses pose a significant risk of distorting survey data, compromising statistical inferences, and introducing errors in conclusions drawn from surveys. Given the pivotal role of surveys in research, development, and decision-making, it is imperative to identify careless survey respondents. The existing literature comprises two primary categories of approaches: methods reliant on survey items, and methods involving post-hoc analyses. The latter, which do not demand preemptive preparation, predominantly incorporate statistical techniques aimed at identifying distinct response patterns associated with careless responding. However, several inherent limitations hinder the precise identification of careless respondents. One notable challenge is the lack of consensus concerning the thresholds to use for the various measures. Furthermore, each method is designed to detect a specific response pattern associated with carelessness, leading to conflicting outcomes. This paper assesses the efficacy of the existing methods using a novel survey methodology encompassing responses to both meaningful scales and meaningless gibberish scales, where the latter compel respondents to answer without considering item content. Using this approach, we propose the application of machine learning to identify careless survey respondents. Our findings underscore the efficacy of supervised machine learning combined with the unique gibberish data methodology (GibML) as a potent method for identifying careless respondents, aligning with and outperforming other approaches in terms of effectiveness and versatility.
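To make the supervised step concrete, the following is a minimal sketch with hypothetical variable names (an assumed input file, a set of post-hoc carelessness indices, and a label derived from the gibberish scale) and a generic gradient-boosting classifier; the authors' actual features and model are not specified in the abstract.

```python
# Minimal sketch (hypothetical data and column names): train a supervised
# classifier to recover gibberish-scale-based carelessness labels from
# post-hoc response-quality indices.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("survey_responses.csv")           # assumed input file
feature_cols = ["longstring", "irv", "even_odd",   # assumed post-hoc indices
                "response_time", "mahalanobis_d"]
X = df[feature_cols]
y = df["careless"]   # 1 = flagged via the gibberish-scale criterion, 0 = attentive

clf = GradientBoostingClassifier(random_state=0)
# Cross-validated AUC as a rough indication of how well post-hoc indices
# recover the gibberish-based labels.
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"Mean cross-validated AUC: {auc:.2f}")
```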
Dr Gianmaria Bottoni (City St George's, University of London) - Presenting Author
Dr Eva Aizpurua (National Centre for Social Research (NatCen))
Survey responses are shaped by both explicit design choices and subtle features, such as the numerical labels of response scales. This study investigates the effect of numerical scale labels on response behaviour using 11-point bipolar scales with verbal anchors ("extremely negative" to "extremely positive") and two numerical label ranges: 0 to 10 and -5 to +5. In a face-to-face cross-national survey conducted in Great Britain, Hungary, and Portugal, respondents were randomly assigned to one of these two scales to evaluate perceptions of climate change's overall impact and its effects on specific groups. Contrary to established findings (e.g., Schwarz et al., 1991; Tourangeau et al., 2007), mean scores were consistently lower for the -5 to +5 scale across countries and items. This suggests that numerical scale labels interact with contextual and item-specific factors, influencing response distributions in ways that may not generalise across studies. Additionally, respondents assigned to the 0 to 10 scale were more likely to select the midpoint (5) than those using the -5 to +5 scale (0), likely due to differing perceptions of neutrality associated with these options. These findings raise important questions about the context-dependence of numerical labels, particularly for polarising topics like climate change. The implications for survey design are relevant, highlighting how seemingly secondary design elements can shape data outcomes. We also discuss how some of these results differ from prior research and outline directions for future studies to better understand the nuanced effects of numerical labels across diverse topics.
Dr Martina Kroher (Leibniz University Hannover) - Presenting Author
Dr Sebastian Lang (Leibniz Institute for Educational Trajectories (LIfBi))
In the social sciences, scales are a common tool for collecting data on attitudes, beliefs, agreement, and more. On the one hand, there are different scales to choose from; on the other hand, the same scale can be used in different ways, e.g. by changing the order of the endpoints: from very satisfied to very dissatisfied, or from very dissatisfied to very satisfied. It can be assumed that which pole is offered first has an individual effect on respondents’ answers.
In addition, numbers are sometimes placed next to the scale points to identify the different responses on the scale. These numbers can also influence the answers given. Some respondents will not notice these small numbers, but others will, and this may lead to altered responses.
Overall, we are interested in whether there are any consequences when scales are implemented differently. Are they still measuring the same construct in the same way?
We analyze data from self-administered paper-and-pencil questionnaires randomly assigned to 6,000 households in Hanover, Germany, using an improved form of random route design. We test different scales by randomly varying several of these scaling options: (1) positive to negative response options with numbers, e.g. 1 to 11, (2) positive to negative response options without numbers, (3a) positive to negative response options with numbers, e.g. -5 to 5, and (3b) negative to positive response options with numbers, e.g. -5 to 5.
In our contribution, we will show whether there are effects on respondent behavior due to different scale designs in terms of labeling and direction. Initial (preliminary) results suggest that there is little effect of scale design on respondent behavior with respect to questions on life satisfaction.
Dr Fernanda Alvarado-Leiton (University of Costa Rica) - Presenting Author
The use of oppositely worded items in measurement scales is ubiquitous in survey research. Although controversial, mixing the direction of item wording to create balanced scales is still advised to address measurement errors such as response styles and straight-lining or to achieve scale validity.
Best practices for reversing items to create balanced scales remain up for debate; however, there is consensus that reversed items in balanced scales should be semantically equivalent to the un-reversed items.
The extant literature suggests that achieving semantic equivalence is a non-trivial task and may depend on multiple factors. One of these factors is the use of negations (e.g., satisfied/not satisfied) or polar opposite concepts (e.g., satisfied/unsatisfied) to reverse items. Empirical evidence suggests that negations and polar opposite wordings, although related, do not convey the same meaning and are not exactly opposite in meaning to the unreversed item.
However, most of the evidence available to date on these differences comes from small experiments that do not represent real survey scenarios. In this paper we explore the differences in meaning between negated, polar opposite, and unreversed items using data from a web survey about subjective well-being with n = 3,600 participants. In addition, we investigate respondent demographics and item characteristics as possible sources of semantic differences, both missing in previous literature.
Data were collected through opt-in online panels in the United States and quotas by gender, education, race/ethnicity and age were in place to secure representation of different demographic groups. Participants were randomly assigned to negated, polar opposite or unreversed wording for five measurement scales using an Agree-Disagree rating scale. Data are analyzed through multi-level modelling to account for both respondent and item variables.
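As an illustration of such a specification, the sketch below fits a crossed respondent-by-item multilevel model in statsmodels with hypothetical variable names; the exact model used in the study is not given in the abstract.

```python
# Hypothetical sketch: multi-level model with crossed respondent and item
# effects for an experiment randomly assigning item wording
# (negated / polar opposite / unreversed). Variable names are assumed.
import pandas as pd
import statsmodels.formula.api as smf

long = pd.read_csv("wording_experiment_long.csv")  # one row per response (assumed file)
long["one"] = 1  # single grouping constant so the two random factors are crossed

model = smf.mixedlm(
    "rating ~ C(wording) + age + C(gender) + C(education)",
    data=long,
    groups="one",
    vc_formula={"respondent": "0 + C(respondent_id)",  # respondent variance component
                "item": "0 + C(item_id)"},             # item variance component
)
result = model.fit()
print(result.summary())
```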
Ms Kristín Hulda Kristófersdóttir (University of Iceland) - Presenting Author
Dr Vaka Vésteinsdóttir (University of Iceland)
Dr Hafrún Kristjánsdóttir (Reykjavík University)
Dr Þorlákur Karlsson (brandr)
Professor Fanney Þórsdóttir (University of Iceland)
The Patient Health Questionnaire-9 (PHQ-9) is one of the most widely used tools for screening and assessing depression. However, previous research has yielded inconsistent results regarding its factor structure, with most studies suggesting either a one- or two-factor model. One possible explanation for the emergence of a two-factor structure is that certain items may be more sensitive than others and, therefore, more likely to lead to socially desirable responding (SDR). This study explores this possibility by assessing the sensitivity of the PHQ-9 items. A total of 273 participants completed 36 paired comparisons of the PHQ-9 items, indicating which symptoms they would find more uncomfortable to disclose. Additionally, absolute judgments were collected, where participants rated each item as either uncomfortable or not uncomfortable to disclose. Data were analyzed using a model for pair comparisons rooted in Thurstone's law of comparative judgment to estimate the relative sensitivity of each item and whether they were more or less likely to be judged as (not) uncomfortable to disclose. Kendall's coefficients of consistence and agreement were calculated to evaluate the internal consistency of participants' responses and the level of agreement between them. Results showed that cognitive/affective symptoms, such as feelings of worthlessness and depressed mood, were perceived as more sensitive than somatic symptoms like fatigue and sleep disturbances. Notably, the sensitivity estimates obtained in this study align closely with prior factor analytic findings that have supported a two-factor model distinguishing cognitive/affective and somatic symptoms. These findings suggest that SDR may contribute to the underreporting of certain depression symptoms, particularly cognitive/affective ones, potentially accounting for the inconsistent factor structures observed in previous research. Consequently, both researchers and clinicians should consider the impact of SDR when interpreting PHQ-9 scores to ensure more accurate assessments.
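The core of the pair-comparison analysis can be sketched as follows under Thurstone's Case V assumptions, with placeholder data in place of the study's observed proportions (the exact model used by the authors may differ):

```python
# Minimal Thurstone Case V sketch: scale values from a paired-comparison
# proportion matrix P, where P[i, j] is the share of respondents judging
# item i more uncomfortable to disclose than item j. Data are placeholders.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_items = 9                                           # e.g. the nine PHQ-9 items
P = rng.uniform(0.2, 0.8, size=(n_items, n_items))    # placeholder proportions
P = np.triu(P, 1) + np.tril(1 - P.T, -1)              # enforce P[i, j] + P[j, i] = 1
np.fill_diagonal(P, 0.5)

Z = norm.ppf(P)                 # unit-normal transform of the proportions
scale_values = Z.mean(axis=1)   # row means give relative sensitivities (up to a constant)
print(np.round(scale_values - scale_values.min(), 2))  # anchor the lowest item at 0
```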
Professor Jörg Blasius (University of Bonn) - Presenting Author
Professor Susanne Vogl (University of Stuttgart)
Outliers are common in all social surveys, and it is well-known that they can adversely affect the solutions of multivariate data analysis. This has been discussed especially for regression analysis, but also for principal component analysis and other scaling and cluster methods. There are many different techniques suggested in the literature for reducing the effects of outliers. In the simplest case, the affected cases are deleted. In the social sciences, outliers are rarely discussed, least of all in combination with respondent behavior such as satisficing. Further, almost no attention has been paid to those cases where the effect of outliers is amplified by routinely applied techniques, e.g., rotation in principal component analysis.
In principal component analysis, categorical principal component analysis, and factor analysis, varimax rotation is often applied, sometimes even without discussing the unrotated solution. In this paper, we show that this kind of rotation sometimes only optimally adapts to the outliers of a survey, which might be caused by a small number of respondents giving arbitrary answers or using some kind of simplified response style; this response behavior is often called satisficing. As a result, the content of the solution can be changed. To illustrate our findings, we use empirical data from a self-administered online survey of pupils aged 14–16 from lower track secondary schools in Vienna, Austria, in 2018 (N=3,078).
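The mechanism can be illustrated with simulated data: a handful of arbitrary response patterns is enough to pull a varimax-rotated component towards the outliers. The sketch below uses a standard varimax implementation and invented data, not the Vienna survey analysed in the paper.

```python
# Illustrative sketch (simulated data): a few arbitrary "satisficing" response
# vectors can dominate a varimax-rotated principal component solution.
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Standard varimax rotation of a loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
        )
        R = u @ vt
        if s.sum() < var_old * (1 + tol):
            break
        var_old = s.sum()
    return loadings @ R

rng = np.random.default_rng(1)
clean = rng.normal(size=(1000, 8))                          # well-behaved respondents
outliers = np.tile([5, -5, 5, -5, 5, -5, 5, -5], (15, 1))   # a few arbitrary patterns
X = np.vstack([clean, outliers])

# Principal components of the correlation matrix, then varimax rotation;
# compare with the solution obtained from `clean` alone.
eigval, eigvec = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigval)[::-1][:2]
loadings = eigvec[:, order] * np.sqrt(eigval[order])
print(np.round(varimax(loadings), 2))
```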
Dr Thorsten Euler (German Centre for Higher Education Research and Science Studies) - Presenting Author
Mrs Ulrike Schwabe (German Centre for Higher Education Research and Science Studies)
Respondents’ willingness to answer surveys strongly determines data quality. However, research has shown that survey nonresponse is a reason for concern in household surveys in many countries (de Leeuw & de Heer, 2022; de Leeuw et al., 2018; Luiten et al., 2020). Low response rates are perceived by the public as a sign that survey results are less meaningful; for researchers, they increase recruitment costs and the effort required to achieve targeted sample sizes.
While household surveys cover the whole population, we focus on surveys of the highly qualified, defined as individuals holding at least an entrance qualification to higher education. Highly qualified individuals are a special group: they are particularly often invited to participate in surveys within their respective educational institutions. As a result, they experience a higher survey burden and are more likely to be the source of scientific research findings. Covering a time span from the late 1980s to 2022, we map nonresponse trends for students, graduates, PhD candidates, PhD holders, and professors in selected voluntary surveys in Germany. Further, we analyze how modes of administration, modes of contact, and incentivization influence nonresponse rates in one-off and panel surveys.
Generally, response rates in surveys of the highly qualified decline over time, as our target group suffers from growing response burden while the overall number of surveys being conducted has increased. Our investigation shows that paper-and-pencil surveys achieve higher response rates, while online-only administration leads to the lowest participation. Monetary incentives are more attractive for students; professors’ willingness to participate, in contrast, seems to be boosted by being informed about the results.
Mrs Fabienne Kraemer (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Social desirability (SD-) bias (the tendency to report socially desirable opinions and behaviors instead of true ones) is a widely known threat to the validity of self-reports. Previous studies investigating socially desirable responding (SDR) in a longitudinal context provide mixed evidence on whether SD-bias increases or decreases with repeated interviewing and how these changes affect response quality in subsequent waves. However, most studies were non-experimental and only suggestive of the mechanisms of change in SD-bias over time. This study investigates SDR in panel studies using a longitudinal survey experiment comprising six waves. The experiment manipulates the frequency of answering identical sensitive questions (target questions) and assigns respondents to one of three groups: the first group received the target questions in each wave, the second group received the questions in the last three waves, and the control group received the target questions only in the last wave. The experiment was conducted within a German non-probability panel (n = 1,946) and a probability-based panel (n = 4,660). The analysis focuses on between- and within-group comparisons to investigate changes in answer refusal and responses to different sensitive measures. To further examine the underlying mechanisms of change, I conduct moderator and mediator analyses on the effects of respondents’ privacy perceptions and trust towards the survey (sponsor). First results show a decrease in answer refusal and SDR with repeated interviewing for most of the analyzed sensitive measures. However, these decreases were non-significant for both between-group comparisons and comparisons over time. Altogether, this study provides experimental evidence on the impact of repeated interviewing on changes in SD-bias and contributes to a deeper understanding of the underlying mechanisms by examining topic-specific vs. general survey experience and incorporating measures of privacy perceptions and trust towards the survey (sponsor).
Dr Alexandru Cernat (University of Manchester) - Presenting Author
Dr Chris Antoun (University of Maryland)
Reliability in survey measurement is a crucial aspect of data quality. Yet surprisingly little attention has been given to assessing reliability for commonly used survey measures or evaluating how reliability may vary across different survey contexts. This study leverages the Comparative Panel File (CPF), a dataset compiled from long-running household panel surveys in seven countries—Australia, Germany, Russia, South Korea, Switzerland, the United Kingdom, and the United States—to estimate the reliability of 14 survey indicators from 2001 to 2020 using quasi-simplex models. We find that reliability is high and consistent across countries for factual items but substantially lower and more variable for self-assessment items (e.g., satisfaction with life, satisfaction with work, self-rated health). Additionally, we find that variations in question wording across the panels (e.g., asking about health currently versus in general) led to different reliabilities. We discuss the implications these findings have for measurement comparability in cross-national research.
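For reference, the quasi-simplex model underlying such reliability estimates can be written in standard form as follows (a schematic statement; the identification constraints used in the paper are omitted):

```latex
% Quasi-simplex model for repeated measures y_t, t = 1, ..., T:
% observed score = latent true score + measurement error, with true scores
% following a first-order autoregressive process.
\begin{align}
  y_{t} &= \tau_{t} + \varepsilon_{t}, \qquad
  \tau_{t} = \beta_{t}\,\tau_{t-1} + \zeta_{t},\\
  \text{reliability}_{t} &= \frac{\operatorname{Var}(\tau_{t})}{\operatorname{Var}(\tau_{t}) + \operatorname{Var}(\varepsilon_{t})}.
\end{align}
```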
Dr Alexandru Cernat (University of Manchester) - Presenting Author
Dr Joe Sakshaug (IAB)
Dr Bella Struminskaya (Utrecht University)
Dr Susanne Helmschrott (Bundesbank)
Dr Tobias Schmidt (Bundesbank)
This study investigates the phenomenon of panel conditioning and its impact on the reliability and stability of financial behaviour and expectation measures collected through a high-frequency longitudinal survey. Panel conditioning occurs when repeated participation in a survey influences respondents' behaviours and responses, potentially leading to biased data. Utilizing a quasi-experimental design and structural equation modelling, we analyse data from a monthly online survey to assess the extent to which panel participation affects data quality. Our findings indicate that while panel conditioning can lead to a slight increase in the reliability of certain measures over time, its effects on stability are more complex and variable-dependent. This research contributes to the broader understanding of panel conditioning effects, offering methodological guidance for future studies in this domain.
Dr Nicholas Yeh (Internal Revenue Service) - Presenting Author
Dr Gwen Gardiner (Internal Revenue Service)
Dr Scott Leary (Internal Revenue Service)
Ms Brenda Schafer (Internal Revenue Service)
The United States Internal Revenue Service (IRS) annually administers the Individual Taxpayer Burden Survey to gather data about the time and money that individuals spend to comply with federal tax reporting requirements. Triennially, individuals who filed their return after December of the year the return was due are also surveyed. These late filers often respond at lower rates, which could cause data validity issues. The current survey protocol involves mailing an invitation to complete a web-only survey. This invitation directs recipients to a survey landing page on IRS.gov, where they read information about the purpose of the study and click a link to begin. This study focuses on testing modifications to the introductory language on the survey landing page and the online survey introduction pages. These modifications emphasize benefits to participants, enhance readability (e.g., making critical information more salient), and apply other best practices aimed at improving the number and quality of survey responses. In this study, individuals were randomly assigned to receive either the modified language (experimental condition; N = 7,850) or the original language (control condition; N = 7,850). Preliminary analysis revealed that the experimental condition significantly increased response rates compared to the control condition (21%). The current analysis examines whether the experimental condition also improved measures of data quality (e.g., completeness/missing data, consistency of responses across survey items). We also explore the potential impact on the quality of the critical time and money questions.
Mr Kim Backström (Åbo Akademi University) - Presenting Author
Dr Alexandru Cernat (The University of Manchester)
Dr Inga Saikkonen (Åbo Akademi University)
Professor Kim Strandberg (Åbo Akademi University)
We live in a time of political polarization and democratic backsliding. To better understand such processes, it is essential to measure these complex concepts correctly. Attitudinal survey questions, which are typically used to collect data in this area, are known to be affected by measurement errors such as social desirability, acquiescence, and random error. This study investigates the effect of different types of measurement error on survey items measuring democratic support using the MultiTrait MultiError (MTME) approach, a within-person experimental design.
The study is based on two waves of the Finnish Citizens’ Opinion panel’s municipal and regional election study and estimates concurrently correlated errors (social desirability bias, acquiescence bias, method effects) and random error. The design manipulates question-wording (positive vs. negative), response scale direction (agree-first vs. disagree-first), and scale length (5 vs. 7 points) across two measurement points, while employing latent variable modeling to estimate the errors.
We also investigate differences across key groups, such as those with lower panel recruitment propensities and those exhibiting the characteristics of a ‘critical citizen.’ Finally, we will also run sensitivity analyses to determine the occurrence of memory effects and the robustness of the social desirability estimates using proxies such as the Marlowe-Crowne scale and the Big 5 personality traits.
By concurrently identifying and correcting for different types of measurement error, this study contributes to a deeper understanding of democratic support and its measurement, offering insights for improving survey research. It can also lead to better question wording and response scale selection recommendations.
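In schematic form, the MTME design decomposes each observed answer roughly as follows (a sketch of the general idea stated in the abstract, not the exact specification estimated in the study):

```latex
% Observed response of person i on trait t under experimental form f
% (question wording, scale direction, scale length): trait score plus
% correlated error components manipulated by the design, plus random error.
\begin{equation}
  y_{itf} \;=\; \tau_{it}
          \;+\; \lambda^{\mathrm{SD}}_{f}\,\mathrm{SD}_{i}
          \;+\; \lambda^{\mathrm{ACQ}}_{f}\,\mathrm{ACQ}_{i}
          \;+\; \lambda^{\mathrm{M}}_{f}\,\mathrm{M}_{i}
          \;+\; \varepsilon_{itf}
\end{equation}
```

Here SD, ACQ, and M denote latent social desirability, acquiescence, and method components whose loadings depend on the manipulated form of the question.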
Ms Victoria Salinero-Bevins (European Social Survey HQ (City St George's, University of London)) - Presenting Author
Mr Nathan Reece (European Social Survey HQ (City St George's, University of London))
As the European Social Survey transitions from collecting data through face-to-face interviews to using web and paper self-completion modes, it is expected that item nonresponse will increase in the absence of interviewers. Among the 12 countries that implemented self-completion surveys during or alongside ESS Round 10, item nonresponse was generally higher than in their most recent face-to-face survey. Within self-completion modes, completions on paper generally have more item nonresponse than on web. This paper describes in detail which parts of the ESS questionnaire have been particularly susceptible to higher item nonresponse when switching modes from face-to-face to self-completion and investigates other possible factors that may be correlated with the propensity to skip questions in self-completion.
In general, we find that item nonresponse is most prevalent among questions that involve complex routing instructions in the paper questionnaire and open response questions. Nonresponse in open response questions tends to be higher among paper completions although this is not universally the case. Evidence also suggests that nonresponse patterns can be sensitive to features of the layout and graphic design of the paper questionnaire. Among web completions, we investigate whether nonresponse patterns can be explained by device types and certain demographic characteristics. Nonresponse tends to be higher among respondents answering on mobile phones. Nonresponse patterns differ according to operating system, both on mobile devices and on personal computers and tablets. The findings presented will help inform how cross-national social surveys can reduce item nonresponse when transitioning from face-to-face interviewing to self-administered web and paper questionnaires.
Mr Sebastian Vogler (Leibniz Institute for Educational Trajectories) - Presenting Author
The increasing problem of low participation rates in large-scale surveys necessitates the re-evaluation of traditional methodologies. Self-administered online surveys have emerged as a promising alternative, yet concerns about data quality, particularly in online access panels, persist. Attention checks in online surveys play a crucial role in ensuring high-quality data by identifying inattentive respondents. While the implementation of online surveys has the potential to reduce costs, enhance the composition of respondents, and facilitate greater access, surveys using online access panels are frequently accompanied by significant disadvantages: the incentive of financial or other forms of compensation in exchange for completing a questionnaire might distort respondents' actual response behaviour, as they may answer questions in an inadequate, untruthful, or hasty manner just to complete the survey with minimal effort. At the same time, high failure rates in attention checks can result in substantial data exclusion, raising the need for supplementary strategies to improve attentiveness.
This study investigates the potential of feedback-based interventions as a solution, providing participants with feedback about failed attention checks and emphasizing the importance of high-quality responses. We present results from an online survey with a randomized controlled trial (RCT) on the effect of a feedback intervention on respondents' attentiveness, response behaviour, and overall data quality. We deployed an online survey (N=600) with 48 items, including questions about work satisfaction, career opportunities, and health behaviour. An attention check is incorporated as an instructed manipulation check halfway through the questionnaire. The control group received only the attention check, whilst the treatment group received the feedback intervention followed by the same attention check again. We will present first results on the impact of feedback on the enhancement or deterioration of attentiveness, response behaviour, and the overall quality of survey data.
Dr Blanka Szeitl (HUN-REN Center for Social Sciences, University of Szeged) - Presenting Author
Dr Tamás Rudas (Eötvös Loránd University)
In traditional theories, the accuracy of estimates is assessed relative to the true theoretical value that characterizes the population. However, in survey practice, the true population value is rarely known, and the assessment may thus become illusory. In this study, a new aspect of describing the precision of values found in surveys is discussed: how much a value observed in a survey differs from the theoretical value that could be obtained in a replication of the survey. The first finding is theoretical: the difference between two replications of a survey can be decomposed into nonresponse uncertainty (NU) and measurement uncertainty (MU). NU is the sample component; it depends on who chooses to respond to, or refuses, a survey. MU is the measurement component, which depends on how respondents answer the questions. The second and third findings are empirical and are based on a case study of the European Social Survey (ESS): in contrast to the general importance attributed to nonresponse problems, the magnitude of NU is not substantial. Both NU and MU affect multivariate analyses, which is in line with previous findings in measurement theories in survey research.
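In schematic terms, the decomposition described above can be written, for a given survey statistic obtained from two replications of the same design, as follows (a sketch of the idea rather than the authors' formal derivation):

```latex
% Difference between two replications of a survey, split into a nonresponse
% component (who responds) and a measurement component (how they answer).
\begin{equation}
  \hat\theta^{(1)} - \hat\theta^{(2)}
    \;=\; \underbrace{\Delta_{\mathrm{NU}}}_{\text{who responds}}
    \;+\; \underbrace{\Delta_{\mathrm{MU}}}_{\text{how respondents answer}}
\end{equation}
```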