ESRA 2025 Preliminary Program
All time references are in CEST
Survey Data Integration: Nonprobability Surveys, Administrative and Digital Trace Data
Session Organisers: Dr Camilla Salvatore (Utrecht University), Dr Angelo Moretti (Utrecht University)
Time: Tuesday 15 July, 09:00 - 10:30
Room: Ruppert Wit - 0.52
Given the declining response rates and increasing costs associated with traditional probability-based sample surveys, researchers and survey organisations are increasingly investigating the use of alternative data sources, such as nonprobability sample surveys, administrative data and digital trace data.
While initially considered as potential replacements, it is now clear that the most promising role for these alternative data sources is to supplement probability-based sample surveys. Indeed, auxiliary data offer a considerable opportunity: among other advantages, they often provide timeliness, detailed time intervals, and geographical granularity.
This session will discuss cutting-edge methodologies and innovative case studies that integrate diverse data sources in survey research. It will highlight strategies for improving inference, assessing data quality, and addressing biases (e.g. selection and measurement). Attendees will gain insights into the latest advancements in data integration techniques, practical applications, and future directions for survey research in an increasingly complex data environment.
Keywords: data integration, online surveys, digital trace, selection bias
Papers
Using calibration for inference with non-probability samples when the linear model for the outcome is unsuitable
Ms Lisa Braito (University of Florence) - Presenting Author
Professor Emilia Rocco (University of Florence)
Unlike probability sampling, which relies on randomization and known, non-zero inclusion probabilities, nonprobability sampling uses arbitrary selection methods, resulting in unknown inclusion probabilities and making design-based inference methods inapplicable. Additionally, this kind of selection may introduce selection bias. Consequently, making inferences with a nonprobability sample requires auxiliary variables at the population level. Since these are often unavailable, auxiliary data from a reference probability sample are crucial. Based on how they utilize such data, methods for making inferences with nonprobability samples can be categorized into three groups: those that use the data to estimate propensity scores for units in the nonprobability sample, those that use them to predict outcome values in the reference sample, and the so-called doubly robust methods, which integrate a model for propensity scores with a model for the outcome. The third category also includes calibration, which does not explicitly define the two models but implicitly relies on their assumptions. In the classical framework, calibration weights are derived by minimizing entropy, subject to the constraint that they replicate auxiliary variable totals from the probability sample, thereby implicitly relying on a log-linear model for the propensity score and a linear model for the outcome. First, we conduct an extensive simulation study to compare various estimators from the expanding literature on the topic and discuss their performance, focusing on their effectiveness in reducing bias under model misspecification. Then, since calibration produces robust estimates if at least one model is valid, and the linear model may be inappropriate for many outcomes, we propose using augmented calibration, recently introduced in the context of causal inference for integrating randomized controlled trials and observational data. Augmented calibration allows us to account for the nonlinear nature of the outcome while also enabling more flexible approaches to modeling it.
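For readers less familiar with the setup, a minimal LaTeX sketch of the classical entropy-calibration problem described above (the notation is ours, not the authors'):

    % Calibrate weights w_i for the nonprobability sample S_A so that they
    % reproduce auxiliary totals estimated from the probability sample S_B
    % (d_i: starting weights in S_A; d_j^B: design weights in S_B).
    \min_{\{w_i\}} \sum_{i \in S_A} w_i \log\frac{w_i}{d_i}
    \quad \text{subject to} \quad
    \sum_{i \in S_A} w_i \mathbf{x}_i \;=\; \hat{\mathbf{t}}_x \;=\; \sum_{j \in S_B} d_j^{B} \mathbf{x}_j,
    \qquad
    \hat{\bar{y}} \;=\; \frac{\sum_{i \in S_A} w_i y_i}{\sum_{i \in S_A} w_i}.

The implicit log-linear propensity model and linear outcome model mentioned in the abstract follow from this particular choice of distance function and constraint set.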
Applying inverse probability weighting and mass imputation to estimate means in non-probability samples from the international Pandemic Preparedness Behaviour Survey
Dr Maartje Boer (National Institute for Public Health and the Environment (RIVM)) - Presenting Author
Dr Mart van Dijk (National Institute for Public Health and the Environment (RIVM))
Dr Saskia Euser (National Institute for Public Health and the Environment (RIVM))
Dr Floor Kroese (National Institute for Public Health and the Environment (RIVM))
Dr Jet Sanders (National Institute for Public Health and the Environment (RIVM))
In March 2024, the international Pandemic Preparedness Behaviour (PPB) monitor was initiated by the Dutch National Institute for Public Health and the Environment (n = 4636) to measure people's thoughts and behaviours related to pandemic preparedness and the prevention of infectious diseases. The survey allows for cross-national comparisons between four countries. However, the sampling method differed per country: both probability and non-probability research panels were used, with varying recruitment strategies, limiting the comparability of findings.
To improve the representativeness of estimates from the PPB data and to enhance cross-national comparability, we applied Inverse Probability Weighting (IPW) and Mass Imputation (MI) to estimate means of a selected outcome: trust in the country's legal system. We integrated the PPB samples with data from the same countries participating in the European Social Survey (ESS; n = 6804). The ESS is a high-quality international probability-based survey on public attitudes and behaviours. ESS data were used as a benchmark to calculate IPW weights for respondents in the PPB data and to conduct MI by imputing the selected outcome for respondents in the ESS data. IPW and MI means were estimated after calibration to each country's distribution of gender, age, educational level, and region. The selected outcome was also assessed in the ESS survey, and the observed ESS means served as a benchmark against which the IPW and MI estimates were compared.
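For concreteness, a minimal Python sketch of the two estimation strategies (the column names, the design_weight variable, and the simple logistic/linear specifications are our illustrative assumptions, not the authors' implementation):

    # Hedged sketch: IPW and mass imputation (MI) for a nonprobability
    # sample ("ppb") combined with a probability reference sample ("ess").
    # Assumes numeric/encoded covariates; ESS design weights are used only
    # in the MI step, a simplification relative to full pseudo-weighting.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X_COLS = ["gender", "age", "educ", "region"]  # assumed covariate names

    def ipw_mean(ppb: pd.DataFrame, ess: pd.DataFrame, y: str) -> float:
        """Model selection into the nonprobability sample, then weight its
        respondents by odds-based inverse-propensity pseudo-weights."""
        stacked = pd.concat([ppb[X_COLS], ess[X_COLS]])
        in_ppb = np.r_[np.ones(len(ppb)), np.zeros(len(ess))]
        propensity = LogisticRegression(max_iter=1000).fit(stacked, in_ppb)
        p = propensity.predict_proba(ppb[X_COLS])[:, 1]
        return np.average(ppb[y], weights=(1 - p) / p)

    def mi_mean(ppb: pd.DataFrame, ess: pd.DataFrame, y: str) -> float:
        """Fit an outcome model on the nonprobability sample, impute y for
        the probability sample, and take its design-weighted mean."""
        outcome = LinearRegression().fit(ppb[X_COLS], ppb[y])
        return np.average(outcome.predict(ess[X_COLS]),
                          weights=ess["design_weight"])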
Findings showed that, compared with the observed means of trust in the legal system in the ESS, PPB mean estimates based solely on calibration were (strongly) underestimated in all four countries. In all four countries, MI improved the mean estimates satisfactorily, provided the outcome model was not misspecified, whereas IPW improved the estimates in only two countries. Hence, MI was recommended as the most appropriate strategy to estimate and compare means based on the PPB data. Limitations will be discussed.
The opportunities and challenges of collecting administrative data from suppliers to inform government policies
Dr Tansy Arthur (non-member) - Presenting Author
This presentation is delivered by a social researcher who runs UK survey operations and also has commercial experience. Current and previous survey research on under-represented groups has shown the challenges of sampling groups within a population in a cost-effective way. Capturing timely information on lived experiences can inform government policies but can be costly. One option is to collect information through the suppliers who deliver services for these groups. The talk will focus on some of the commercial complications and opportunities of collecting administrative data from service suppliers. Using the example of local support services provided for UK carers, the presentation will explore the current challenges of asking service users about their experiences and of trying to do so through suppliers, e.g. the difficulties of capturing carer experiences when the services are provided by different third-party suppliers to different public sector organisations around the UK. The commercial contracts do not ask for user feedback as standard. However, there is a huge opportunity to collect timely feedback, as suppliers interact directly with service users and could collect useful information from these carers at source. This discussion will explore the potential challenges and benefits of collecting such administrative data, and how this might be achieved over time with the appropriate data quality.
Integrating administrative data with survey data to address attrition in longitudinal postdoctoral career research
Mr Kevin Schönholzer (University of Berne, Interfaculty Centre for Educational Research (ICER)) - Presenting Author
Mrs Barbara Wilhelmi (University of Berne, Interfaculty Centre for Educational Research (ICER))
Dr Janine Lüthi (University of Bern, Interdisciplinary Centre for Gender Studies (ICFG))
This case study examines the integration of administrative data with survey data from the Swiss National Science Foundation's Career Tracker Cohorts (CTC) project, which has been monitoring the career trajectories of early-career researchers in Switzerland since 2018. By following 4,053 postdoctoral researchers, the CTC study captures comprehensive insights into their grant outcomes, demographic profiles, employment conditions, research productivity, and professional perspectives. The central research interest of the CTC study is the comparison of academic and non-academic career paths and the extent to which SNSF grants have contributed to the achievement of career goals.
As with many longitudinal studies, the CTC project faces challenges related to attrition, leading to missing data that may be missing at random (MAR) or not missing at random (NMAR). The missingness is mainly related to funding status: researchers without SNSF funding drop out at a higher rate than those with SNSF funding. To address this, we link the survey data with administrative records via a unique identifier. This linkage enriches the survey data, enhancing statistical power, allowing biases to be assessed, and supplementing missing data.
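Purely as an illustration of the linkage step (the file and column names below are hypothetical, not the CTC project's actual identifiers):

    # Hypothetical sketch of linking survey responses to administrative
    # records on a shared unique identifier.
    import pandas as pd

    survey = pd.read_csv("ctc_survey.csv")   # assumed file name
    admin = pd.read_csv("snsf_admin.csv")    # assumed file name

    # A left join keeps every survey respondent; administrative fields
    # (e.g. grant outcomes) fill in even for waves a respondent skipped.
    linked = survey.merge(admin, on="person_id", how="left")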
For the analysis, we capitalize on a major advantage of the CTC study. The initial survey, conducted prior to funding decisions, achieved an exceptionally high response rate, providing nearly complete coverage of the target population. First, we assess whether there is selection bias in the administrative data and how it interacts with attrition in the survey data; second, we compare linked variables across both data sets to identify any systematic deviations. Our findings are expected to provide valuable insights into the potential and limitations of integrating administrative data with survey data in longitudinal studies.
Harnessing AI and Big Data for Model-based Inference
Dr Josef Hartmann (Verian) - Presenting Author
Dr Georg Wittenburg (Inspirient)
Model-based inference relies on best linear unbiased predictor (BLUP) models to predict the characteristics of the population elements that are not part of the sample. Typically, these models are fitted to the sample data in order to obtain best linear unbiased estimators.
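In symbols, the prediction step this builds on (standard model-based notation, not the authors' own):

    % Model-based estimator of a finite population total: observed sample
    % values plus model predictions for the non-sampled units in U \ s.
    \hat{t}_y \;=\; \sum_{i \in s} y_i \;+\; \sum_{j \in U \setminus s} \hat{y}_j,
    \qquad
    \hat{y}_j \;=\; \mathbf{x}_j^{\top} \hat{\boldsymbol{\beta}}
    \quad \text{(the BLUP under a linear superpopulation model)}.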
We explore the opportunity of applying AI models, particularly Machine Learning (ML), to complement or improve the prediction step of model-based inference in survey research, emphasizing their potential to quantify information from open-ended responses as well as from other available information. In the field of AI, data, including Internet content, can be conceptualized as a vast set of question-answer pairs. These pairs, when utilized as training data, are compressed within model parameters, enabling the model to return context-appropriate responses and their distribution. Combined with the sample information, this can be used to infer the distribution in the superpopulation and thus the most probable finite population values.
To illustrate this method, we plan to use it to predict income levels, generating a predicted income distribution. We evaluate how basic demographic information (e.g., age, gender, locality, employment status...) can help to improve the inference step.
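A minimal sketch of how a flexible ML learner could stand in for the linear predictor in the estimator above (the model choice and column names are our assumptions, not the authors' method):

    # Sketch: ML-based prediction for non-sampled units in model-based
    # inference. GradientBoostingRegressor stands in for any flexible
    # learner; covariates are assumed numeric/encoded.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    X_COLS = ["age", "gender", "locality", "employment_status"]  # assumed

    def predicted_population_total(sample: pd.DataFrame,
                                   nonsample: pd.DataFrame) -> float:
        """Observed incomes in the sample plus ML predictions for the
        remaining units of the population frame."""
        model = GradientBoostingRegressor().fit(sample[X_COLS],
                                                sample["income"])
        return sample["income"].sum() + model.predict(nonsample[X_COLS]).sum()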
The method is particularly valuable due to its flexibility in setting research criteria (e.g., income, political alignment, etc.) and in combining different kinds of input data (transaction data, social media data, administrative data) and data types (qualitative and quantitative). This research presents a novel approach to integrating AI into model-based inference in survey research.