
ESRA 2025 Preliminary Program

              



All time references are in CEST

Survey Data Integration: Nonprobability Surveys, Administrative and Digital Trace Data 3

Session Organisers: Dr Camilla Salvatore (Utrecht University), Dr Angelo Moretti (Utrecht University)
Time: Wednesday 16 July, 09:00 - 10:30
Room: Ruppert Wit - 0.52

Given the declining response rates and increasing costs associated with traditional probability-based sample surveys, researchers and survey organisations are increasingly investigating the use of alternative data sources, such as nonprobability sample surveys, administrative data and digital trace data.
While initially considered as potential replacements, it is now clear that the most promising role for these alternative data sources is to supplement probability-based sample surveys. Indeed, auxiliary data offer a considerable opportunity: among other benefits, they often provide timeliness, finely grained time intervals, and geographical granularity.
This session will discuss cutting-edge methodologies and innovative case studies that integrate diverse data sources in survey research. It will highlight strategies for improving inference, assessing data quality, and addressing biases (e.g. selection and measurement). Attendees will gain insights into the latest advancements in data integration techniques, practical applications, and future directions for survey research in an increasingly complex data environment.

Keywords: data integration, online surveys, digital trace, selection bias

Papers

Transforming UK Travel and Tourism Statistics: From Survey Methods to Admin Data

Mrs Sabina Kastberg (Office for National Statistics)
Miss Claudia Jenkins (Office for National Statistics) - Presenting Author

Following a review of our statistics in 2019, the UK Office for National Statistics (ONS) are transforming the way we collect travel and tourism (T&T) data to deliver more efficient, accurate and coherent T&T statistics.

Since July 2024, ONS have moved away from sole reliance on the International Passenger Survey for T&T data and have instead updated our methods to harmonise with three other data providers:
• ONS have harmonised with the Civil Aviation Authority; this has meant adapting our survey design to use a gate room model and port-side approach which will boost sample sizes and increase precision.
• ONS have introduced the use of household survey data (working in partnership with National Tourism Associations on the use of The Great Britain Tourism Survey) to replace interview operations for UK resident arrivals, thus reducing costs and improving response rates.
• ONS are now working in partnership with other research agencies (the Northern Ireland Statistics and Research Agency) to share data and improve coherence.

However, the ONS’s longer-term ambition is to move to an administrative data model with machine learning methodologies. Primarily, ONS have been exploring the possibilities that short-term accommodation, mobility, and financial transaction data (e.g. Airbnb, Advanced Passenger Information and visa card data) offer for better understanding international mobility and patterns of spending inside and outside the UK.

This presentation will talk through the approach, research and complexities involved in introducing novel methodologies to measuring T&T, as well as outline some initial results from comparing T&T estimates derived from survey methods with those derived from administrative data. In particular, the presentation will provide an overview of the methodological challenges faced when replacing surveys, e.g. considerations and potential solutions for linking administrative datasets with no common linkage variable.
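To illustrate the linkage challenge mentioned above, two administrative datasets that share no common linkage variable can, in principle, be matched on shared numeric quasi-identifiers via nearest-neighbour distance. The sketch below is a generic, hypothetical illustration of that idea, not the ONS's actual linkage method; all names and data are invented.

```python
import numpy as np

def nearest_record_link(records_a, records_b, tol=1.0):
    """Link records across two datasets with no common ID by nearest
    neighbour on shared numeric quasi-identifiers (illustrative sketch).

    records_a, records_b: 2-D arrays whose columns are comparable
    quasi-identifiers (e.g. standardized date, port code, party size).
    Returns a list of (index_in_a, index_in_b) candidate links.
    """
    links = []
    for i, row in enumerate(records_a):
        d = np.linalg.norm(records_b - row, axis=1)  # distance to every B record
        j = int(np.argmin(d))                        # closest B record
        if d[j] <= tol:                              # accept only close matches
            links.append((i, j))
    return links
```

A real application would also need blocking, one-to-one assignment, and validation of false-match rates, which this sketch omits.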


One harmonization fits all? - Investigating the reusability of equating solutions across populations and time

Mr Matthias Roth (GESIS – Leibniz Institute for the Social Sciences) - Presenting Author
Dr Ranjit K. Singh (GESIS – Leibniz Institute for the Social Sciences)

Combining survey data from multiple sources can help to enrich time series or to increase the sample size of populations of interest. However, the design of questions (e.g. number and labeling of response options) measuring the same construct can differ widely among surveys. This leads to a need for harmonization methods which can align differences in measurements so that data can be combined and analyzed together. Recently, it has been shown that observed score equating in a random groups design (OSE-RG) efficiently harmonizes data from questions measuring latent constructs, such as attitudes and values. In essence, OSE-RG transforms the response scale of one question into the format of a second question.
However, an OSE-RG response scale transformation needs to be calibrated on data randomly drawn from the same population at similar points in time. This requirement can make it difficult to apply OSE-RG to data from non-probability samples and to time-series survey data. One way to still use OSE-RG in these settings is to reuse existing OSE-RG response scale transformations calibrated on data from probabilistic samples. In this presentation, we investigate the reusability of OSE-RG response scale transformations across populations and over time, synthesizing results from two studies.
We used data from general population surveys to artificially create situations in which we calibrate an OSE-RG response scale transformation in one population and then reuse the calibrated response scale transformation to harmonize data from other populations or years. We find that reusability over populations and time depends on how different the calibration and application population are. The reusability also differs by construct measured. We close by discussing how the results transfer to the application of OSE-RG on data from non-probability samples.


Linking survey data on migrants and refugees to administrative register data: Consenter selectivity and cross-validation in the SOEP-CMI-ADIAB sample

Mr Manfred Antoni (IAB - Institute for Employment Research)
Mr Mattis Beckmannshagen (The German Institute for Economic Research - DIW)
Mr Markus Grabka (The German Institute for Economic Research - DIW)
Mr Sekou Keita (IAB - Institute for Employment Research) - Presenting Author
Ms Parvati Trübswetter (IAB - Institute for Employment Research)

Abstract
The SOEP-CMI-ADIAB record linkage project linked the Socio-Economic Panel (SOEP) core, migration, and innovation (CMI) samples to administrative data available at the IAB (ADIAB) to generate a new dataset available for research and policy advice. This study examines the quality of the successfully linked data in terms of selectivity. In addition, it identifies steps that can be taken as best practices in handling the data to minimize potential problems resulting from measurement inaccuracies or selectivity of the sample.

1. Selectivity analyses
Selectivity and thus lack of representativeness of the SOEP-CMI-ADIAB sample can arise at three levels.
a) At the level of consent: Were respondents with certain characteristics more likely to give consent to data linkage than others?
b) At the level of data linkage: Does the probability of success of data linkage vary based on certain characteristics of the respondents?
Logistic regressions within each relevant population are used to test whether certain characteristics had significant influence at any of the three levels. Subsequently, hints are given on how to compensate for potential selectivity by creating weighting factors that take all levels into account.
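The weighting idea described above, estimating the probability of consent (or successful linkage) from respondent characteristics and down-weighting over-represented groups, can be sketched with a logistic regression and inverse-propensity weights. This is a generic illustration, not the project's actual weighting procedure; variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def consent_weights(X, consented):
    """Estimate consent propensities by logistic regression and return
    inverse-propensity weights for consenting respondents (sketch).

    X: respondent characteristics (n x p array).
    consented: 0/1 indicator of linkage consent.
    """
    model = LogisticRegression().fit(X, consented)
    p = model.predict_proba(X)[:, 1]            # estimated consent propensity
    w = np.where(consented == 1, 1.0 / p, 0.0)  # weight only linked cases
    # Normalize so the weights sum to the full sample size
    w[consented == 1] *= len(consented) / w.sum()
    return w
```

A full implementation would repeat this at each selectivity level (consent, linkage success) and multiply the resulting factors.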

2. Cross-validation of the gross compensation and part-time/full-time variables
Thanks to the linked dataset, two sources of information exist for many respondents on the same (or at least very similar) questions. For example, the administrative data contain information on employees' gross daily pay, while the SOEP asks about gross monthly pay. Likewise, the administrative data record whether employees work full or part time, while the SOEP survey data contain actual as well as contractually agreed weekly working hours.
On this basis, cross-validations can be performed. Does the SOEP information agree (approximately) with the administrative data?
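As a simple illustration of such a cross-validation, administrative gross daily pay can be scaled to a monthly figure and compared with the SOEP's self-reported gross monthly pay. The average-month factor used here (365.25 / 12 ≈ 30.44 days) is an assumption for illustration only, not the project's conversion rule.

```python
DAYS_PER_MONTH = 365.25 / 12  # illustrative average-month factor

def monthly_from_daily(gross_daily_pay, days_per_month=DAYS_PER_MONTH):
    """Scale administrative gross daily pay to a monthly equivalent."""
    return gross_daily_pay * days_per_month

def relative_gap(survey_monthly, admin_daily):
    """Relative deviation of self-reported monthly pay from the
    admin-derived monthly figure (sketch of a cross-validation check)."""
    admin_monthly = monthly_from_daily(admin_daily)
    return (survey_monthly - admin_monthly) / admin_monthly
```

In practice one would inspect the distribution of this gap across respondents to flag systematic reporting or measurement differences.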


Assessing the Generalizability of Imputation-Based Integration of Wearable Sensor Data and Survey Self-Reports: Insights From a Simulation Study

Ms Deji Suolang (University of Michigan - Ann Arbor) - Presenting Author
Dr Brady West (University of Michigan - Ann Arbor)

Utilizing sensor data in survey research offers advantages such as granular insights, validation of self-report errors, and increased analytical power. However, sensor data collection is costly and demands significant administrative efforts, limiting its feasibility for large-scale, representative samples. A prior study (Suolang & West, 2024) developed an imputation-based approach using the National Health and Nutrition Examination Survey (NHANES) 2011-2014, which includes both self-reported and sensor data on physical activity, to mass-impute sensor values for the same years’ National Health Interview Survey (NHIS), a much larger dataset relying solely on self-reports. This simulation builds on those results, evaluating the method’s robustness and generalizability across different scenarios.

Using simulated data sampled from a synthetic NHANES-NHIS population, we investigated five factors influencing imputation performance for two sensor-measured outcome variables (physical activity duration and activity patterns): 1) Whether participation in sensor data collection depends on observed or unobserved factors; 2) Missing rates set high at 50%, 70%, and 90%, to mimic real-world scenarios where survey data is collected from a large sample and costly sensor measurements from a smaller subset; 3) Availability of shared auxiliary variables between datasets, with varying levels of predictive strength; 4) Whether the imputation model accounts for normality violations in outcome variables; and 5) Regression-based imputation versus predictive mean matching.
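Factor 5 contrasts regression-based imputation with predictive mean matching (PMM). As a point of reference, PMM fits a regression of the sensor outcome on shared predictors, then donates to each missing case the observed value of a case with a similar predicted mean. The sketch below is a minimal generic PMM, not the study's imputation model; names and data are illustrative.

```python
import numpy as np

def pmm_impute(x_obs, y_obs, x_mis, k=5, seed=None):
    """Predictive mean matching (sketch): regress y on x among observed
    cases, then for each missing case donate the observed y of one of
    the k cases with the closest predicted mean."""
    rng = np.random.default_rng(seed)
    X_obs = np.column_stack([np.ones(len(x_obs)), x_obs])
    X_mis = np.column_stack([np.ones(len(x_mis)), x_mis])
    beta, *_ = np.linalg.lstsq(X_obs, y_obs, rcond=None)
    yhat_obs = X_obs @ beta
    yhat_mis = X_mis @ beta
    imputed = np.empty(len(x_mis))
    for i, yh in enumerate(yhat_mis):
        donors = np.argsort(np.abs(yhat_obs - yh))[:k]  # k nearest predicted means
        imputed[i] = y_obs[rng.choice(donors)]          # donate an observed value
    return imputed
```

Unlike regression imputation, PMM only ever donates observed values, which helps when the outcome violates normality, one of the factors the simulation varies.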

Results highlight the critical role of the missingness mechanism, emphasizing the need to address non-ignorable selection bias. Imputation performance declines as missing rates increase, while additional explanatory predictors enhance the model’s resilience to missing data. We evaluated relative bias, standard errors, mean squared errors, fractions of missing information, and coverage rates of estimated means for two outcomes. This research offers practical guidance for data integration through mass imputation, identifying scenarios where such efforts and resources are justified.