
ESRA 2023 Glance Program


All time references are in CEST

Survey Data Integration: Nonprobability Surveys, Administrative and Digital Trace Data

Session Organisers Dr Camilla Salvatore (Utrecht University)
Dr Angelo Moretti (Utrecht University)
Time: Tuesday 18 July, 09:00 - 10:30
Room

Given the declining response rates and increasing costs associated with traditional probability-based sample surveys, researchers and survey organisations are increasingly investigating the use of alternative data sources, such as nonprobability sample surveys, administrative data and digital trace data.
While initially considered as potential replacements, it is now clear that the most promising role for these alternative data sources is supplementing probability-based sample surveys. Indeed, the use of auxiliary data is a considerable opportunity as it often allows for timeliness, data with detailed time intervals, and geographical granularity, among others.
This session will discuss cutting-edge methodologies and innovative case studies that integrate diverse data sources in survey research. It will highlight strategies for improving inference, assessing data quality, and addressing biases (e.g. selection and measurement). Attendees will gain insights into the latest advancements in data integration techniques, practical applications, and future directions for survey research in an increasingly complex data environment.

Keywords: data integration, online surveys, digital trace, selection bias

Papers

Transforming UK Travel and Tourism Statistics: From Survey Methods to Admin Data

Mrs Sabina Kastberg (Office for National Statistics)
Miss Claudia Jenkins (Office for National Statistics) - Presenting Author

Following a review of our statistics in 2019, the UK Office for National Statistics (ONS) are transforming the way we collect travel and tourism (T&T) data to deliver more efficient, accurate and coherent T&T statistics.

Since July 2024, ONS have moved away from the sole reliance on the International Passenger Survey for T&T data and instead have updated our methods to harmonise with 3 other data providers:
• ONS have harmonised with the Civil Aviation Authority; this has meant adapting our survey design to use a gate room model and port-side approach which will boost sample sizes and increase precision.
• ONS have introduced the use of household survey data (working in partnership with National Tourism Associations on the use of The Great Britain Tourism Survey) to replace interview operations for UK resident arrivals, thus reducing costs and improving response rates.
• ONS are now working in partnership with other research agencies (the Northern Ireland Statistics and Research Agency) to share data and improve coherence.

However, the ONS’s longer-term ambition is to move to an administrative data model with machine learning methodologies. Primarily, ONS have been exploring the possibilities that short-term accommodation, mobility, and financial transaction data (e.g. Airbnb, Advanced Passenger Information and visa card data) can bring to better understand international mobility and patterns of spending inside and outside the UK.

This presentation will talk through the approach, research and complexities involved in introducing novel methodologies for measuring T&T, as well as outline some initial results from comparing T&T estimates derived from survey methods with those derived from administrative data. In particular, the presentation will provide an overview of the methodological challenges faced when replacing surveys, e.g. considerations and potential solutions for linking administrative datasets with no common linkage variable.


Using linked cohort data to help address residual confounding in analyses of population administrative data

Dr Richard Silverwood (Centre for Longitudinal Studies, University College London) - Presenting Author
Dr Gergo Baranyi (Centre for Longitudinal Studies, University College London)
Professor Lisa Calderwood (Centre for Longitudinal Studies, University College London)
Professor Bianca De Stavola (Population, Policy & Practice Department, UCL Great Ormond Street Institute of Child Health, University College London)
Professor George Ploubidis (Centre for Longitudinal Studies, University College London)
Professor Ian White (MRC Clinical Trials Unit at UCL, University College London)
Professor Katie Harron (Population, Policy & Practice Department, UCL Great Ormond Street Institute of Child Health, University College London)

Analyses of population administrative data can often only be minimally adjusted due to the unavailability of a full set of control variables, leading to bias due to residual confounding. Cohort studies will often contain rich information on potential confounders but may not be sufficiently powered to meaningfully address the research question of interest. We aimed to use linked cohort data to help address residual confounding in analyses of population administrative data.

We propose a multiple imputation-based approach, introduced through application to simulated data in three different scenarios related to the structure of the datasets. We then apply this approach to a real-world example – examining the association between pupil mobility (changing schools at non-standard times) and Key Stage 2 (age 11) attainment using data from the UK National Pupil Database (NPD). The limited control variables available in the NPD are supplemented by multiple measures of socioeconomic deprivation captured in linked Millennium Cohort Study (MCS) data.
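
The general idea can be sketched in a few lines of Python. The snippet below is our own minimal illustration, not the authors' implementation, and uses simulated data with hypothetical variable names (mobility, fsm, deprivation) in place of the NPD/MCS variables: the cohort-only confounder is treated as missing for all non-linked records, multiply imputed by chained equations, and the fully adjusted outcome model is pooled across imputations with Rubin's rules.

```python
# Minimal sketch, not the authors' code: simulated register data with a rich
# confounder observed only in a small "linked cohort" subsample.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation.mice import MICE, MICEData

rng = np.random.default_rng(1)
n = 5000                                      # hypothetical register size

deprivation = rng.normal(0, 1, n)             # rich confounder (cohort-only)
mobility = (rng.normal(0, 1, n) + 0.8 * deprivation > 1.0).astype(int)
attainment = -0.3 * mobility - 0.5 * deprivation + rng.normal(0, 1, n)
fsm = rng.binomial(1, 0.3, n).astype(float)   # limited control in the register

data = pd.DataFrame({"attainment": attainment, "mobility": mobility,
                     "fsm": fsm, "deprivation": deprivation})

# Only the linked cohort subsample observes the rich confounder.
linked = data.sample(500, random_state=1).index
data.loc[~data.index.isin(linked), "deprivation"] = np.nan

# Chained-equations imputation of the cohort-only confounder, followed by the
# fully adjusted outcome model, combined across imputations (Rubin's rules).
imp = MICEData(data)
mice = MICE("attainment ~ mobility + fsm + deprivation", sm.OLS, imp)
print(mice.fit(n_burnin=5, n_imputations=20).summary())
```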

The proposed approach is observed to perform well when using simulated data across the different scenarios. The association between pupil mobility and Key Stage 2 attainment was attenuated after supplementing the NPD analysis with information from linked MCS data, though with a decrease in precision.

We have demonstrated the potential of the proposed approach, but more work is required to understand whether and how it can be applied more broadly. The principles underlying this innovative approach are widely applicable: any analysis of administrative data where confounder control is limited by the availability of information could potentially be strengthened by linking a subset of individuals into richer cohort data and leveraging the additional information to inform population-level analyses.


Using calibration for inference with non-probability samples when the linear model for the outcome is unsuitable

Ms Lisa Braito (University of Florence) - Presenting Author
Professor Emilia Rocco (University of Florence)

Unlike probability sampling, which relies on randomization and known, non-zero inclusion probabilities, nonprobability sampling uses arbitrary selection methods, resulting in unknown inclusion probabilities and making design-based inference methods inapplicable. Additionally, this kind of selection may introduce selection bias. Consequently, making inferences with a nonprobability sample requires using auxiliary variables at the population level. Since these are often unavailable, auxiliary data from a reference probability sample are crucial.

Based on how they utilize such data, methods for making inferences with nonprobability samples can be categorized into three groups: those that use the data to estimate propensity scores for units in the nonprobability sample, those that use them to predict outcome values in the reference sample, and the so-called doubly robust methods, which integrate a model for the propensity scores with a model for the outcome. The third category also incorporates calibration, which does not explicitly define the two models but implicitly relies on their assumptions. In the classical framework, calibration weights are derived by minimizing entropy, subject to the constraint that they replicate auxiliary variable totals from the probability sample, thereby implicitly relying on a log-linear model for the propensity score and a linear model for the outcome.

First, we conduct an extensive simulation study to compare various estimators from the expanding literature on the topic and discuss their performance, focusing on their effectiveness in reducing bias under model misspecification. Then, since calibration produces robust estimates if at least one model is valid, and the linear model may be inappropriate for many outcomes, we propose using augmented calibration, recently introduced in the context of causal inference for integrating randomized controlled trials and observational data. Augmented calibration allows us to account for the nonlinear nature of the outcome while also enabling more flexible approaches to modeling it.
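
As a rough illustration of the classical calibration step described above (our own sketch with simulated data, not the authors' code), the following Python snippet solves for raking-type weights of exponential form so that the weighted auxiliary totals of the nonprobability sample reproduce benchmarks that would, in practice, be estimated from the reference probability sample.

```python
# Minimal sketch of classical calibration for a nonprobability sample.
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(0)

# Hypothetical auxiliary design matrix (intercept, binary and continuous
# covariates) and outcome for the nonprobability sample.
n = 2000
X = np.column_stack([np.ones(n),
                     rng.binomial(1, 0.6, n),
                     rng.normal(0.3, 1.0, n)])
y = 2.0 + 1.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(0, 1, n)

# Benchmark totals for the auxiliary variables; in practice design-weighted
# estimates from the reference probability sample, here rescaled to n.
benchmarks = np.array([n, 0.5 * n, 0.0])

def calibration_equations(lam):
    w = np.exp(X @ lam)            # entropy/raking weights, always positive
    return X.T @ w - benchmarks    # weighted totals must match the benchmarks

lam = root(calibration_equations, x0=np.zeros(X.shape[1])).x
w = np.exp(X @ lam)

print("unweighted mean:", y.mean().round(3))
print("calibrated mean:", ((w @ y) / w.sum()).round(3))
```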


Applying inverse probability weighting and mass imputation to estimate means in non-probability samples from the international Pandemic Preparedness Behaviour Survey

Dr Maartje Boer (National Institute for Public Health and the Environment (RIVM)) - Presenting Author
Dr Mart van Dijk (National Institute for Public Health and the Environment (RIVM))
Dr Saskia Euser (National Institute for Public Health and the Environment (RIVM))
Dr Floor Kroese (National Institute for Public Health and the Environment (RIVM))
Dr Jet Sanders (National Institute for Public Health and the Environment (RIVM))

In March 2024, the international Pandemic Preparedness Behaviour (PPB) monitor was initiated by the Dutch National Institute for Public Health and the Environment (n = 4636) to measure people's thoughts and behaviours related to pandemic preparedness and the prevention of infectious diseases. The survey allows for cross-national comparisons between four countries. However, the sampling method differed per country: both probability and non-probability research panels were used, with varying recruitment strategies, limiting comparability of findings.

To improve representativeness of estimates from the PPB data and to enhance cross-national comparability, we applied Inverse Probability Weighting (IPW) and Mass Imputation (MI) to estimate means of a selected outcome: trust in the countries’ legal system. We integrated the PPB samples with data from the same countries participating in the European Social Survey (ESS; n = 6804). The ESS is a high-quality international probability-based survey on public attitudes and behaviours. ESS data were used as a benchmark to calculate IPW weights for respondents in the PPB data and to conduct MI by imputing the selected outcome for respondents in the ESS data. IPW and MI means were estimated after calibration to the countries’ distributions of gender, age, educational level, and region. The selected outcome was also assessed in the ESS survey, providing observed means against which the IPW and MI estimates were compared.
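
As a schematic illustration of the two estimators (not the RIVM implementation; all variable names, sample sizes and data are invented), the Python sketch below computes an IPW mean by modelling membership in the nonprobability sample against a design-weighted reference sample, and a mass-imputation mean by predicting the outcome for every reference-sample unit.

```python
# Toy sketch of IPW and mass imputation for a nonprobability sample.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)

# Nonprobability sample (PPB-like): covariates plus the observed outcome.
nps = pd.DataFrame({"age": rng.integers(18, 80, 1500),
                    "female": rng.binomial(1, 0.6, 1500),
                    "trust": rng.normal(5.0, 2.0, 1500)})
# Probability reference sample (ESS-like): covariates and design weights only.
ref = pd.DataFrame({"age": rng.integers(18, 80, 2000),
                    "female": rng.binomial(1, 0.5, 2000),
                    "dweight": rng.uniform(0.5, 2.0, 2000)})
X_np, X_ref = nps[["age", "female"]], ref[["age", "female"]]

# IPW: model membership in the nonprobability sample against the weighted
# reference sample, then weight respondents by the inverse of the estimated
# odds (one common variant of the pseudo-weights).
Z = pd.concat([X_np, X_ref], ignore_index=True)
member = np.r_[np.ones(len(nps)), np.zeros(len(ref))]
sw = np.r_[np.ones(len(nps)), ref["dweight"].to_numpy()]
p = LogisticRegression().fit(Z, member, sample_weight=sw).predict_proba(X_np)[:, 1]
ipw_mean = np.average(nps["trust"], weights=(1 - p) / p)

# Mass imputation: fit an outcome model on the nonprobability sample, predict
# the outcome for every reference unit, and average with the design weights.
mi_pred = LinearRegression().fit(X_np, nps["trust"]).predict(X_ref)
mi_mean = np.average(mi_pred, weights=ref["dweight"])

print(f"IPW mean: {ipw_mean:.2f}   MI mean: {mi_mean:.2f}")
```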

Findings showed that, compared to observed means of trust in the legal system in the ESS, PPB mean estimates based solely on calibration were (strongly) underestimated in all four countries. MI improved mean estimates satisfactorily in all four countries, provided the model was not misspecified, while with IPW improvement was observed in only two countries. Hence, MI was recommended as the most appropriate strategy to estimate and compare means based on the PPB data. Limitations will be discussed.


One harmonization fits all? - Investigating the reusability of equating solutions across populations and time

Mr Matthias Roth (GESIS – Leibniz Institute for the Social Sciences) - Presenting Author
Dr Ranjit K. Singh (GESIS – Leibniz Institute for the Social Sciences)

Combining survey data from multiple sources can help to enrich time series or to increase the sample size of populations of interest. However, the design of questions (e.g. number and labeling of response options) measuring the same construct can differ widely among surveys. This leads to a need for harmonization methods which can align differences in measurements so that data can be combined and analyzed together. Recently, it has been shown that observed score equating in a random groups design (OSE-RG) efficiently harmonizes data from questions measuring latent constructs, such as attitudes and values. In essence, OSE-RG transforms the response scale of one question into the format of a second question.
However, an OSE-RG response scale transformation needs to be calibrated on data randomly drawn from the same population at similar times. This requirement can make it difficult to apply OSE-RG to data from non-probability samples and to time-series survey data. One way to still use OSE-RG in these settings is to re-use existing OSE-RG response scale transformations calibrated on data from probabilistic samples. In this presentation, we investigate the re-usability of OSE-RG response scale transformations across populations and over time, synthesizing results from two studies.
We used data from general population surveys to artificially create situations in which we calibrate an OSE-RG response scale transformation in one population and then reuse the calibrated transformation to harmonize data from other populations or years. We find that reusability over populations and time depends on how different the calibration and application populations are. Reusability also differs by the construct measured. We close by discussing how the results transfer to the application of OSE-RG to data from non-probability samples.
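
For readers unfamiliar with OSE-RG, the toy Python sketch below (our own illustration, not the GESIS implementation) shows the core of an equipercentile response scale transformation in a random groups design, mapping a simulated 5-point item onto a 7-point format by matching percentile ranks.

```python
# Toy equipercentile equating in a random groups design (OSE-RG).
import numpy as np

rng = np.random.default_rng(3)
scores_a = rng.integers(1, 6, 1200)   # 5-point question, random group A
scores_b = rng.integers(1, 8, 1300)   # 7-point question, random group B

def percentile_rank(x, scores):
    """Mid-percentile rank of score x in an observed score distribution."""
    return 100 * (np.mean(scores < x) + 0.5 * np.mean(scores == x))

def equate(x, source_scores, target_scores):
    """Map score x from the source format to the target format by matching
    its percentile rank against the target group's observed distribution."""
    return np.percentile(target_scores, percentile_rank(x, source_scores))

# The calibrated, reusable transformation table from the 5- to the 7-point format.
transformation = {x: round(equate(x, scores_a, scores_b), 2) for x in range(1, 6)}
print(transformation)
```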


Linking survey data on migrants and refugees to administrative register data: Consenter selectivity and cross-validation in the SOEP-CMI-ADIAB sample

Mr Manfred Antoni (IAB - Institute for Employment Research)
Mr Mattis Beckmannshagen (The German Institute for Economic Research - DIW)
Mr Markus Grabka (The German Institute for Economic Research - DIW)
Mr Sekou Keita (IAB - Institute for Employment Research) - Presenting Author
Ms Parvati Trübswetter (IAB - Institute for Employment Research)

The SOEP-CMI-ADIAB record linkage project linked the Socio-Economic Panel (SOEP) core, migration, and innovation (CMI) samples to administrative data available at the IAB (ADIAB) to generate a new dataset available for research and policy advice. This study examines the quality of the successfully linked data in terms of selectivity. In addition, it identifies steps that can be taken as best practices in handling the data to minimize potential problems resulting from measurement inaccuracies or selectivity of the sample.

1. Selectivity analyses
Selectivity and thus lack of representativeness of the SOEP-CMI-ADIAB sample can arise at three levels.
a) At the level of consent: Were respondents with certain characteristics more likely to give consent to data linkage than others?
b) At the level of data linkage: Does the probability of success of data linkage vary based on certain characteristics of the respondents?
Logistic regressions within each relevant population are used to test whether certain characteristics had significant influence at any of the three levels. Subsequently, hints are given on how to compensate for potential selectivity by creating weighting factors that take all levels into account.
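
A minimal sketch of one such selectivity check (our own illustration with invented variable names, not the SOEP-CMI-ADIAB code) regresses consent to linkage on respondent characteristics and derives an inverse-probability-of-consent adjustment that can be multiplied into the existing survey weights.

```python
# Sketch of a consent-selectivity analysis with an inverse-probability adjustment.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 3000
df = pd.DataFrame({"consent": rng.binomial(1, 0.7, n),        # 1 = linkage consent
                   "age": rng.integers(18, 80, n),
                   "female": rng.binomial(1, 0.5, n),
                   "migrant": rng.binomial(1, 0.3, n),
                   "design_weight": rng.uniform(0.5, 2.0, n)})

# Logistic regression: which characteristics predict consent?
model = smf.logit("consent ~ age + female + migrant", data=df).fit(disp=0)
print(model.summary())

# Inverse-probability-of-consent factor for the linked (consenting) subsample.
df["p_consent"] = model.predict(df)
linked = df[df["consent"] == 1].copy()
linked["adjusted_weight"] = linked["design_weight"] / linked["p_consent"]
```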

2. Cross-validation of gross compensation and the part-time/full-time variable
Because of the linked data set, two sources of information exist for many respondents on the same (or at least very similar) questions. For example, the administrative data contain information on employees' gross daily pay, while the SOEP asks about gross monthly pay. At the same time, the administrative data contain information on whether employees work full or part time, while the SOEP survey data contain actual as well as contractually agreed weekly working hours.
On this basis, cross-validations can be performed. Does the SOEP information agree (approximately) with the administrative data?


Assessing the Generalizability of Imputation-Based Integration of Wearable Sensor Data and Survey Self-Reports: Insights From a Simulation Study

Ms Deji Suolang (University of Michigan - Ann Arbor) - Presenting Author
Dr Brady West (University of Michigan - Ann Arbor)

Utilizing sensor data in survey research offers advantages such as granular insights, validation of self-report errors, and increased analytical power. However, sensor data collection is costly and demands significant administrative efforts, limiting its feasibility for large-scale, representative samples. A prior study (Suolang & West, 2024) developed an imputation-based approach using the National Health and Nutrition Examination Survey (NHANES) 2011-2014, which includes both self-reported and sensor data on physical activity, to mass-impute sensor values for the same years’ National Health Interview Survey (NHIS), a much larger dataset relying solely on self-reports. This simulation builds on those results, evaluating the method’s robustness and generalizability across different scenarios.

Using simulated data sampled from a synthetic NHANES-NHIS population, we investigated five factors influencing imputation performance for two sensor-measured outcome variables (physical activity duration and activity patterns): 1) Whether participation in sensor data collection depends on observed or unobserved factors; 2) Missing rates set high at 50%, 70%, and 90%, to mimic real-world scenarios where survey data is collected from a large sample and costly sensor measurements from a smaller subset; 3) Availability of shared auxiliary variables between datasets, with varying levels of predictive strength; 4) Whether the imputation model accounts for normality violations in outcome variables; and 5) Regression-based imputation versus predictive mean matching.
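
To make the last factor concrete, the sketch below (our own illustration with simulated donor and recipient data, not the study's code) imputes a sensor outcome once by regression prediction and once by predictive mean matching, where each recipient borrows the observed value of the donor with the closest predicted mean.

```python
# Regression-based imputation versus predictive mean matching (PMM).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)

# Donor survey (NHANES-like): self-report plus sensor measurement.
n_donor, n_recipient = 1000, 9000
x_donor = rng.normal(30, 10, n_donor)                  # self-reported activity
y_donor = 0.7 * x_donor + rng.normal(0, 5, n_donor)    # sensor-measured activity
# Recipient survey (NHIS-like): self-report only, sensor outcome missing.
x_recipient = rng.normal(32, 10, n_recipient)

model = LinearRegression().fit(x_donor.reshape(-1, 1), y_donor)

# 1) Regression-based imputation: impute the model's predicted values.
y_reg = model.predict(x_recipient.reshape(-1, 1))

# 2) PMM: each recipient receives the observed sensor value of the donor whose
#    predicted mean is closest to the recipient's own prediction.
pred_donor = model.predict(x_donor.reshape(-1, 1))
nearest = np.abs(pred_donor[None, :] - y_reg[:, None]).argmin(axis=1)
y_pmm = y_donor[nearest]

print("regression-imputed mean:", y_reg.mean().round(2))
print("PMM-imputed mean:       ", y_pmm.mean().round(2))
```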

Results highlight the critical role of the missingness mechanism, emphasizing the need to address non-ignorable selection bias. Imputation performance declines as missing rates increase, while additional explanatory predictors enhance the model’s resilience to missing data. We evaluated relative bias, standard errors, mean squared errors, fractions of missing information, and coverage rates of estimated means for two outcomes. This research offers practical guidance for data integration through mass imputation, identifying scenarios where such efforts and resources are justified.


Integration During the Entire Survey Process: The Usage of Survey and Administrative Data in the Austrian Student Social Survey.

Mrs Vlasta Zucha (Institute for Advanced Studies) - Presenting Author

Survey research can be supplemented and improved by using alternative data sources such as administrative data. Besides linking data at the individual level, great benefits can be gained from close linkage with administrative data across the entire research process.

The Austrian Student Social Survey (ASSS) is used to demonstrate how versatile administrative data can be utilized in survey research even without data linkage at the level of individuals. The innovative case study shows that administrative data can, for example, support the construction of the questionnaire, help with the preparation of fieldwork and support data processing. Aggregated information from administrative data is also used for weighting and plausibility checks during data cleaning. Survey and register data do not only interact during data collection and data preparation: survey data are also supplemented at the aggregate level with information from administrative data. Finally, the two data sources are combined in the analysis. Both similar and different content is analysed and published in combined reports (Zucha et al. 2024; Haag et al. 2024).

The inclusion of administrative data in survey research can be challenging. Differences in data structures and conceptual differences between similar variables pose challenges for data harmonisation, integration, and interpretation. To ensure that the usage of the different data sources is appropriate and successful, specialised teams work together on the ASSS.

The conference contribution will provide insights into the benefits, but also the challenges, of including administrative data in survey research, even if there is no possibility of linking the data at the individual level.


Integrating administrative data with survey data to address attrition in longitudinal postdoctoral career research

Mr Kevin Schönholzer (University of Berne, Interfaculty Centre for Educational Research (ICER)) - Presenting Author
Mrs Barbara Wilhelmi (University of Berne, Interfaculty Centre for Educational Research (ICER))
Dr Janine Lüthi (University of Bern, Interdisciplinary Centre for Gender Studies (ICFG))

This case study examines the integration of administrative data with survey data from the Swiss National Science Foundation’s Career Tracker Cohorts (CTC) project, which has been monitoring the career trajectories of early-career researchers in Switzerland since 2018. By following 4’053 postdoctoral researchers, the CTC study captures comprehensive insights into their grant outcomes, demographic profiles, employment conditions, research productivity, and professional perspectives. The central research interest of the CTC study is the comparison of academic and non-academic career paths and the extent to which SNSF grants have contributed to the achievement of career goals.
As with many longitudinal studies, the CTC project faces challenges related to attrition, leading to missing data that may not be missing completely at random (i.e. MAR or NMAR). The missing data are mainly related to funding status: researchers without SNSF funding drop out at a higher rate than those with SNSF funding. To address this, we link the survey data with administrative records via a unique identifier. This linkage enriches the survey data, enhancing statistical power, allowing biases to be assessed, and supplementing missing data.
For the analysis, we capitalize on a major advantage of the CTC study. The initial survey, conducted prior to funding decisions, achieved an exceptionally high response rate, providing nearly complete coverage of the target population. First, we assess whether there is selection bias in the administrative data and how it interacts with attrition in the survey data; second, we compare linked variables across both data sets to identify any systematic deviations. Our findings are expected to provide valuable insights into the potential and limitations of integrating administrative data with survey data in longitudinal studies.


Harnessing AI and Big Data for Model-based Inference

Dr Josef Hartmann (Verian) - Presenting Author
Dr Georg Wittenburg (Inspirient)

Model-based inference relies on best linear unbiased predictor (BLUP) models to predict the characteristics of the population elements that are not part of the sample. Typically, these models are fitted to the sample to identify best linear unbiased estimators.
We explore the opportunity of applying AI models, particularly Machine Learning (ML), to complement or improve the prediction step in model-based inference in survey research, emphasizing their potential to quantify information from open-ended responses as well as from all other available information. In the field of AI, data, including Internet content, can be conceptualized as a vast set of question-answer pairs. These pairs, when utilized as training data, are compressed within model parameters, enabling the model to return context-appropriate responses and their distribution. In combination with the sample information, this can be used to infer the distribution in the superpopulation and thus the most probable finite population values.
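
A schematic Python sketch of this prediction step (simulated data, not the authors' system) fits a linear working model and a machine-learning model on the sample, predicts the outcome for the non-sampled population units, and assembles the finite population mean from observed plus predicted values.

```python
# Model-based inference: predict non-sampled units, then assemble the mean.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)

# Finite population with auxiliary information known for every unit.
N, n = 50_000, 1_000
X = np.column_stack([rng.integers(18, 80, N),      # age
                     rng.binomial(1, 0.5, N),      # gender
                     rng.binomial(1, 0.6, N)])     # employment status
income = 15_000 + 400 * X[:, 0] + 8_000 * X[:, 2] + rng.gamma(2, 5_000, N)

# Simple random sample: the outcome is observed only for sampled units.
mask = np.zeros(N, dtype=bool)
mask[rng.choice(N, size=n, replace=False)] = True

def model_based_mean(model):
    fitted = model.fit(X[mask], income[mask])
    y_pred = fitted.predict(X[~mask])              # predict non-sampled units
    return (income[mask].sum() + y_pred.sum()) / N

print("linear predictor:", round(model_based_mean(LinearRegression())))
print("ML predictor:    ", round(model_based_mean(GradientBoostingRegressor())))
print("true pop. mean:  ", round(income.mean()))
```
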
To illustrate this method, we plan to use it to predict income levels, generating a prediction of an income distribution. We evaluate how basic demographic information (e.g., age, gender, locality, employment status...) can help to improve the inference step.
The method is particularly valuable due to its flexibility in setting research criteria (e.g., income, political alignment, etc.) as well as in combining it with different kinds of input data (transaction data, social media data, administrative data) and data types (qualitative and quantitative). This research presents a novel approach to integrating AI into model-based inference in survey research.


Administrative Data Collection for the Social Sciences

Mrs Lisa Ziemba (Statistics Austria) - Presenting Author

Using administrative data as a supplementary source promises great analysis potential, as it provides researchers with a wealth of highly reliable information on individuals that is rarely available in traditional survey data. However, acquiring and subsequently working with administrative data is a demanding task, because its documentation mainly serves purposes within the European Statistical System and is not aimed at researchers, as it assumes critical prior knowledge. International researchers in particular face many barriers, such as language or knowledge of certain laws. This creates the need to provide documentation that is tailored to the target audience in order to effectively communicate the particularities of the data at hand, including data collection and processing.
The Administrative Data Collection for the Social Sciences (ADCOL) is meant to play a pivotal role in supporting social scientific research based on administrative data in Austria. It is made up of approximately 100 variables sourced from registries at the federal statistical office of Austria, covering six areas of life: family & demographics, housing, health, labor, income, and education. It features research-friendly documentation describing the data products at Statistics Austria from which the variables were selected.
In the presentation we share the research potential of the ADCOL using different use case examples. These potentials include the use of ADCOL as a register-based socio-economic household panel, as tracing individuals with the same main residence in a longitudinal manner is possible. Further, the data allows for longitudinal and multi-level analysis, as it provides information on individuals and households starting from 2015. Lastly, it is possible to link ADCOL data to other administrative data or surveys conducted in Austria.
Our presentation illustrates a best practice example of facilitating and documenting administrative data for their scientific use. The ADCOL provides a valuable data source for research to supplement traditional surveys.


What’s New in Data Integration? A Systematic Review and Data Typology.

Dr Thomas O'Toole (The University of Manchester) - Presenting Author

The integration of survey and non-survey data (e.g. administrative records, geospatial characteristics and digital trace data) can provide researchers with access to a breadth of rich and detailed information for use in applied and methodological fields. However, the extent to which, the methods by which, and the aims for which various sources of non-survey data are integrated with survey data remain unclear.
This systematic review identifies the types and characteristics of commonly integrated survey and non-survey data sets, in addition to their integration purpose and methodology. Literature searches were conducted for peer-reviewed and pre-print articles concerning the use of data integration/linkage using Ovid (including the Cochrane Library, APA PsycInfo, Embase, EconLit and Medline), Web of Science (Core Collection), Scopus and Google Scholar (for grey literature). We also conducted snowball searches and accessed existing collections of publications from CLOSER and the UK Longitudinal Linkage Collaboration.
Results were used to construct a typology of integrated data sources available to researchers, covering survey and non-survey data types (and where to access them), and the purpose and methodology of the integration, including linkage level and consent. We further discuss the current data integration landscape and identify gaps for future exploration in data integration literature.