ESRA 2025 Preliminary Program
All time references are in CEST
Missing Data, Selection Bias and Informative Censoring in Cross-Sectional and Longitudinal Surveys

Session Organisers: Dr Angelina Hammon (SOEP, DIW Berlin), Mx Char Hilgers (SOEP, DIW Berlin), Professor Sabine Zinn (SOEP, DIW Berlin)
Time: Wednesday 16 July, 14:00 - 15:00
Room: Ruppert 0.33
Sample selection bias, item non-response, and dropouts (a form of censoring) are common challenges in large-scale population surveys. In longitudinal surveys, selection bias occurs at the start, item non-response during the survey, and dropout (censoring) at the end. In cross-sectional surveys, selection bias and non-response are the primary sources of missing data. These issues can severely impact the quality of analysis and the validity of inferences if not properly addressed.
A special challenge for analysis arises when the mechanism driving one of these phenomena additionally depends on unobserved information, making the missingness non-random and potentially leading to non-ignorable selection bias, informative censoring, or non-ignorable missing data. Consequently, whenever standard assumptions may not hold, it is crucial to assess the robustness of results under different plausible assumptions about the missing-data, selection, or censoring mechanisms.
In this session, we welcome research on novel and innovative methods to prevent misleading inference under one or several of the described challenges related to incomplete, selectively sampled, or censored survey data. This research might cover:
1. Use cases showing the harm of non-ignorable selection bias, informative censoring, or non-ignorable missing data.
2. Novel approaches for detecting (non-ignorable) selection bias in traditional surveys and non-probability samples.
3. Novel imputation procedures, likelihood-based approaches, machine learning and tree-based methods, and Bayesian estimation techniques to address (non-ignorable) missing data and/or informative censoring.
4. Methods for conducting sensitivity analyses in cases where deviations from missing-at-random mechanisms are realistic (a minimal illustration follows this list).
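One widely used device under topic 4 is the delta adjustment: values imputed under a MAR assumption are shifted by increasing offsets to probe how conclusions change as the departure towards MNAR grows. The Python sketch below is purely illustrative; the simulated income variable, the single-imputation scheme, and the multiplicative delta grid are assumptions for demonstration, not methods from any paper in this session.

    import numpy as np

    rng = np.random.default_rng(3)

    # Simulated income-like variable whose larger values are more often missing (MNAR).
    n = 5_000
    y = rng.lognormal(mean=10.0, sigma=0.5, size=n)
    missing = rng.random(n) < 1.0 / (1.0 + np.exp(-(np.log(y) - 10.0)))
    y_obs = y[~missing]

    # MAR-style single imputation: draw replacements from the observed distribution.
    imp = rng.choice(y_obs, size=missing.sum(), replace=True)

    # Delta adjustment: inflate the imputations by a grid of offsets and
    # track how the estimate of interest (here, the mean) moves.
    for delta in (0.00, 0.05, 0.10, 0.20):
        y_full = np.concatenate([y_obs, imp * (1.0 + delta)])
        print(f"delta={delta:.2f}  estimated mean={y_full.mean():,.0f}")
    print(f"true mean (known only in simulation): {y.mean():,.0f}")

If the substantive conclusion is stable across the delta grid, deviations from MAR of that magnitude are unlikely to overturn it.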
Keywords: (non-ignorable) missing data, multiple imputation, (non-ignorable) selection bias, informative censoring, panel dropout, missing not at random, sensitivity analysis
Papers
Averaging Non-Probability Online Surveys to Avoid Maximal Estimation Error
Mr Alexander Murray-Watters (University of California Irvine)
Dr Stefan Zins (Institute for Employment Research) - Presenting Author
Professor Joseph W. Sakshaug (Institute for Employment Research)
Dr Carina Cornesse (GESIS)
Data from online non-probability samples are often analyzed as if they were based on a simple random sample drawn from the general population. As the exact sampling frame for these non-probability samples is usually unknown, there is no general method to construct unbiased estimators. This raises the question of whether estimates based on online non-probability samples are consistent across sample vendors and with estimates based on probability samples. To address this question, we analyze data collected from eight different online non-probability sample vendors and one online probability-based sample. We find that estimates from the different non-probability samples can be very inconsistent. We suggest averaging estimates across multiple vendor samples to avoid the risk of maximal estimation error. We evaluate several averaging approaches, including a LASSO regression procedure that identifies a subset of vendors which, when averaged, produce estimates that are more consistent with the reference probability-based estimates than those of any single vendor.
Our results show that estimates based on different vendors' samples display different selection biases, but there is also some commonality among vendor-specific estimates; hence, there could be strong gains in estimation precision from averaging across a selection of multiple non-probability sample vendors.
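The abstract does not detail the averaging procedure, so the following Python sketch only illustrates the general idea on simulated data: average vendor-specific estimates, and use a LASSO fit against probability-based reference estimates to select a vendor subset. The number of vendors and outcomes, the bias magnitudes, and the LASSO penalty are all illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n_outcomes, n_vendors = 24, 8

    # Hypothetical population proportions and vendor-specific selection biases.
    truth = rng.uniform(0.1, 0.6, n_outcomes)
    bias = rng.normal(0.0, 0.05, n_vendors)
    vendor_est = truth[:, None] + bias[None, :] + rng.normal(0.0, 0.02, (n_outcomes, n_vendors))
    ref_est = truth + rng.normal(0.0, 0.01, n_outcomes)  # probability-based reference

    # Simple average across all vendors.
    simple_avg = vendor_est.mean(axis=1)

    # LASSO of the reference estimates on the vendor estimates: vendors with
    # non-zero coefficients form the selected subset.
    lasso = Lasso(alpha=0.001, positive=True, fit_intercept=False).fit(vendor_est, ref_est)
    selected = np.flatnonzero(lasso.coef_ > 0)
    subset_avg = vendor_est[:, selected].mean(axis=1)

    def max_abs_error(est):
        return np.max(np.abs(est - truth))

    print("worst single vendor:", max(max_abs_error(vendor_est[:, j]) for j in range(n_vendors)))
    print("all-vendor average: ", max_abs_error(simple_avg))
    print("LASSO-selected avg: ", max_abs_error(subset_avg))

Averaging shrinks the vendor-specific biases towards their common component, which is why the maximal error of an average is typically smaller than that of the worst single vendor.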
Filling in the Blanks: Augmenting Survey Data Imputation with External Data and Rubin's SIR Algorithm
Ms Char Hilgers (DIW Berlin, Socio-Economic Panel) - Presenting Author
Professor Sabine Zinn (DIW Berlin, Socio-Economic Panel)
Multiple imputation of missing values in survey data analysis is a state-of-the-art technique. Typically, methods like multivariate imputation by chained equations (mice, van Buuren 2018) are employed, replacing missing values on a variable-by-variable basis. The information used for imputation usually comes from the survey dataset being analysed. Valid analysis results are achieved when the missing values are either missing completely at random (MCAR) or missing at random (MAR). However, the situation becomes more complex if the values are missing not at random (MNAR).
There are several approaches to handle this issue. One incorporates sensitivity analyses into the imputation to make it as robust as possible. Alternatively, the dataset to be imputed can be enriched with further information so that an MNAR mechanism becomes MAR, in which case the imputation and the analysis of the imputed data can be valid. The advantages of this approach are clear, but often the full range of variables in the dataset is already included in the imputation, and the suspicion of MNAR still remains.
We present a new method that integrates external data into the mice imputation process to reduce the risk of MNAR and better justify the assumption of a MAR mechanism.
Specifically, we integrate Rubin's SIR (Sampling/Importance Resampling) algorithm (Rubin 1987) into the mice framework to incorporate external distribution information for the variable of interest. Importance ratios, derived from the differences between the external distribution and the survey data's estimated distribution, guide the selection of replacement values for missing data. We also provide an estimate of the uncertainty introduced by the method.
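The authors' method embeds this step within mice in R; as an illustration of the SIR idea itself, the Python sketch below draws candidate replacement values from a survey-based proposal distribution and resamples them with weights proportional to the ratio of an external target density to the proposal density. The normal densities and all parameter values are hypothetical assumptions.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)

    # Proposal: survey-based predictive distribution (e.g., log income among respondents).
    mu_survey, sd_survey = 10.2, 0.8
    # Target: external distribution, e.g., from administrative records.
    mu_ext, sd_ext = 10.5, 0.7

    n_missing, pool_factor = 100, 20
    candidates = rng.normal(mu_survey, sd_survey, size=n_missing * pool_factor)

    # Importance ratios: external density over proposal density, normalised.
    w = norm.pdf(candidates, mu_ext, sd_ext) / norm.pdf(candidates, mu_survey, sd_survey)
    w /= w.sum()

    # Sampling/importance resampling: draw replacements without replacement,
    # with probability proportional to the importance ratios.
    imputed = rng.choice(candidates, size=n_missing, replace=False, p=w)

    print(f"mean of imputed values: {imputed.mean():.2f} (pulled towards the external mean)")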
We demonstrate the effectiveness of our new approach with a simulation involving the imputation of a typical income variable. Additionally, we apply this method to two datasets from the German Socio-Economic Panel Study.
Data Integration Using Doubly Robust and Bayesian Meta-Analysis Methods for Probability and Convenience Samples: Analysis of Natsal-4
Dr Tommy Nyberg (University of Cambridge) - Presenting Author
Dr Shaun Seaman (University of Cambridge)
Professor Andrew Copas (University College London)
Ms Soazig Clifton (University College London; National Centre for Social Research; University of Glasgow)
Dr Elise Paul (University College London)
Professor Catherine H Mercer (University College London)
Professor Pam Sonnenberg (University College London)
Ms Katharine Sadler (University College London; National Centre for Social Research)
Dr Anne M Presanis (University of Cambridge)
Background
Natsal-4 is a study of sexual and reproductive health, attitudes and lifestyles in Great Britain in 2022-2024. Here, we describe methods applied to integrate the multimodal survey data and to estimate the population prevalence of 24 outcomes concerning sexual practices, health, reproduction, and other behaviours.
Methods
Natsal-4 included two probability surveys (PSs) collected through in-person or telephone interview, or online questionnaire; and an online panel convenience survey (OPCS). Additional historical data were available from the Natsal-3 PS and four OPCSs collected in 2012. We used multiple imputation to address item nonresponse, and calculated weighted estimates based on each sample separately. We propose two data integration approaches. (1) Doubly robust (DR) estimation combines an inverse probability weighting estimator and a mass imputation estimator using covariates from the OPCS and reference PSs; we used demographic, health, attitude and behavioural covariates. (2) A Bayesian meta-analysis (BMA) hierarchical model assumes that differences between estimates from OPCSs and PSs are drawn from a common distribution within domains of related outcomes. We used the 2012 data to elicit prior information about such differences.
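The abstract does not spell out the DR estimator; the Python sketch below follows the standard construction for combining a nonprobability sample with a reference probability sample (a propensity model fitted by a weighted pseudo-likelihood, an outcome model used for mass imputation, and a bias-corrected combination of the two). The simulated data and the simple working models are assumptions for illustration, not the Natsal-4 implementation.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(2)

    # Illustrative population: outcome y depends on a covariate x.
    N = 100_000
    x = rng.normal(size=N)
    y = 2.0 + 1.5 * x + rng.normal(size=N)

    # Convenience sample: self-selection probability increases with x (selection bias).
    sel = rng.random(N) < 1.0 / (1.0 + np.exp(-(-4.0 + 1.2 * x)))
    X_c, y_c = x[sel, None], y[sel]

    # Reference probability sample: simple random sample with known design weight.
    ps = rng.choice(N, size=2_000, replace=False)
    X_ps, d_ps = x[ps, None], N / 2_000

    # (a) Propensity model: stack the convenience sample (label 1, weight 1) and the
    #     design-weighted PS (label 0, weight d_ps) -- a common pseudo-likelihood device.
    X_stack = np.vstack([X_c, X_ps])
    lab = np.r_[np.ones(len(X_c)), np.zeros(len(X_ps))]
    w = np.r_[np.ones(len(X_c)), np.full(len(X_ps), d_ps)]
    pi_hat = (LogisticRegression(C=1e6)
              .fit(X_stack, lab, sample_weight=w)
              .predict_proba(X_c)[:, 1])

    # (b) Outcome model fitted on the convenience sample (mass imputation to the PS).
    m = LinearRegression().fit(X_c, y_c)

    # (c) Doubly robust combination: IPW residual term plus mass-imputation term.
    mu_dr = (np.sum((y_c - m.predict(X_c)) / pi_hat)
             + np.sum(d_ps * m.predict(X_ps))) / N

    print(f"naive convenience mean: {y_c.mean():.3f}")
    print(f"doubly robust estimate: {mu_dr:.3f}")
    print(f"population mean:        {y.mean():.3f}")

The estimator remains consistent if either the propensity model or the outcome model is correctly specified, which is the sense in which it is doubly robust.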
Results
Preliminary results indicate considerable heterogeneity, with nonoverlapping confidence intervals between the DR covariate-adjusted OPCS estimates and the PS estimates for several outcomes. The median ratio of DR to PS estimates was 1.04 (range 0.83-1.82); nine of the 24 DR estimates were within ±10% of the PS estimates. The BMA similarly indicated heterogeneity between surveys. Further, the estimated differences between sample estimates varied in both magnitude and direction between the 2012 and 2024 data collections.
Conclusion
The heterogeneity may reflect underlying differences between the samples, mode effects, and changes over time. Further evaluation is ongoing of methods for pretesting for conflict and for pooling estimates while accounting for uncertainty.