ESRA 2025 Preliminary Program
All time references are in CEST
Missing Data, Selection Bias and Informative Censoring in Cross-Sectional and Longitudinal Surveys 2
Session Organisers: Dr Angelina Hammon (SOEP, DIW Berlin), Mx Char Hilgers (SOEP, DIW Berlin), Professor Sabine Zinn (SOEP, DIW Berlin)
Time: Wednesday 16 July, 16:00 - 17:30
Room: Ruppert 002
Sample selection bias, item non-response, and dropouts (a form of censoring) are common challenges in large-scale population surveys. In longitudinal surveys, selection bias occurs at the start, item non-response during the survey, and dropout (censoring) at the end. In cross-sectional surveys, selection bias and non-response are the primary sources of missing data. These issues can severely impact the quality of analysis and the validity of inferences if not properly addressed.
A special challenge for analysis arises when the mechanism driving one of these phenomena depends (additionally) on unobserved information, making the missing data non-random and potentially leading to non-ignorable selection bias, informative censoring, or non-ignorable missing data. Whenever standard assumptions may not hold, it is therefore crucial to assess the robustness of results under different plausible assumptions about the missing-data, selection, or censoring mechanism.
In this session, we welcome research on novel and innovative methods that prevent misleading inference under one or several of the described challenges related to incomplete, selection-biased, or censored survey data. This research might cover:
1. Use cases showing the harm of non-ignorable selection bias, informative censoring, or non-ignorable missing data.
2. Novel approaches for detecting (non-ignorable) selection bias in traditional surveys and non-probability samples.
3. Novel imputation procedures, likelihood-based approaches, machine learning and tree-based methods, and Bayesian estimation techniques to address (non-ignorable) missing data and/or informative censoring.
4. Methods for conducting sensitivity analyses in cases where deviations from missing at random mechanisms are realistic.
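The sensitivity analyses of topic 4 are often carried out via a delta adjustment: missing values are imputed under increasingly strong departures from missing at random, and one checks at which point the conclusions tip. The sketch below illustrates this idea only; the function, the shift parameter, and the toy data are hypothetical and not drawn from any paper in this session.

```python
# Illustrative delta-adjustment sensitivity analysis for non-ignorable
# missingness. All names and data here are invented for illustration.

def delta_adjusted_mean(values, delta):
    """Impute each missing value (None) with the observed mean shifted by
    delta, then return the completed-data mean. delta = 0 corresponds to a
    MAR-style mean imputation; delta != 0 probes an MNAR departure."""
    observed = [v for v in values if v is not None]
    obs_mean = sum(observed) / len(observed)
    completed = [v if v is not None else obs_mean + delta for v in values]
    return sum(completed) / len(completed)

# Toy outcome with 40% missingness.
y = [10.0, 12.0, 9.0, None, 11.0, None, 13.0, None, 10.0, None]

# Tipping-point style scan: how far must delta move before conclusions change?
for delta in (-2.0, 0.0, 2.0):
    print(delta, round(delta_adjusted_mean(y, delta), 3))
```

In practice the same scan would be run inside a proper multiple-imputation procedure rather than single mean imputation, but the logic of varying an untestable offset is the same.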
Keywords: (non-ignorable) missing data, multiple imputation, (non-ignorable) selection bias, informative censoring, panel dropout, missing not at random, sensitivity analysis
Papers
Improving the quality of statistical matching using auxiliary information
Dr Angelo Moretti (Utrecht University) - Presenting Author
Professor Natalie Shlomo (University of Manchester)
National Statistical Institutes are increasingly interested in integrating datasets that encompass a wide range of social domains. Statistical matching techniques provide a framework to combine such data sources using common variables when the single datasets include different statistical units from the same target population.
A common challenge, however, arises from the assumption of conditional independence between variables observed in different datasets. To address this issue, an auxiliary dataset containing all variables jointly can be used to enhance statistical matching by including information on the correlation structure across datasets.
We propose modifying the prediction models from the auxiliary dataset through a calibration step and show that we can improve the outcome of statistical matching in a variety of scenarios.
The approach is evaluated through large-scale model and design-based simulation studies and an application involving the United Kingdom (UK) European Union Statistics on Income and Living Conditions and the UK Living Costs and Food Survey.
Analyzing Potential Non-Ignorable Selection Bias in an Off-Wave Mail Survey Implemented in a Long-Standing Panel Study
Dr Brady West (University of Michigan) - Presenting Author
Mrs Heather Schroeder (University of Michigan)
Typical design-based methods for weighting probability samples rely on several assumptions, including the random selection of sampled units according to known probabilities of selection and ignorable unit nonresponse. If any of these assumptions are not met, weighting methods that combine the probabilities of selection, nonresponse adjustment, and calibration may not fully account for the potential selection bias in a given sample, which could produce misleading population estimates. We investigate possible selection bias in the 2019 Health Survey Mailer (HSM), a sub-study of the longitudinal Health and Retirement Study (HRS). The primary HRS data collection has occurred in “even” years since 1992, but additional survey data collections take place in the “off-wave” odd years via mailed invitations sent to selected participants. While the HSM achieved a high response rate (83%), the assumption of ignorable probability-based selection of HRS panel members may not hold due to the eligibility criteria that were imposed.
To investigate this possible non-ignorable selection bias, our analysis utilizes a novel analysis method for estimating measures of unadjusted bias for proportions (MUBP), introduced by Andridge and colleagues in 2019. This method incorporates aggregate information from the larger HRS target population, including means, variances, and covariances for key covariates related to the HSM variables, to inform estimates of proportions. We explore potential non-ignorable selection bias by comparing proportions calculated from the HSM under three conditions: ignoring HRS weights, weighting based on the usual design-based approach for HRS “off-wave” mail surveys, and using the MUBP adjustment. We find examples of differences between the weighted and MUBP-adjusted estimates for four out of the 10 outcomes that we analyzed. However, these differences are modest, and while this result gives some evidence of non-ignorable selection bias, typical design-based weighting methods proved largely sufficient.
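The first two of the three estimators compared above can be sketched in a few lines: a proportion that ignores the weights versus a design-weighted proportion. This is only a generic illustration with made-up data; it does not reproduce the HRS weighting scheme or the MUBP adjustment itself.

```python
# Sketch: unweighted vs. design-weighted proportion estimates, the first two
# conditions compared in the abstract above. Data and weights are invented.

def proportion(outcomes, weights=None):
    """Estimate P(outcome = 1); with weights, a weighted proportion."""
    if weights is None:
        weights = [1.0] * len(outcomes)
    total = sum(weights)
    return sum(w for y, w in zip(outcomes, weights) if y == 1) / total

y = [1, 0, 1, 1, 0, 0, 1, 0]                   # binary survey outcome
w = [2.0, 1.0, 1.0, 3.0, 1.0, 2.0, 1.0, 1.0]   # hypothetical design weights

print(proportion(y))      # ignoring weights
print(proportion(y, w))   # design-weighted
```

A gap between such estimates and an MUBP-adjusted estimate is what would signal non-ignorable selection.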
Mitigation of selection bias in non-probability samples under non-ignorable selection mechanisms
Dr Ramón Ferri-García (University of Granada) - Presenting Author
Mr José Juan García-Rodríguez (University of Granada)
Professor María del Mar Rueda (University of Granada)
Non-probability surveys have become a common choice for researchers because of their ease of implementation and their lower costs. However, the samples produced by such surveys are often affected by selection biases, such as coverage errors or self-selection of participants. Several techniques have been developed to mitigate the biases that these errors could produce in the final estimates, such as Propensity Score Adjustment, Statistical Matching or Doubly Robust estimators. These methods assume that the selection mechanism is ignorable, that is, that participation in such surveys is determined by some auxiliary variables that may be related to the variable of interest of the study. For many applications this assumption does not hold, as participation in the survey is directly related to the variable of interest itself. On such occasions, adjustments for selection bias will not be able to remove it because they might not represent the participation mechanism properly. For example, self-selection into political surveys may be driven by interest in politics, meaning that estimates of political engagement from those surveys will be biased regardless of the estimators we use.
In this study, we review the current literature on non-ignorable selection biases, which has been developed mainly in the non-response context, and adapt some of the approaches to the non-probability sampling context. To do so, we develop a taxonomy of the types of non-ignorable selection mechanisms that may occur in real situations and explore how to address bias mitigation in each using two-step modelling approaches.
Data visualization for incomplete datasets in R
Ms Hanne Oberman (Utrecht University) - Presenting Author
In many data analysis efforts, missing data are conveniently ignored. With default settings such as ‘list-wise deletion’ in analysis software, analysts need not even bother with the ubiquitous problem of incomplete data. I argue that this is wasteful: not only can missing data bias analysis results if not addressed well, but moreover, the missing data itself can provide valuable insights into the phenomena of interest.
The visualization of incomplete data can uncover associations and intricacies between variables that may otherwise be overlooked. These insights, in turn, can be leveraged to amend the missingness by means of imputation. The R package {ggmice} aids data analysts in exploring the missing parts of their data. In this presentation, I will showcase the use and usefulness of a data visualization workflow for incomplete datasets in R.
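To make the idea concrete, a missingness map lines up records so that joint missingness across variables becomes visible at a glance. The snippet below is a package-free Python analogue of this exploration step with toy data; it is not the {ggmice} API, which produces ggplot2 graphics in R.

```python
# Text-based missingness map: one line per record, one column per variable.
# Data and variable names are invented for illustration.

def missingness_map(rows, columns):
    """Render '#' where a value is observed and '.' where it is missing (None)."""
    lines = ["".join("#" if r[c] is not None else "." for c in columns)
             for r in rows]
    return "\n".join(lines)

data = [
    {"age": 34,   "income": 52000, "vote": None},
    {"age": 51,   "income": None,  "vote": 1},
    {"age": None, "income": None,  "vote": 0},
]

print(missingness_map(data, ["age", "income", "vote"]))
# A column of '.' repeated across many rows suggests variables that tend to
# be missing together, a pattern worth modelling rather than ignoring.
```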
Correcting Selection Bias in Contingency Tables
Ms An-Chiao Liu (Utrecht University) - Presenting Author
Dr Sander Scholtus (Statistics Netherlands)
Professor Katrijn Van Deun (Tilburg University)
Professor Ton De Waal (Statistics Netherlands, Tilburg University)
When inferring population characteristics from a nonprobability sample, i.e., a sample that does not come from a known sampling scheme, it is crucial to correct for the possible selection bias therein. Selection bias correction methods often focus on estimating means or totals of target variables at the population level. However, researchers are often also interested in estimates within subgroups of the population. In this paper, we apply two small area estimation methods to nonprobability samples: a design-based method using iterative proportional fitting, and a model-based method based on a hierarchical Bayesian model. These methods are combined with an often-used method for selection bias correction, namely pseudo-weighting. A simulation study and an application to a real dataset are presented. Although no single method is suitable in all scenarios, we observed some patterns that may guide the choice of method in the future.
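The design-based ingredient named above, iterative proportional fitting (raking), alternately rescales the rows and columns of a contingency table until its margins match known population totals. The sketch below shows this for a 2x2 table; the seed counts and margin targets are invented, and the paper's pseudo-weighting step is not reproduced.

```python
# Iterative proportional fitting (raking) for a 2x2 contingency table.
# Seed table and margin targets are hypothetical illustration data.

def ipf(table, row_targets, col_targets, iterations=50):
    """Rescale rows and columns in turn until the table's margins match the
    target row and column totals (assumes strictly positive cells)."""
    t = [row[:] for row in table]
    for _ in range(iterations):
        for i, target in enumerate(row_targets):        # match row margins
            s = sum(t[i])
            t[i] = [x * target / s for x in t[i]]
        for j, target in enumerate(col_targets):        # match column margins
            s = sum(row[j] for row in t)
            for i in range(len(t)):
                t[i][j] *= target / s
    return t

# A biased sample cross-tabulation vs. known population margins.
sample = [[40.0, 10.0], [20.0, 30.0]]
fitted = ipf(sample, row_targets=[60.0, 40.0], col_targets=[50.0, 50.0])
print([[round(x, 2) for x in row] for row in fitted])
```

The fitted table preserves the sample's interaction structure (odds ratio) while agreeing with the population margins, which is what makes raking attractive for subgroup estimation.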