All time references are in CEST
The Use of Machine Learning Techniques When Dealing with Missing Data
Session Organisers | Dr Barbara Felderer (GESIS), Dr Christian Bruch (GESIS)
Time | Tuesday 18 July, 09:00 - 10:30 |
Room |
Machine learning methods have become very popular in survey research in recent years because of their ability to handle large data sets and dependencies between variables, and because of their flexibility in modeling complex relationships. While machine learning is optimized for prediction tasks, explanation may be improved using causal machine learning. Although both approaches have frequently been used to model social behavior, their use for typical survey-methodological and survey-statistical applications, such as analyses based on missing data, is still rather rare.
Concerning missing data, one can think of at least three areas of application:
1. Understanding nonresponse behaviour (e.g., in the context of nonresponse bias analyses) and developing targeted recruitment designs
2. Modelling unit nonresponse and generating adjustment weights
3. Modelling item nonresponse when applying imputation methods
We invite papers addressing the potential of machine learning methods to deal with any kind of missing data. Contributions from fields beyond social science that tackle the challenges of nonresponse are also welcome.
Keywords: machine learning, missing data, nonresponse, weighting, imputation
Dr Barbara Felderer (GESIS) - Presenting Author
Dr Christian Bruch (GESIS)
Mr Björn Rohr (GESIS)
Nonresponse weighting is an important tool for improving the representativeness of surveys, e.g. by weighting respondents according to their inverse propensity to respond (IPW). IPW estimates a person's propensity to respond based on characteristics that are available for both respondents and nonrespondents. While logistic regression is typically used to estimate the response propensity, machine learning methods offer several advantages: they allow for very flexible estimation of relationships and the inclusion of a large number of potentially correlated predictor variables. ML methods are known to predict values very accurately. However, it is also known that the estimation of the relationships between the weighting variables and the response propensity suffers from regularization bias. With regard to weighting, it is unclear which of these properties is more relevant and has a greater influence on the quality of the weighted estimate. In this study, we address the question of whether machine learning methods outperform logistic regression in performing IPW.
In a simulation study that mimics the three nonresponse models (separate cause model, common cause model, survey variable cause model) and varies the number of features that affect nonresponse, we apply IPW weighting using five different prediction models: Regression Trees (CART), Random Forest, Boosting, Lasso, and Logistic Regression. We conclude the analysis with an application to voting decisions collected in the German Internet Panel.
Machine learning methods perform similarly well to logistic regression and lead to a lower variance in the estimates. Overall, the advantage of excellent prediction seems to outweigh the disadvantage of regularization bias.
The presentation provides guidance on how to improve the weighting of surveys, which is a crucial task when drawing conclusions about the general population from a survey.
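For readers who want to experiment with the approach sketched in this abstract, a minimal illustration of IPW with machine-learning propensity models is given below. It assumes a frame-level data set with a 0/1 response indicator and auxiliary variables known for respondents and nonrespondents; all variable names, model settings, and the propensity trimming threshold are illustrative and not taken from the study.

```python
# Sketch: inverse propensity weighting (IPW) with ML response-propensity models.
# Assumes a frame-level DataFrame with auxiliary variables known for all sampled
# units and a binary response indicator `responded`; names are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

def ipw_weights(X, responded, model):
    """Fit a response-propensity model and return inverse-propensity weights
    for the responding units only."""
    model.fit(X, responded)
    p_hat = model.predict_proba(X)[:, 1]
    p_hat = np.clip(p_hat, 0.01, 1.0)   # trim propensities to avoid extreme weights
    return 1.0 / p_hat[responded == 1]

def weighted_mean(y, w):
    """Weighted estimate of a survey variable observed for respondents only."""
    return np.sum(w * y) / np.sum(w)

# Candidate propensity models (hyperparameters are placeholders, not tuned values)
models = {
    "logit": LogisticRegression(max_iter=1000),
    "lasso_logit": LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=500, min_samples_leaf=20),
}

# Hypothetical usage, with `frame` a DataFrame covering respondents and
# nonrespondents and `y_respondents` a survey variable (e.g. a voting decision):
# for name, m in models.items():
#     w = ipw_weights(frame[aux_cols].values, frame["responded"].values, m)
#     print(name, weighted_mean(y_respondents, w))
```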
Professor Heather Kitada Smalley (Willamette University) - Presenting Author
The U.S. Census Bureau conducts the American Community Survey (ACS) every year, providing valuable data for researchers and policymakers. Among its key data quality metrics are unit nonresponse and allocation rates, which help gauge the accuracy of survey responses and are often used as indicators of missing or incomplete data. This research relies solely on publicly available resources from the U.S. Census Bureau, including the Public Use Microdata Sample (PUMS), data dictionaries, and questionnaires for each year. The data is restructured to follow the sequence in which a respondent would encounter the survey, allowing for the identification of missing data and NAs, which are often influenced by skip logic. These missing values can provide valuable insights into nonresponse behavior and help researchers identify areas where respondents are more likely to skip or abandon questions. Unsupervised machine learning techniques are then used to identify patterns and clusters in nonresponse behavior, highlighting key factors that may contribute to missing or incomplete data. These insights can be used to identify specific patterns of response behavior, such as particular demographic groups or survey sections where nonresponse is more prevalent. In the future, these findings could help design targeted interventions to reduce missing responses and prevent survey drop-offs, ultimately improving the quality and completeness of survey data.
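A minimal sketch of the kind of unsupervised analysis described above, assuming the PUMS extract has already been reordered to follow the questionnaire flow; the file name, the number of clusters, and the treatment of skip-logic NAs are placeholders, not details from the study.

```python
# Sketch: clustering item-nonresponse patterns with k-means on a PUMS extract.
import pandas as pd
from sklearn.cluster import KMeans

pums = pd.read_csv("psam_p01.csv")   # any PUMS person file; path is illustrative

# 0/1 matrix of item missingness (1 = item not answered; in a real analysis,
# skip-logic NAs and allocation flags would need separate handling)
miss = pums.isna().astype(int)

# Drop items that are never or always missing -- they carry no clustering signal
rates = miss.mean()
miss = miss.loc[:, (rates > 0) & (rates < 1)]

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(miss)

# Per-cluster missingness profile: which survey sections drive each cluster
profile = miss.groupby(km.labels_).mean()
print(profile.round(2))
```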
Mr Christian Bruch (GESIS Leibniz Institute for the Social Sciences) - Presenting Author
Mr Julian Axenfeld (German Institute for Economic Research (DIW Berlin))
In recent years, machine learning methods such as random forests, neural networks, or k-means clustering have often been discussed as a means to improve the imputation of missing values in social surveys over the established imputation methods.
However, machine learning techniques are often based on complex algorithms with many components that have to be estimated. Furthermore, many of these algorithms were originally developed for “big data” problems in which tremendous amounts of observations and variables are used, for instance, to predict a target variable as accurately as possible. Survey data imputation is different, though: there are often only up to a few thousand observations available, and the main goal of imputation typically is to preserve the relationships between all the different variables, rather than to predict variables as accurately as possible using a (data-driven) selection of predictors. In addition, many relationships of interest in survey data tend to be relatively weak, which makes them difficult for imputation techniques to detect and reproduce accurately, even with the established methods. The question therefore is whether (and to what extent) machine learning methods are capable of reproducing relationships in the data in practice.
To evaluate the procedures, we use Monte Carlo simulation studies. To ensure realistic conditions, we simulate item nonresponse in real survey data. In this presentation, we will show first findings on how the different procedures affect correlation estimates after imputation.
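As a rough illustration of the kind of evaluation described, the sketch below imposes MAR item nonresponse on a synthetic complete data set, imputes it with a random-forest-based iterative imputer, and compares the correlation before and after imputation; the data-generating model and missingness mechanism are purely illustrative and not those of the simulation study.

```python
# Sketch: does a random-forest-based imputation preserve a bivariate correlation?
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000                                        # "a few thousand observations"

# Complete synthetic "survey" with a moderate correlation between x1 and x2
x1 = rng.normal(size=n)
x2 = 0.3 * x1 + rng.normal(size=n)              # true corr about 0.29
x3 = rng.normal(size=n)
data = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Impose MAR item nonresponse on x2, depending on the observed x1
p_miss = 1 / (1 + np.exp(-(x1 - 0.5)))
incomplete = data.copy()
incomplete.loc[rng.random(n) < 0.4 * p_miss, "x2"] = np.nan

# Random-forest-based iterative imputation (single imputation, for brevity)
imp = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100),
                       max_iter=5, random_state=0)
completed = pd.DataFrame(imp.fit_transform(incomplete), columns=data.columns)

print("correlation before missingness:", round(data["x1"].corr(data["x2"]), 3))
print("correlation after imputation  :", round(completed["x1"].corr(completed["x2"]), 3))
```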
Mr Michael Bergrab (LIfBi – Leibniz-Institut für Bildungsverläufe e.V.) - Presenting Author
Variable and model selection is a cornerstone of modern statistical analysis and machine learning, balancing the trade-off between bias and variance to achieve optimal model performance. This process often involves either theory-driven selection, guided by prior knowledge, or data-driven statistical methods to identify key predictors from among p candidate variables. While many statistical and machine learning techniques separate variable selection from parameter estimation, the Bayesian framework inherently integrates these processes. To address the challenge of missing data, an extended Bayesian approach is adopted that simultaneously imputes missing values and performs variable selection and estimation in a unified procedure. Using data from the German National Educational Panel Study (NEPS), this Bayesian routine is compared with alternative methods such as LASSO, ridge regression, elastic net, and machine-learning-based boosting algorithms, which require multiple imputation to handle missing values. Preliminary findings suggest strong comparability across techniques, despite their differing computational demands. These results highlight the importance of the assumptions underlying both theory-driven and data-driven variable selection, emphasizing their critical influence on model outcomes.
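The abstract does not specify an implementation; as a rough sketch of the comparison arm only (penalised regression fitted on multiply imputed data), the following assumes a NEPS-style analysis file with missing values, with the outcome name, the number of imputations, and pooling by simple coefficient averaging all chosen for illustration. The joint Bayesian imputation-and-selection routine itself is not reproduced here.

```python
# Sketch: LASSO fitted on several imputed data sets, coefficients averaged.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LassoCV

def lasso_on_multiple_imputations(df, outcome, m=5):
    """Impute m times (stochastic chained-equations-style imputation), fit
    LassoCV on each completed data set, and pool coefficients by averaging.
    Predictors should ideally be standardized before penalised fitting."""
    coefs = []
    for i in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=i)
        completed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
        X = completed.drop(columns=outcome)
        y = completed[outcome]
        fit = LassoCV(cv=5).fit(X, y)
        coefs.append(pd.Series(fit.coef_, index=X.columns))
    return pd.concat(coefs, axis=1).mean(axis=1)

# Hypothetical usage (df would be an analysis file with missing values and a
# numeric outcome; "competence_score" is an invented name):
# pooled = lasso_on_multiple_imputations(df, outcome="competence_score")
# print(pooled[pooled.abs() > 0].sort_values())
```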
Dr Darcy Morris (U.S. Census Bureau) - Presenting Author
Declining response rates and data collection interruptions are producing a missing-data complexity that the traditional missing data techniques used in Census Bureau survey processing may not flexibly capture. At the same time, the availability and linkability of administrative records, third-party data, and previous census/survey data have improved, allowing for more informative response propensity models. These developments lend themselves to the study of data-driven enhancements of inverse probability weighting (IPW) methods to adjust for unit nonresponse. We study and compare the use of traditional statistical models and machine learning algorithms applied to complex survey data for model-based IPW nonresponse adjustment using auxiliary sources, with multiple years of American Community Survey data. We share various measures for model comparison, for the evaluation of outcome estimates, and for visualizing geographically differentiated results.
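As a hedged illustration of what such model comparisons might look like in practice, the sketch below computes cross-validated discrimination and calibration measures for candidate propensity models, plus a simple summary of how variable the implied weights would be; the metrics and models are generic choices and not the Census Bureau's actual evaluation criteria.

```python
# Sketch: simple measures for comparing response-propensity models.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import cross_val_predict

def compare_propensity_models(X, responded, models, cv=5):
    """Cross-validated discrimination (AUC), calibration (Brier score), and
    coefficient of variation of the implied inverse-propensity weights."""
    results = {}
    for name, model in models.items():
        p_hat = cross_val_predict(model, X, responded, cv=cv,
                                  method="predict_proba")[:, 1]
        w = 1.0 / np.clip(p_hat, 0.01, 1.0)
        results[name] = {
            "auc": roc_auc_score(responded, p_hat),
            "brier": brier_score_loss(responded, p_hat),
            "weight_cv": np.std(w) / np.mean(w),
        }
    return results

# Hypothetical usage with auxiliary variables X_aux and a 0/1 response indicator:
# models = {"logit": LogisticRegression(max_iter=1000),
#           "boosting": GradientBoostingClassifier()}
# print(compare_propensity_models(X_aux, responded, models))
```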