Predictive Modeling and Machine Learning in Survey Research 1
|
Session Organisers |
Mr Christoph Kern (University of Mannheim)
Mr Ruben Bach (University of Mannheim)
Mr Malte Schierholz (University of Mannheim, Institute for Employment Research (IAB))
Time | Thursday 18th July, 09:00 - 10:30 |
Room | D24 |
Advances in the field of machine learning have created an array of flexible methods for exploring and analyzing diverse data. These methods often do not require prior knowledge about the functional form of the relationship between the outcome and its predictors, while focusing specifically on prediction performance. Machine learning tools thereby offer survey researchers promising ways to tackle emerging challenges in data collection and analysis, and also open up new research perspectives.
On the one hand, utilizing new forms of data gathering, e.g. via mobile web surveys, sensors or apps, often results in (para)data structures that might be difficult to handle -- or fully utilize -- with traditional modeling methods. This might also be the case for data from other sources such as panel studies, in which the wealth of information that accumulates over time gives rise to challenging modeling tasks. In such situations, data-driven methods can help to extract recurring patterns, detect distinct subgroups, or explore non-linear and non-additive effects.
On the other hand, techniques from the field of supervised learning can be used to inform or support the data collection process itself. In this context, various sources of survey error may be framed as prediction problems, and the resulting predictions can be used to develop targeted interventions. This includes, e.g., predicting noncontact, nonresponse or break-offs in surveys to inform adaptive designs that aim to prevent these outcomes. Machine learning provides suitable tools for building such prediction models.
This session welcomes contributions that utilize machine learning methods in the context of survey research. The aim of the session is to showcase the potential of machine learning techniques as a complement and extension to the survey researchers' toolkit in an era of new data sources and challenges for survey science.
Keywords: machine learning, predictive models, data science
Dr James Wagner (University of Michigan) - Presenting Author
Dr Michael Elliott (University of Michigan)
Dr Brady West (University of Michigan)
Ms Stephanie Coffey (US Census Bureau)
Responsive survey designs implement surveys in phases. Each phase is meant to be a separate protocol with different cost and error structures. The goal is to design a series of phases with complementary error structures (e.g. the nonresponse errors are balanced out across the phases) for a fixed budget. Some work has been done to identify when phases are complementary with respect to errors. However, no work has been done to evaluate costs across phases. Without accurate cost estimates, resources may be inefficiently allocated across phases. In this presentation, we will compare statistical and machine learning methods of predicting costs for alternative designs. The first modeling strategy we employ uses multi-level models to predict the number of hours of interviewer time under two different designs. The second approach uses a machine learning method known as Bayesian Additive Regression Trees (BART). We will evaluate the predictive accuracy of the models using data from a real survey. The resulting model predictions would be used as inputs to decision-making in a responsive survey design context. We find that the BART approach is useful for maximizing predictive accuracy, while the multi-level regression models offer an alternative whose results are relatively easy to interpret.
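As an illustration only, the following minimal Python sketch shows the kind of cost-prediction comparison described above: a multi-level model with a random interviewer intercept (via statsmodels) against a tree-ensemble regressor. Since fitting BART itself requires a dedicated package, scikit-learn's gradient boosting stands in for the tree-based approach here, and the data file and all column names are hypothetical placeholders, not the authors' data.

```python
# Illustrative sketch only: compare a multi-level (mixed-effects) model
# with a tree-ensemble stand-in for BART when predicting interviewer hours.
# The data file and all column names are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

costs = pd.read_csv("interviewer_costs.csv")  # hypothetical cost paradata
predictors = ["n_contact_attempts", "prior_hours", "tract_poverty_rate"]

train, test = train_test_split(costs, test_size=0.3, random_state=42)

# Multi-level model: fixed effects for the case-level covariates,
# random intercept per interviewer.
mlm = smf.mixedlm("hours ~ n_contact_attempts + prior_hours + tract_poverty_rate",
                  data=train, groups=train["interviewer"]).fit()
mlm_pred = mlm.predict(test)

# Tree ensemble as a stand-in for BART (a true BART fit would require a
# dedicated Bayesian tree package).
gbt = GradientBoostingRegressor(random_state=42)
gbt.fit(train[predictors], train["hours"])
gbt_pred = gbt.predict(test[predictors])

for name, pred in [("multi-level model", mlm_pred), ("tree ensemble", gbt_pred)]:
    rmse = np.sqrt(mean_squared_error(test["hours"], pred))
    print(f"{name}: RMSE = {rmse:.2f}")
```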
Professor Stephen McKay (University of Lincoln, UK) - Presenting Author
Panel studies face the challenge of non-response: keeping respondents within the panel over many years. Strategies for addressing non-response include (a) changes to data collection (different modes, financial incentives), and (b) methods of data analysis (e.g. weighting, imputation). Statistical models of non-response may be used to create ‘weighting classes’ or to estimate response probabilities from which weights are derived, but statistical models are not necessarily optimised for prediction, whereas ‘machine learning’ (ML) methods are designed specifically for prediction.
In work in progress, machine learning methods are proving superior to existing regression-based models of non-response. In Understanding Society data (wave 1 to wave 2), a reasonable preliminary logistic regression model achieved a prediction mean squared error of 0.183 (not the best measure, but one of the simplest). A ‘model’ using just the constant term had an MSE of 0.191, but a random forest (RF) model using an identical set of independent variables achieved 0.169. The implication is that non-response weights (based on the inverse probability of response) would be more ‘accurate’ if based on the RF predictions rather than the logit predictions, provided this result holds in more complex models.
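A minimal sketch of the kind of comparison reported above, assuming a hypothetical wave-1 data set with placeholder predictors rather than the actual Understanding Society variables: both models predict response at the following wave, prediction MSE is computed as the Brier score on held-out cases, and the random-forest propensities are inverted to form non-response weights.

```python
# Illustrative sketch only: logistic regression vs. random forest for
# wave-to-wave response prediction, followed by inverse-probability
# non-response weights. The data file and columns are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

panel = pd.read_csv("wave1_respondents.csv")      # hypothetical wave-1 file
X = panel[["age", "household_size", "prior_call_attempts"]]
y = panel["responded_wave2"]                      # 1 = responded at wave 2

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr, y_tr)

# Prediction mean squared error on held-out cases (the Brier score).
for name, model in [("logit", logit), ("random forest", rf)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_te, p):.3f}")

# Non-response weights: inverse of the predicted response propensity,
# computed for the cases that actually responded at wave 2.
propensity = rf.predict_proba(X)[:, 1]
weights = 1.0 / propensity[(y == 1).to_numpy()]
```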
Ongoing work is exploring other ML algorithms, such as gradient boosting, as well as more complex sets of independent variables, with the additional aim of contrasting the sets of non-response weights that would be produced by the different approaches. The ML methods also hold the prospect of identifying key variables affecting attrition, which past research (e.g. Kern at JSM 2018) suggests are likely to include paradata relating to past response. Ongoing research is also considering whether response at subsequent waves (e.g. from wave 7 to 8) follows similar patterns in terms of machine learning vs standard statistical models.
Research funding from the University of Essex’s fellowship scheme is gratefully acknowledged.
Mr Nicholas Heck (City, University of London) - Presenting Author
Research combining survey methodology and machine learning has so far primarily focused on predicting survey participation or nonresponse, or has dealt with panel-study contexts (Buskirk & Kolenikov 2015; Kolb, Weiß, Kern 2018; Kirchner & Signorino 2018), while contactability, as the preliminary step to participation in the survey fieldwork process, has received less attention. This project specifically looks at predicting the success of the first contact attempt in the European Social Survey (ESS) in Great Britain, measured as whether an interviewer was able to make any contact at all with a potential respondent, using competing models. In ESS round 8, 42.50% of sampled units could not be contacted at the first attempt and required at least one further contact attempt. If fieldwork agencies knew a specific unit’s contactability in advance, direct and indirect fieldwork costs could be reduced or estimated more precisely, increasing fieldwork management efficiency. Since not much information about a potential unit is available prior to the first successful contact, one appealing approach is to use geospatial variables (e.g. an area’s unemployment rate) to anticipate contactability. Unfortunately, data on a unit’s sampling point is not available, which renders linking external data to a sampling point impossible. A first model thus includes those variables from the ESS that have been shown to be “correlates of contact” in the literature (e.g. Groves & Couper 1998; Stoop 2005; Kreuter 2013; Vicente 2017). While the first model contains variables from the substantive part of the interview, the second model tries to predict first-call contact based on paradata only. For each model, a logistic regression prediction is compared to several machine learning algorithms (e.g. support vector machine, classification tree, naïve Bayes). The project aims to contribute to the discussion of whether machine learning is suitable for survey research questions and to provide further insights for improving survey fieldwork efficiency.
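A minimal sketch of the model comparison described above, assuming a hypothetical data file and placeholder predictors rather than the actual ESS round 8 variables: a logistic regression is benchmarked against a support vector machine, a classification tree and naïve Bayes via cross-validated AUC.

```python
# Illustrative sketch only: benchmark a logistic regression against
# several machine learning classifiers for first-contact success.
# The data file and columns are placeholders, not ESS round 8 variables.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

ess = pd.read_csv("ess_gb_contact.csv")                 # hypothetical file
X = ess[["household_size", "urban_density", "area_unemployment"]]
y = ess["first_contact_success"]                        # 1 = contact at first attempt

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
    "classification tree": DecisionTreeClassifier(max_depth=5),
    "naive Bayes": GaussianNB(),
}

# Compare out-of-sample discrimination (AUC) across 5 cross-validation folds.
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```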
Dr Ariane Würbach (Leibniz Institute for Educational Trajectories (LIfBi)) - Presenting Author
Dr Sabine Zinn (Leibniz Institute for Educational Trajectories (LIfBi))
For more than a decade, survey methodologists have recognized the value of paradata – information accompanying the initiation and realization of interviews. The value of paradata has been tested in distinct contexts, among others to increase the predictive power of nonresponse analyses. Information commonly used for this purpose includes the number of contact attempts and the disposition code, as well as other context information. All of these data are available for respondents and nonrespondents, making them highly attractive. However, after an initial hype, paradata have been found not to necessarily improve analyses and to even harm them under certain circumstances, which is why studying nonresponse bias with paradata has to be done with caution (Kreuter & Olson, 2011).
We are interested in the added value of paradata for predicting the participation status of parents and their children in a long-term study on infants. Most of the current literature focuses on wave-to-wave prediction, limiting its scope to the detection of short-term effects only. We will also analyse the number and composition of respondents who initially started the survey and remain in the panel sample after many waves and years. Such analyses help in planning timelines, determining budgets, and, if necessary, developing intervention strategies for panel maintenance.
Concretely, we use data from the Newborn Cohort of the National Educational Panel Study (NEPS) to gauge the benefits of using paradata when predicting the participation status of child-parent couples. For longitudinal prediction, we apply a discrete-time hazard model with competing risks for wave participation, noncontact, wave nonresponse, and attrition, in order to disentangle potentially opposing effects. Respondent characteristics and paradata, including context information from the current and previous waves, form the set of predictors. Model selection concerning this set of predictors is achieved by an automated supervised learning procedure. We validate our prediction models using leave-one-out cross-validation.
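A discrete-time hazard model with competing risks can be estimated as a multinomial logit on a person-wave (person-period) file. The sketch below illustrates this setup in Python with statsmodels; the data file, the outcome coding and the predictor names are hypothetical placeholders rather than the NEPS variables, and the automated model selection step is omitted.

```python
# Illustrative sketch only: a discrete-time hazard model with competing
# risks, estimated as a multinomial logit on a person-wave file.
# The data file, outcome coding and predictors are placeholders.
import pandas as pd
import statsmodels.api as sm

# One row per child-parent couple per wave; `outcome` codes the competing
# risks: 0 = participation, 1 = noncontact, 2 = wave nonresponse, 3 = attrition.
person_waves = pd.read_csv("person_wave_file.csv")  # hypothetical file

predictors = ["wave", "parent_age", "prior_contact_attempts", "prior_refusal"]
X = sm.add_constant(person_waves[predictors])
y = person_waves["outcome"]

# Multinomial logit over the competing outcomes, i.e. a discrete-time
# competing-risks hazard model on the person-period data.
result = sm.MNLogit(y, X).fit(disp=False)
print(result.summary())

# Predicted wave-specific probabilities for each outcome category.
hazards = result.predict(X)
```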
Dr Christoph Kern (University of Mannheim) - Presenting Author
Dr Bernd Weiß (GESIS - Leibniz Institute for the Social Sciences)
Dr Jan-Philipp Kolb (GESIS - Leibniz Institute for the Social Sciences)
Nonresponse in panel studies can lead to a substantial loss in data quality due to its potential to introduce bias and distort survey estimates. Recent work investigates the use of machine learning methods to predict nonresponse in advance, such that predicted nonresponse propensities can be used to inform the data collection process. However, predicting nonresponse in panel studies requires accounting for the longitudinal data structure in terms of model building, tuning, and evaluation. This study proposes a new framework for predicting nonresponse with machine learning and multiple panel waves and illustrates its application. With respect to model building, the approach utilizes information from multiple panel waves by introducing features that aggregate previous (non)response patterns. Concerning model tuning and evaluation, temporal cross-validation is employed by iterating through pairs of panel waves such that the training and test sets move forward in time. The approach is exemplified with data from a German probability-based mixed-mode panel (GESIS Panel), for which multiple machine learning models were trained and tested across 20 panel waves. Random forests and extremely randomized trees in particular, combined with features that aggregate information over multiple previous waves, achieved competitive prediction performance across all hold-out sets.
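A minimal sketch of the temporal cross-validation scheme described above, assuming a hypothetical dictionary of per-wave feature matrices and nonresponse labels rather than the actual GESIS Panel data: the classifier is trained on wave t and evaluated on wave t + 1 for every adjacent pair of waves, so the training and test sets move forward in time.

```python
# Illustrative sketch only: temporal cross-validation over panel waves,
# with the training and test sets moving forward in time. `waves` maps a
# wave number to a (features, nonresponse-label) pair; the features are
# assumed to include aggregates of previous (non)response patterns.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score


def temporal_cv(waves):
    """Train on wave t, evaluate on wave t + 1, for all adjacent wave pairs."""
    scores = {}
    wave_ids = sorted(waves)
    for t, t_next in zip(wave_ids, wave_ids[1:]):
        X_train, y_train = waves[t]
        X_test, y_test = waves[t_next]
        clf = ExtraTreesClassifier(n_estimators=500, random_state=0)
        clf.fit(X_train, y_train)
        p = clf.predict_proba(X_test)[:, 1]
        scores[(t, t_next)] = roc_auc_score(y_test, p)
    return scores
```

Model tuning would follow the same logic, with tuning folds also ordered in time so that no information from later waves leaks into training.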