Handling missing data 1 |
|
Chair | Professor George Ploubidis (University College London) |
Coordinator 1 | Mr Brian Dodgeon (University College London) |
In many nonobligatory survey studies, a significant proportion of variables are subject to missing not at random (MNAR) processes, in the sense that the probability of a missing value in a variable depends on the variable itself. Especially in longitudinal studies such as the German National Educational Panel Study (NEPS), nonresponse as well as attrition may lead to MNAR patterns, hampering the applicability of standard statistical methods for analyzing longitudinal data. Such standard methods typically rely on the implicit assumption of no missing data at all or of missing completely at random processes. Thus, they do not account for the presence of MNAR patterns in the data, and the corresponding statistical inference might result in invalid research conclusions.
Survey statistics provides two classes of methods for dealing with longitudinal data affected by MNAR mechanisms. The first class uses external information on the variables with missing values to compensate for distortions in the model likelihood. The second class explicitly models the selection bias arising from the MNAR mechanism. Probably the best-known method of this kind is Heckman’s two-stage approach. Further methods are multiple imputation (MI) by multivariate regression, implementing a shift parameter that compensates for deviations from the missing at random (MAR) model, and the full information maximum likelihood (FIML) approach, which includes correlates of missingness in the statistical model.
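The shift-parameter idea can be illustrated with a minimal sketch on simulated data. This is not the authors' implementation: the function name `delta_adjusted_mean`, the simulated variables and the value of `delta` are all assumptions for illustration, and the imputation step is simplified (regression parameters are not redrawn between imputations).

```python
import numpy as np

def delta_adjusted_mean(y, x, observed, delta, n_imputations=20, seed=0):
    """Impute missing y from a MAR regression on x, shift the imputed
    values by `delta`, and return the mean of y averaged over imputations."""
    rng = np.random.default_rng(seed)
    n_obs = int(observed.sum())
    n_mis = int((~observed).sum())
    X_obs = np.column_stack([np.ones(n_obs), x[observed]])
    X_mis = np.column_stack([np.ones(n_mis), x[~observed]])
    # Imputation model fitted to the observed cases only (MAR model).
    beta, *_ = np.linalg.lstsq(X_obs, y[observed], rcond=None)
    sigma = (y[observed] - X_obs @ beta).std(ddof=2)
    means = []
    for _ in range(n_imputations):
        y_imp = y.copy()
        draws = X_mis @ beta + rng.normal(0.0, sigma, size=n_mis)
        # delta = 0 reproduces the MAR imputation; delta != 0 encodes
        # an assumed MNAR departure for the unobserved values.
        y_imp[~observed] = draws + delta
        means.append(y_imp.mean())
    return float(np.mean(means))

# Simulated example (hypothetical data): low values of y are more often missing.
rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=n)
y_full = 0.5 * x + rng.normal(size=n)
observed = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * y_full)))
y = np.where(observed, y_full, np.nan)

mar_mean = delta_adjusted_mean(y, x, observed, delta=0.0)
shifted_mean = delta_adjusted_mean(y, x, observed, delta=-0.5)
print(mar_mean, shifted_mean)
```

In practice `delta` is not estimable from the observed data, so it is usually varied over a plausible range as a sensitivity analysis.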
In this talk, we study the capabilities and obstacles of these three approaches (i.e. Heckman, MI, and FIML) when modelling competence development of German students in mathematics with incomplete data in the competence score variable. For this purpose, we apply a linear panel model subject to missing values in the outcome variable. For analysis we use data from the NEPS Starting Cohort 3.
Selection bias, in the form of incomplete or missing data, is unavoidable in longitudinal surveys. It results in smaller samples, incomplete histories and lower statistical power, and it is well known that unbiased estimates cannot be obtained without properly addressing the implications of incompleteness. Rubin classifies all types of missing data into three categories: i) Missing Completely At Random (MCAR); ii) Missing At Random (MAR); iii) Not Missing At Random (NMAR) (R. J. A. Little & Rubin, 1989, 2002). One of the advantages of population based longitudinal surveys is that they are representative of their target population, but attrition poses a threat to this. Within Rubin’s framework it has been shown that the 1958 cohort data are either MAR or NMAR. If the data are MAR, then observed variables account for selection and, conditional on these, the sample is representative/balanced. Conversely, if the data are NMAR, observed variables do not account for selection and the sample is not representative of its target population. Maximising the plausibility of the MAR assumption increases the likelihood of a survey being representative of its target population despite the presence of missing data. Another implication is that, despite MAR and NMAR being untestable, if a “gold standard” for the target population exists (the census or administrative data, for example), we could test whether, after accounting for missing data, the distribution of target variables in a survey is similar to that observed in the population.
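For reference, the three categories can be written compactly. The notation is an assumption introduced here, not taken from the abstract: $R$ is the response indicator, and $Y_{\text{obs}}$ and $Y_{\text{mis}}$ are the observed and unobserved parts of the data.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{NMAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(R \mid Y_{\text{obs}}),
  \text{ i.e. missingness depends on } Y_{\text{mis}} \text{ even given } Y_{\text{obs}}
\end{align*}
```

Under this notation, adding predictors of response to $Y_{\text{obs}}$ is what makes the MAR condition more plausible.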
We note that this is not a formal test for the NMAR assumption, since even when distributions are similar the target variables can still be NMAR, but arguably the bias (for this specific variable) is negligible. Having systematically identified the predictors of response in all waves of the 1958 British Birth Cohort, we ask whether maximising the plausibility of the MAR assumption is adequate to maintain the characteristics of the original (without attrition) sample as well as its representativeness of the target population in later waves of the study, where significant (40%) unit non-response has occurred. Our preliminary findings show that after maximising the plausibility of the MAR assumption in the 1958 cohort, using predictors of unit non-response identified within the CLS Missing Data Strategy and accounting for missing data with appropriate Multiple Imputation, we were able to recover the original distributions of the birth survey for three key variables for those present in the survey at age 55 (n = 9370): birthweight, maternal smoking during pregnancy and paternal occupational social class at birth. We are currently working on comparing the distributions we obtain after Multiple Imputation with the known population distribution of outcomes such as self-rated health, disability and life satisfaction, and expect these results to be available and presented at the ESRA conference.
Considering that flexible solutions that operate under the Missing At Random (MAR) assumption have become increasingly popular among researchers, a pertinent question is how we can make MAR more plausible. For example, it is widely accepted that under MAR the Full Information Maximum Likelihood (FIML), Multiple Imputation (MI) and Inverse Probability Weighting (IPW) methods return unbiased estimates in the presence of missing data. However, most studies employing these methods rely on a largely arbitrary selection of auxiliary variables that are used as predictors of missing data. It follows that with a theory-driven variable selection process the extent to which the plausibility of the MAR assumption is maximised for a given dataset is not known. Failing to identify all systematic information that drives missingness will invariably make the data NMAR, and results from any MAR approach will likely be biased. We describe a three-step data-driven approach that allowed us to empirically identify all possible predictors of wave-specific non-response in all available waves of the 1958 cohort. We employed univariate and multivariable logistic regression models within the context of sequential multiple imputation, which allowed us to appropriately handle non-monotone missing data patterns. Our approach allowed us to empirically identify all systematic information that maximises the plausibility of the MAR assumption in the 1958 British birth cohort. We found that both early life and adult characteristics are predictors of wave-specific non-response. In accordance with the literature, indicators of socio-economic position, demographic characteristics and home moves strongly predicted wave-specific non-response. However, variables not traditionally used as predictors of non-response, such as indicators of social capital, early life mental health, cognitive ability and health related behaviour, were also associated with wave-specific non-response.
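A minimal sketch of screening candidate predictors of response with univariate and then multivariable logistic regression is given below. It uses simulated, fully observed candidates and omits the sequential multiple imputation step described above. The variable names, effect sizes and the 1.96 screening threshold are assumptions for illustration, not the authors' procedure.

```python
import numpy as np

def fit_logistic(X, r, n_iter=25):
    """Newton-Raphson fit of a logistic regression of r on X.
    X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, X.T @ (r - p))
    # Standard errors from the inverse information matrix at the final estimate.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    return beta, np.sqrt(np.diag(cov))

# Simulated candidates (hypothetical names); only the first two drive response.
rng = np.random.default_rng(7)
n = 5000
names = ["socio_economic_position", "home_moves", "noise_variable"]
Z = rng.normal(size=(n, 3))
logit = 0.8 + 1.0 * Z[:, 0] - 0.6 * Z[:, 1]
responded = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Univariate screening: one logistic regression per candidate.
univariate_z = {}
for j, name in enumerate(names):
    X = np.column_stack([np.ones(n), Z[:, j]])
    beta, se = fit_logistic(X, responded)
    univariate_z[name] = beta[1] / se[1]

# Multivariable model on the candidates that pass the screening threshold.
selected = [name for name in names if abs(univariate_z[name]) > 1.96]
X_multi = np.column_stack([np.ones(n)] + [Z[:, names.index(s)] for s in selected])
beta_multi, se_multi = fit_logistic(X_multi, responded)
for name, b, s in zip(selected, beta_multi[1:], se_multi[1:]):
    print(f"{name}: coef={b:.2f}, se={s:.2f}")
```

Variables retained in the multivariable model would then serve as auxiliary variables in the imputation model.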
Flexible solutions and software exist for methods that assume data are Missing At Random (MAR). Despite our attempts to improve the plausibility of the MAR assumption in longitudinal studies, we cannot discount the possibility that the missing data generating mechanism in longitudinal studies is MNAR. This implies that major causes of missingness were not available, or by omission were not included in the models. Maximising the plausibility of MAR implies that bias will only be possible if unmeasured variables are associated with missingness strongly and independently of the set of variables empirically identified as predictors of wave specific non-response.
In this paper we use these predictors and consider several scenarios for the missing data generating mechanism. Our working example concerns the association between cognitive ability at age 11 and childlessness up to age 42. We compare and contrast results from a Complete Case Analysis, which assumes MCAR, with Multiple Imputation under both MAR and MNAR scenarios. We find that MI returns robust estimates even under strong departures from the maximum plausibility of the MAR assumption.
Working example research question:
In the NCDS cohort, using Complete Case Analysis we find that, of those cohort members who did cognitive tests at age 11, by age 42 a U-shaped pattern of childlessness emerges: those with the lowest and highest scores are most likely to be childless, controlling for important childhood predictors (birthweight, breastfeeding, parental SEP, mother smoking/working).
But could this result be biased? For instance, those with low cognitive scores are perhaps more likely to drop out before age 42. We employ Multiple Imputation, using the auxiliary variables which most effectively predict missingness at future waves.
Objective 1: We give examples using our identified predictors of wave-specific non-response. This allows us to compare different approaches that operate under the MAR and MNAR assumptions.
Analyses carried out:
• Complete Case Analysis (CCA)
• Mean Replacement Imputation (or missing data category) for exposure and confounders
• MI with all identified predictors of response for ages 0, 7, 11, 16 and 42 (not imputing outcome)
• MI with all identified response predictors for ages 0-42 (imputing outcome)
• MI with predictors of response up to age 16 (outcome imputed)
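The contrast between Complete Case Analysis and MI with auxiliary predictors of response can be illustrated on simulated data. This is only a sketch under stated assumptions, not the NCDS analysis: the data, the auxiliary variable `aux`, the linear outcome model and the bootstrap-based imputation are all hypothetical simplifications. Estimates are pooled with Rubin's rules.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, their estimated variances, and the residual SD."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta, sigma2 * np.diag(np.linalg.inv(X.T @ X)), np.sqrt(sigma2)

def rubin_pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance."""
    m = len(estimates)
    within = np.mean(variances)
    between = np.var(estimates, ddof=1)
    return float(np.mean(estimates)), float(within + (1 + 1 / m) * between)

rng = np.random.default_rng(1)
n, m = 4000, 20
x = rng.normal(size=n)
y_full = 1.0 * x + rng.normal(size=n)          # true slope is 1.0
aux = y_full + rng.normal(scale=0.5, size=n)   # fully observed auxiliary variable
# MAR given aux: cases with low aux are more likely to be missing on y.
observed = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.3 + 1.5 * aux)))
y = np.where(observed, y_full, np.nan)

X_analysis = np.column_stack([np.ones(n), x])
X_impute = np.column_stack([np.ones(n), x, aux])

# Complete Case Analysis: analysis model on observed cases only.
beta_cc, _, _ = ols(X_analysis[observed], y[observed])

# MI: bootstrap the observed cases to reflect parameter uncertainty,
# impute y from x and aux, then fit the analysis model to each completed set.
obs_idx = np.flatnonzero(observed)
n_mis = n - obs_idx.size
slopes, slope_vars = [], []
for _ in range(m):
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    beta_b, _, sigma_b = ols(X_impute[boot], y[boot])
    y_imp = y.copy()
    y_imp[~observed] = X_impute[~observed] @ beta_b + rng.normal(0.0, sigma_b, n_mis)
    beta_i, var_i, _ = ols(X_analysis, y_imp)
    slopes.append(beta_i[1])
    slope_vars.append(var_i[1])

slope_mi, var_mi = rubin_pool(slopes, slope_vars)
print(f"CCA slope: {beta_cc[1]:.3f}  MI slope: {slope_mi:.3f}")
```

In this simulation the auxiliary variable fully explains missingness, so MAR holds by construction. Dropping `aux` from the imputation model is a simple way to mimic a mis-specified MI model.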
Objective 2: How well do methods that assume data are Missing At Random perform when the missing data generating mechanism is not ignorable?
Analyses carried out:
Our ‘benchmark’ (reference) for Objective 2 is the final result of Objective 1 (MI with predictors up to age 16, outcome imputed). We compare that with these results:
• MI with all predictors of non-response up to age 16, but removing the most important predictor of response up to age 11
• MI with all predictors of non-response up to age 16, but removing the two most important predictors of response up to age 16.
In all scenarios we find the U-shaped trend is confirmed (level of childhood cognition predicting adult childlessness). That is, the two ‘mis-specified’ MI models in Objective 2 return results with a similar substantive interpretation.