Matching, weighting and related techniques for the estimation of causal effects

Chair: Dr Bruno Arpino (Pompeu Fabra University)
In England, the General Certificate of Secondary Education (GCSE) is the main qualification taken by 16-year-old students. The grades achieved at GCSE affect progression to higher levels of education and career opportunities. The GCSE is currently offered separately by four awarding organisations (AOs), and for each GCSE subject an AO may offer one or more distinct specifications, the standards of which are supposed to be comparable. This means that students of similar ability who take qualifications in the same subject offered by different AOs should obtain the same grade. However, speculation that some specifications are more lenient than others can lead schools to choose strategically which specification to offer. Although each AO takes into account multiple sources of information during awarding, based on both subject experts’ opinions and statistical evidence, this does not guarantee that standards are maintained.
This paper proposes the use of propensity-score matching (PSM) to monitor the comparability of standards after awarding. Following the seminal work by Rosenbaum and Rubin (1983), PSM methods have been increasingly used by social scientists and have become a standard tool for analysing data from observational studies and, in particular, for studying the effects of public interventions (Caliendo and Kopeinig, 2008). More generally, PSM can be used to draw inferences from non-randomly selected samples. The aim of this paper is to show that, in educational assessment, PSM methods can be used to monitor the comparability of qualification standards.
The methods currently used for this purpose are based on measures of ability, such as prior or concurrent attainment in other subjects. This paper shows that PSM methods allow the use of a broader set of student characteristics and therefore take into account a much larger number of factors that potentially influence performance. In this way, it is possible to disentangle the ‘true’ causal effect of taking one specification rather than another, once all other observable characteristics are controlled for. It should be noted that, in this context, PSM has been considered not only as an alternative statistical procedure, but also as a way to broaden the definition of comparability so that it covers a more comprehensive set of characteristics rather than ability alone.
Using data from the National Pupil Database provided by the Department for Education (which contains a rich set of information on students and schools), PSM was performed on a number of subjects. Different models for the propensity score (PS) were trialled to deal with the hierarchical structure of the data (students within schools). In particular, single-level and two-level models for PS estimation (Arpino and Mealli, 2011) are presented, along with results from alternative matching procedures (e.g. coarsened matching).
Analyses show that, compared to other methods, PSM works particularly well when the two specifications under scrutiny are taken by groups of students with different characteristics. Findings also suggest that results are robust to alternative PS estimation models and to alternative matching procedures.
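The matching step described above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the data are simulated rather than drawn from the National Pupil Database, there is a single covariate `x` standing in for prior attainment, the propensity score comes from a single-level logit fitted by plain gradient ascent, and matching is one-to-one nearest neighbour with replacement. All names (`fit_logit`, `att_by_matching`, the coefficients in the simulation) are hypothetical and chosen only for illustration.

```python
import bisect
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(x, t, steps=300, lr=0.5):
    """Fit P(t=1 | x) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, ti in zip(x, t):
            r = ti - sigmoid(b0 + b1 * xi)  # residual drives the gradient
            g0 += r
            g1 += r * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def att_by_matching(ps, t, y):
    """Average effect on the treated: each treated unit is paired with the
    control unit whose propensity score is closest (with replacement)."""
    controls = sorted((p, yi) for p, ti, yi in zip(ps, t, y) if ti == 0)
    c_scores = [p for p, _ in controls]
    diffs = []
    for p, ti, yi in zip(ps, t, y):
        if ti == 1:
            j = bisect.bisect_left(c_scores, p)
            candidates = [k for k in (j - 1, j) if 0 <= k < len(controls)]
            k = min(candidates, key=lambda c: abs(c_scores[c] - p))
            diffs.append(yi - controls[k][1])
    return sum(diffs) / len(diffs)


# Simulated data (assumption): t = 1 means "took specification A";
# abler students are more likely to take A, and the true effect of A is 1.0.
rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
t = [1 if rng.random() < sigmoid(0.8 * xi) else 0 for xi in x]
y = [1.0 * ti + 2.0 * xi + rng.gauss(0, 1) for xi, ti in zip(x, t)]

b0, b1 = fit_logit(x, t)
ps = [sigmoid(b0 + b1 * xi) for xi in x]

treated_y = [yi for yi, ti in zip(y, t) if ti == 1]
control_y = [yi for yi, ti in zip(y, t) if ti == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
att = att_by_matching(ps, t, y)

print("naive difference in means:", round(naive, 3))
print("matched ATT estimate:", round(att, 3))
```

In this toy setting the raw difference in means mixes the effect of the specification with the difference in ability between the two groups, while the matched estimate compares students with similar estimated propensity scores. A two-level model of the kind cited above would additionally account for school membership, which this sketch ignores.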
Propensity Score Matching (PSM) is a widely used technique for estimating causal effects under the assumption of strong ignorability of treatment assignment. More recently, Coarsened Exact Matching (CEM) has been suggested as an alternative. CEM belongs to the class of Monotonic Imbalance Bounding matching methods, which makes it theoretically more appealing than PSM. However, there is little evidence on the comparative performance of the two techniques under different scenarios. Using Monte Carlo simulations, we generate scenarios that differ in the complexity (non-linearity and/or non-additivity) of the treatment and outcome generating models, and we compare the performance of PSM and CEM under each. One of the difficulties of implementing PSM in practice is finding a propensity score model that yields good balance in the distribution of each covariate. Machine learning techniques have been found to give considerably better performance than standard logistic regression in terms of balance and bias reduction. We also consider a new algorithm that combines CEM and PSM. This algorithm, which we label CEM-PSM, is expected to perform better than either technique used separately in most scenarios. We therefore compare the performance of: 1) PSM with the propensity score model estimated by logistic regression; 2) PSM with the propensity score model estimated by random forests; 3) CEM; 4) CEM-PSM with the propensity score model estimated either by a standard logit model or by random forests. We assess the performance of each approach with respect to covariate balance, and to the bias and MSE of the causal effect estimators. We also consider the estimation of heterogeneous treatment effects. In particular, we consider a simple case where the treatment interacts with a categorical covariate, so that category-specific treatment effects may be of interest. In this case, we expect CEM and CEM-PSM to be more useful than PSM alone.
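To make the CEM step and the category-specific case concrete, the following sketch coarsens one continuous covariate into bins, matches exactly on (category, bin), discards strata that lack either treated or control units, and averages the within-stratum differences with weights equal to the number of treated units. It is an illustration under stated assumptions, not the CEM-PSM algorithm proposed here: the data-generating process, the bin width of 0.5 and the function name `cem_att` are all hypothetical.

```python
import math
import random
from collections import defaultdict


def cem_att(rows, width=0.5):
    """rows: iterable of (g, x, t, y) with g a category, x a continuous
    covariate, t in {0, 1} the treatment and y the outcome.
    Returns {g: ATT estimate} from exact matching on (g, coarsened x)."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for g, x, t, y in rows:
        # Coarsening: x is replaced by the index of its bin of the given width.
        strata[(g, math.floor(x / width))][t].append(y)

    num = defaultdict(float)
    den = defaultdict(int)
    for (g, _), cell in strata.items():
        if cell[0] and cell[1]:  # keep only strata with both groups present
            n_t = len(cell[1])
            diff = sum(cell[1]) / n_t - sum(cell[0]) / len(cell[0])
            num[g] += n_t * diff  # weight each stratum by its treated count
            den[g] += n_t
    return {g: num[g] / den[g] for g in den}


# Simulated data (assumption): the treatment effect is 1.0 when g == 0
# and 2.0 when g == 1, and treatment uptake depends on x.
rng = random.Random(1)
rows = []
for _ in range(4000):
    g = rng.randrange(2)
    x = rng.gauss(0, 1)
    t = 1 if rng.random() < 1 / (1 + math.exp(-0.8 * x)) else 0
    y = (1.0 + 1.0 * g) * t + 2.0 * x + rng.gauss(0, 1)
    rows.append((g, x, t, y))

effects = cem_att(rows)
for g in sorted(effects):
    print("category", g, "estimated effect:", round(effects[g], 3))
```

Because the categorical covariate is part of the stratum key, every matched comparison is made within a category, so category-specific effects fall out of the matching directly. A finer bin width reduces residual imbalance within strata at the cost of discarding more unmatched units.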