Contemporary Issues in the Assessment of Measurement Invariance 2
Session Organisers | Dr Daniel Seddig (University of Cologne & University of Zurich), Professor Eldad Davidov (University of Cologne & University of Zurich), Professor Peter Schmidt (University of Giessen)
Time | Wednesday 17th July, 14:00 - 15:00
Room | D22
The assessment of the comparability of cross-national and longitudinal survey data is a prerequisite for meaningful and valid comparisons of substantive constructs across contexts and time. A powerful tool to test the equivalence of measurements is multiple-group confirmatory factor analysis (MGCFA). Although measurement invariance (MI) testing procedures are increasingly used by applied researchers, several issues remain under discussion and are not yet resolved. For example:
(1) Can we trust models with small deviations (approximate MI)? Is partial MI sufficient? How should one deal with the lack of scalar MI, as is the case in many large-scale cross-national surveys?
(2) How should one decide whether a model with a higher level of MI is to be preferred over a model with a lower level of MI? Which fit indices should be used?
(3) Is MI needed at all, or would it be best to start with a robustness check?
Recent approaches have tackled the issues subsumed under (1) and aimed at relaxing certain requirements when testing for measurement invariance, either through Bayesian approximate MI (Muthén and Asparouhov 2012; van de Schoot et al. 2013) or through the alignment method (Asparouhov and Muthén 2014). Furthermore, researchers have addressed the issues subsumed under (2) and recommended cutoffs for changes in particular fit statistics (e.g., CFI, RMSEA, SRMR) to decide among competing models (Chen 2007). The question raised under (3) is more general and concerns the contemporary use of the concept of MI itself. Researchers (Welzel and Inglehart 2016) have argued that variations in measurements across contexts can be ignored, for example in the presence of theoretically reasonable associations of a construct with external criteria.
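For issue (2), the comparison of nested MI models typically rests on changes in these fit indices. As a minimal, purely illustrative sketch in R with lavaan (not code from the studies cited above), and assuming a hypothetical one-factor model with items x1-x4 and a grouping variable country in a data frame dat, the configural, metric, and scalar models could be fitted and compared as follows:

library(lavaan)

# Minimal illustrative sketch. The data frame 'dat', the items x1-x4, and the
# grouping variable 'country' are hypothetical placeholders.
model <- 'F =~ x1 + x2 + x3 + x4'

fit_configural <- cfa(model, data = dat, group = "country")
fit_metric     <- cfa(model, data = dat, group = "country",
                      group.equal = "loadings")
fit_scalar     <- cfa(model, data = dat, group = "country",
                      group.equal = c("loadings", "intercepts"))

# Changes in CFI, RMSEA, and SRMR between adjacent models can then be checked
# against the cutoffs proposed by Chen (2007)
sapply(list(configural = fit_configural, metric = fit_metric, scalar = fit_scalar),
       fitMeasures, fit.measures = c("cfi", "rmsea", "srmr"))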
This session aims to present studies that assess measurement invariance and/or address one of the issues listed above or related ones. We welcome (1) applied presentations that make use of empirical survey data, and/or (2) methodological presentations that address and examine measurement invariance testing, for example using Monte Carlo simulations to study the above-mentioned issues.
Keywords: measurement invariance, comparability, cross-cultural research, structural equation modeling
Ms Kristín Hulda Kristófersdóttir (University of Iceland/ Methodology Research Center, University of Iceland) - Presenting Author
Mr Hans Haraldsson (University of Iceland/ Methodology Research Center, University of Iceland)
Ms Hilma Rós Ómarsdóttir (University of Iceland/ Methodology Research Center, University of Iceland)
Ms Ragnhildur Lilja Ásgeirsdóttir (University of Iceland/ Methodology Research Center, University of Iceland)
Ms Vaka Vésteinsdóttir (University of Iceland/ Methodology Research Center, University of Iceland)
Ms Hafrún Kristjánsdóttir (Reykjavík University)
Ms Fanney Thorsdottir (University of Iceland/ Methodology Research Center, University of Iceland)
The Patient Health Questionnaire (PHQ-9) is frequently used for screening for depressive disorder. Currently, there is a lack of research on the measurement invariance of the PHQ-9 across gender. The purpose of this study is to test the PHQ-9 for differential item functioning (DIF) related to gender. A data set of 621 clinical participants was used, 101 males and 520 females. The data set came from a study in which participants with anxiety and/or depression symptoms were recruited to assess the efficacy of treatment. Participants answered several screening instruments, including the PHQ-9, to evaluate changes in disorder symptoms throughout the treatment. The data used in this study consist of participants' answers to the PHQ-9 before treatment. The PHQ-9 was evaluated for DIF using the graded response model within the IRT approach. The factor structure of the PHQ-9 was confirmed using confirmatory factor analysis. The scale as a whole performed well in terms of IRT information, but the results suggested that some of the items need revision. No clear evidence of DIF was found for any item in the PHQ-9 between males and females. Further research is needed to establish valid use of the PHQ-9 as a screening tool for depression symptoms.
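As a rough, hedged sketch of this general approach (not the authors' actual analysis or data), a multiple-group graded response model and item-level DIF tests could be set up in R with the mirt package; the object names phq (a data frame with the nine item responses, scored 0-3) and sex (the corresponding gender variable) are assumptions for illustration:

library(mirt)

# Hypothetical inputs: 'phq' holds the nine PHQ-9 item responses (scored 0-3),
# 'sex' is the corresponding gender variable; neither comes from the study itself.
grm <- multipleGroup(phq, model = 1, group = sex, itemtype = "graded")

# DIF tests that impose cross-group equality constraints on the discrimination
# (a1) and threshold (d1-d3) parameters one item at a time
DIF(grm, which.par = c("a1", "d1", "d2", "d3"), scheme = "add")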
Dr Maksim Rudnev (ISCTE-IUL) - Presenting Author
Measurement invariance of constructs across many (>10) groups is rarely supported for all groups and indicators. One meaningful strategy in this situation is to look for subsets/clusters of groups for which a required level of invariance holds. The alignment procedure can help in finding outlier groups but provides biased results in the presence of clusters. Multilevel mixture models are applicable; however, they are very complex and have limited availability. I suggest a simple method to find clusters of groups within which measurement invariance may hold. The method involves k-means clustering based on differences between group-specific factor parameters (loadings and intercepts) or, alternatively, on MGCFA model fit indices computed for each pair of groups. An interactive R Shiny app with a graphical interface facilitates the development of hypotheses regarding clusters. A simulation study demonstrates that, compared to the alignment procedure, the simple method is more efficient in the presence of group clusters and similar in performance when detecting outlier groups.
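A minimal sketch of what such a procedure could look like, under simple assumptions and not the author's actual implementation or Shiny app: fit a configural MGCFA in lavaan, arrange the group-specific loadings and intercepts into a groups-by-parameters matrix, and apply k-means to it. The data frame dat, the items x1-x4, the grouping variable country, and the number of clusters are hypothetical placeholders.

library(lavaan)

# Illustrative one-factor configural model with standardized latent variances,
# so that loadings are freely estimated and comparable across groups
model <- 'F =~ x1 + x2 + x3 + x4'
fit   <- cfa(model, data = dat, group = "country", std.lv = TRUE)

# Collect group-specific loadings and item intercepts from the configural model
pe         <- parameterEstimates(fit)
loadings   <- pe[pe$op == "=~", ]
intercepts <- pe[pe$op == "~1" & pe$lhs %in% loadings$rhs, ]
pars       <- rbind(loadings, intercepts)
pars$label <- paste(pars$lhs, pars$op, pars$rhs)

# One row per group, one (standardized) column per parameter
wide <- reshape(pars[, c("group", "label", "est")],
                idvar = "group", timevar = "label", direction = "wide")
X <- scale(as.matrix(wide[, -1]))
rownames(X) <- lavInspect(fit, "group.label")[wide$group]

# k-means on the parameter profiles; groups in the same cluster are candidates
# for holding a required level of invariance (3 clusters chosen only for illustration)
set.seed(1)
cl <- kmeans(X, centers = 3, nstart = 50)
split(rownames(X), cl$cluster)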
Dr Wahideh Achbari (University of Amsterdam) - Presenting Author
Professor Eldad Davidov (University of Cologne & University of Zurich)
Generalized trust (GT) is a prominent indicator in studies on social cohesion. While prior research has debated the negative link between ethnic diversity and trust, only a handful of studies have so far focused on the underlying measurement issues. Contrary to conventional wisdom, a British study using think-aloud protocols demonstrated that the majority of respondents high in GT think ‘most people’ refers to people they know, whereas a high proportion of those who are low in GT think about strangers (Sturgis & Smith, 2010). A study by Delhey, Newton, and Welzel (2011) employs the World Values Survey (WVS) and concludes that the trust radius is much smaller in Confucian cultures but wider in most Western nations, without conducting any invariance tests. These results contradict the findings of Sturgis and Smith (2010), since within-country differences are treated as less important. In this paper, we employ the think-aloud data (Sturgis & Smith, 2010) to explore the role of education in differences in reference frames. This is particularly relevant since GT has been found to correlate consistently and positively with having a university degree. Intuitively, we can expect that higher-educated people are more likely to see ‘most people’ in the GT question as an abstract (thus unknown) category (perhaps even out-groups). We additionally conduct formal invariance tests across all countries included in the WVS, comparing educational groups. This extends existing invariance studies, which only examine between-country differences (Reeskens & Hooghe, 2008; Van der Veld & Saris, 2011; Meuleman & Billiet, 2012; Freitag & Bauer, 2013). Moreover, we employ the alignment method (Asparouhov and Muthén 2014) to validate our results. We argue that measurement issues of GT across groups cannot be ignored, since the think-aloud results (as an external criterion) suggest there is no theoretically viable argument that ‘most people’ unequivocally refers to out-groups.