Ensuring Validity and Measurement Equivalence through Questionnaire Design and Cognitive Pretesting Techniques 1
Session Organisers |
Dr Natalja Menold (GESIS)
Mr Peyton M Craighill (Office of Opinion Research | U.S. Department of State)
Ms Patricia Hadler (GESIS)
Ms Aneta G. Guenova (Office of Opinion Research | U.S. Department of State)
Dr Cornelia Neuert (GESIS)
Dr Patricia Goerman (U.S. Census Bureau)
Time | Friday 19th July, 09:00 - 10:30 |
Room | D19 |
According to the total survey error framework, validity refers to the degree to which survey results can be interpreted with respect to the concepts under investigation, i.e. particular opinions, behaviors, abilities, and competencies. Validity also concerns interpretations of differences or changes in these concepts, such as comparisons between respondent groups, across time, or across cultures. Survey researchers conduct studies in various languages and cultures within one country or across countries and gather demographic, administrative, and social data in these multi-cultural contexts, constantly trying to improve the accuracy of these measurements. The comparative aspects of validity have been referred to as measurement equivalence issues. Many researchers address measurement equivalence during data analysis, after data collection, and often find a lack of measurement invariance. However, the sources of measurement non-invariance are more likely to be rooted in questionnaire design and data collection processes.
Difficult questions, overloaded instructions or visual design elements can affect validity and measurement equivalence. This session aims to discuss methods of developing measurement instruments and their effects on validity and measurement equivalence. The goal is to better understand the corresponding sources of measurement error and to present methods that help to increase validity in comparative research. In particular, generic, multi-method approaches are of interest. Such approaches can include expert reviews by subject matter experts, cognitive interviews and pilot interviews with respondents who represent the main demographic groups of the target countries. In addition, quantitative analyses of findings, e.g. from experiments comparing different versions of a questionnaire, can help to evaluate the sources of decreased validity and deficient measurement equivalence.
Keywords: Questionnaire Design, Cognitive interviewing, Validity, Measurement Equivalence, Pretesting, Question Evaluation
Ms Patricia Hadler (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Surveys on socially undesirable behavior are likely to include questions on the behavior itself and on related attitudes. Context effects have frequently been analyzed for sensitive topics in a survey context. However, little is known about their impact on cognitive pretesting. Online probing offers an anonymous, self-administered setting in which to study these effects. Using questions about delinquency, this study examines the impact of question order on probe responses for sensitive behavioral and attitudinal questions. Analyses combine survey and probe answers with client-side paradata.
A behavioral and an attitudinal question were presented in randomized order, each followed by an open probe. A client-side paradata script collected item-level response times, answer changes, page revisits, time spent answering the probes, and corrections to probe responses.
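As a rough illustration of how such client-side paradata might be turned into analysis variables, the sketch below derives an item-level response time and a text-correction count from a hypothetical event log. The log format, field names and event types are assumptions made for illustration, not the script actually used in the study.

```python
# Minimal sketch: deriving item-level paradata from a client-side event log.
# The event-log structure ('item', 'event', 'timestamp') is a hypothetical
# illustration, not the instrument used in the study.
from datetime import datetime

events = [
    {"item": "probe_behavior", "event": "focus",    "timestamp": "2024-01-01T09:00:01"},
    {"item": "probe_behavior", "event": "deletion", "timestamp": "2024-01-01T09:00:20"},
    {"item": "probe_behavior", "event": "submit",   "timestamp": "2024-01-01T09:00:45"},
]

def response_time(log, item):
    """Seconds between the first and last recorded event for one item."""
    times = [datetime.fromisoformat(e["timestamp"]) for e in log if e["item"] == item]
    return (max(times) - min(times)).total_seconds()

def corrections(log, item):
    """Number of deletion events, used here as a crude proxy for text corrections."""
    return sum(1 for e in log if e["item"] == item and e["event"] == "deletion")

print(response_time(events, "probe_behavior"))  # 44.0
print(corrections(events, "probe_behavior"))    # 1
```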
Probes following behavioral questions show a higher level of non-response. However, whereas these answers do not differ strongly with question order, respondents report a more lenient attitude in the probe following the attitudinal question when the prior question asked about their behavior. Moreover, probes following a behavioral question are likely to include both behavioral and attitudinal content, whereas probes following an attitudinal question generally do not reference past behavior.
The length of probe responses varies with question order for the probe following the behavioral question only. Attitudinal content is associated with more text corrections in both probes. Response times for the probes depend mainly on the answer given to the survey question and on the content of the probe. They correlate strongly with the response time of the survey question for respondents giving non-substantive answers.
The study analyses context effects in the online pretesting of sensitive topics by combining survey and probe responses with paradata from both. The results strongly support testing question order during pretesting, as question order impacts both survey and probe responses.
Mr Xabier Irastorza (European Agency for Safety and Health at Work (EU-OSHA)) - Presenting Author
The second European Survey of Enterprises on New and Emerging Risks (ESENER-2, 2014) involved 49,320 establishments across all business size classes and activity sectors in 36 European countries. It focused on the management of occupational safety and health (OSH) risks in practice.
ESENER-2 expanded on the approach of ESENER-1 by including micro establishments (5-9 employees) and establishments in agriculture, forestry and fishing. The European Agency for Safety and Health at Work (EU-OSHA) undertook a review to consider the impact of this expansion of the survey universe. The review was informed by and structured around the Total Survey Error framework. This abstract focuses on its findings on measurement error in relation to the inclusion of micro establishments.
Following a review of the ESENER-2 questionnaire and an assessment of responses by establishment size, in-depth qualitative interviews were carried out with 28 micro establishments in Spain and Romania.
The findings suggested that participating micro establishments were relatively successful and OSH-confident businesses. Nevertheless, there were also indications of a mismatch between the intent of questions on key aspects of OSH arrangements and their interpretation by respondents in micro establishments. Moreover, there was some evidence that the understanding and interpretation of these questions varied with establishment size, sector, and regulatory and business contexts.
The review indicates that the inclusion of micro establishments in an international enterprise survey on OSH presents significant challenges:
• Refusals at recruitment and during the survey process.
• Respondents' understanding and interpretation of key concepts and terms, and the implications for survey development.
• Collection of sufficient contextual detail for meaningful data analysis and interpretation.
The review suggests that care must be taken to develop survey methods and content that are appropriate for all size classes. The review makes a number of recommendations for improving data collection from micro businesses in future waves of ESENER.
Dr Andre Pirralha (RECSM - Universitat Pompeu Fabra) - Presenting Author
Dr Diana Zavala-Rojas (European Social Survey ERIC - Universitat Pompeu Fabra)
Dr Wiebke Weber (RECSM - Universitat Pompeu Fabra)
Multi-item agree/disagree scales are common when collecting data on latent constructs such as attitudes, values or beliefs. In order to avoid some known response biases, researchers are usually advised to include reverse-worded items in their scales. The main reason given for this questionnaire design strategy is to avoid the detrimental effects on data quality of certain response styles, such as acquiescence. However, even though the inclusion of reverse-worded items can help to identify some types of response biases, it can also have unintended consequences for validity conclusions and measurement invariance. At present, researchers have a variety of tools to test whether estimates depend on group membership, i.e. whether measurement invariance holds. But procedures to identify potential sources of measurement non-invariance are still limited.
In this paper, we rely on item response theory (IRT) to gather information about the causes of non-invariance and whether reverse-worded items play a role in it. We use data from the European Social Survey Round 8 and the items measuring the complex concept "Perceived consequences of social policies". Six items, of which two were reverse-worded, were included in the questionnaire using a 5-category agree/disagree response scale. First, we assess to what degree this concept is measurement invariant by fitting a Multi-Group Confirmatory Factor Analysis model. Based on the conclusions of this first step, we then fit an IRT Graded Response Model (GRM) to explore the causes of non-invariance.
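As a sketch of the model used in the second step, the code below computes category response probabilities under Samejima's Graded Response Model for a 5-category agree/disagree item. The discrimination and threshold values are invented for illustration and are not estimates from the ESS data.

```python
# Minimal sketch of Samejima's Graded Response Model (GRM) for one item;
# all parameter values below are illustrative, not ESS Round 8 estimates.
import numpy as np

def grm_probs(theta, a, b):
    """P(X = k | theta) for k = 0..K-1 under the GRM.

    theta : latent trait value
    a     : item discrimination parameter
    b     : ordered threshold parameters (length K-1)
    """
    b = np.asarray(b, dtype=float)
    # Cumulative probabilities P(X >= k), padded with the boundaries 1 and 0.
    p_star = np.concatenate(([1.0], 1.0 / (1.0 + np.exp(-a * (theta - b))), [0.0]))
    return p_star[:-1] - p_star[1:]

# Two respondents with opposite opinions (theta = -1.5 vs. +1.5): with high
# discrimination their response distributions differ sharply; with low
# discrimination (as found here for the reverse-worded items) they choose
# the same response options with much more similar probabilities.
b = [-1.5, -0.5, 0.5, 1.5]
for a in (2.0, 0.2):
    for theta in (-1.5, 1.5):
        print(f"a={a}, theta={theta}:", np.round(grm_probs(theta, a, b), 3))
```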
The results show that the use of reverse-worded items limits the cross-country measurement equivalence of the concept "Perceived consequences of social policies". We further show that the reverse-worded items do not discriminate between respondents on the latent opinion and that respondents with diverse opinions have the same probability of choosing the same response option.
Dr Sunghee Lee (University of Michigan) - Presenting Author
Ms Wenshan Yu (University of Michigan)
Ms Jenni Kim (University of Michigan)
Ms Maria Fernanda Alvarado Leiton (University of Michigan)
Dr Rachel Davis (University of South Carolina)
While valid assessment of subjective well-being (SWB) is at the forefront of social science research, the literature increasingly reports cross-cultural noncomparability of measurement tools designed to capture SWB. Notably, typical SWB measures use multiple statements that respondents rate using Likert-type agree-disagree response scales. One important and popular SWB measure is the Satisfaction with Life (SWL) Scale developed by Ed Diener. The SWL Scale uses five items, all written in the direction of high satisfaction. A potential source of measurement bias and noncomparability in the SWL Scale is response style, a tendency of respondents to use certain response points on a Likert response scale regardless of the question content. For respondents with an acquiescence response style (ARS), a tendency to choose “agree” responses, it becomes murky whether a high SWL score reflects true life satisfaction or ARS. Comparing cultural groups on SWL scores may become futile if the groups differ in their ARS tendencies.
This study examines an alternative version of the SWL Scale in which some of the items are written in the low satisfaction direction and compares the original and alternative SWL Scales through experiments implemented in four different surveys that targeted various racial/ethnic groups in the U.S. (Hispanics, non-Hispanic Whites, non-Hispanic Blacks, and Koreans) that are hypothesized to differ in ARS tendencies. The comparison will focus on response distributions, measurement reliability and concurrent validity between the two versions of the SWL Scale.
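As a rough sketch of the recoding and response-style logic behind such a comparison, the example below reverse-codes two hypothetically low-satisfaction-worded items and computes a simple agree-share index as one crude ARS indicator. The 7-point scale, the item values, and the index definition are illustrative assumptions, not the study's actual procedure.

```python
# Illustrative sketch: reverse-coding low-satisfaction items and computing a
# crude acquiescence (ARS) indicator; all values are invented for illustration.
import numpy as np

K = 7  # assumed 1-7 agree/disagree response scale

def reverse_code(x, k=K):
    """Map a 1..k response on a reverse-worded item back to the high-satisfaction direction."""
    return (k + 1) - x

def agree_share(responses, k=K):
    """Proportion of responses above the scale midpoint, a simple ARS-style index."""
    return float(np.mean(np.asarray(responses) > (k + 1) / 2))

original = np.array([6, 7, 6, 5, 6])     # all five items worded toward high satisfaction
alternative = np.array([6, 7, 2, 3, 6])  # items 3 and 4 worded toward low satisfaction

recoded = alternative.copy()
recoded[2:4] = reverse_code(recoded[2:4])

print(agree_share(original))  # 1.0 -- may reflect true satisfaction or ARS
print(recoded)                # [6 7 6 5 6] -- the two versions align after recoding
```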