All time references are in CEST
Failures in Survey Experiments 2

Session Organisers | Dr Kristin Kelley (WZB Berlin Social Science Center); Professor Lena Hipp (WZB Berlin Social Science Center/University of Potsdam)
Time | Wednesday 19 July, 16:00 - 17:30
Room | U6-01f
What can failures and unexpected results tell us about survey experimental methodology?
In recent years, social scientists have increasingly relied on survey experiments to estimate causal effects. As with any experimental design, survey experiments can fail or yield results that researchers did not anticipate or preregister. Researchers find themselves in situations with null, unexpected, inconsistent, or inconclusive results and must then decide whether these results reflect on the theory being tested, the experimental design, or both. The insights gained from such failures are usually not widely shared, even though they could do much to improve experimental design, research quality, and transparency.
We propose a session in which scholars present insights from survey experiments that failed or led to unexpected results. We believe that sharing the design and results of failed survey experiments, carefully considering their possible flaws, and talking about unexpected findings is useful for the development of theory (e.g., identifying scope conditions) and methods, and contributes to transparent research practices.
We invite contributions that address the following: What is the appropriate response to a “failed” experiment, and how should researchers deal with unexpected results and null findings? More specifically, the objective of the session is to reflect on the definition of a “failed survey experiment” (e.g., null findings, unexpected findings, problems during the conduct of the survey experiment, findings that contradict field-experimental evidence and theoretical/pre-registered predictions). We will discuss why some experiments fail (e.g., poor treatments/manipulations, unreliable or invalid dependent measurements, underdeveloped theory and/or hypotheses, insufficient statistical power), whether and how to interpret and publish results from failed survey experiments, what can be learned from failed survey experiments, and recommendations for survey experiment methodology.
Keywords: survey experiments, null findings, pre-registration, experimental design, sample size requirements
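One of the failure modes named above, insufficient statistical power, can be made concrete with a back-of-the-envelope calculation. The sketch below is purely illustrative: it uses assumed effect sizes rather than figures from any study in this session, and it approximates the per-arm sample size for a two-arm survey experiment analysed with a two-sided difference-in-means test.

```python
# Minimal power sketch (illustrative only): approximate respondents per arm needed
# to detect a standardized mean difference d with a two-sided test at 80% power.
from statistics import NormalDist

def n_per_arm(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Standard normal-approximation sample size for a two-arm mean comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value of the two-sided test
    z_beta = z.inv_cdf(power)            # quantile corresponding to the target power
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return int(n) + 1                    # round up to stay conservative

for d in (0.5, 0.2, 0.1):                # "medium", "small", and "very small" effects
    print(f"d = {d}: about {n_per_arm(d)} respondents per arm")
```

Under these standard approximations, a small effect of d = 0.2 already calls for roughly 400 respondents per arm, which is one reason survey experiments with several treatment arms or subgroup analyses so easily end up underpowered.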
Dr Sebastian Wenz (GESIS – Leibniz Institute for the Social Sciences) - Presenting Author
To overcome problems of observational designs in identifying causal effects, researchers in the social and behavioral sciences have frequently used survey experiments and similar designs to identify racial and ethnic discrimination in various contexts. However, given usual definitions of racial and ethnic discrimination – e.g., as the causal effect of an ethnic/racial signal sent out by an individual on how this individual is treated – many commonly implemented factorial survey designs fail to identify ethnic/racial discrimination.
In my contribution, I show that survey experiments on ethnic/racial discrimination that use names or pictures usually apply treatments that potentially carry both ethnic/racial and social class signals. Thus, these studies will typically not identify ethnic/racial discrimination according to the definition above – or other popular definitions, for that matter – but confound ethnic/racial discrimination with social class discrimination. Put differently, the challenge for researchers setting up a survey experiment to study ethnic/racial discrimination is to construct an ethnicity or race treatment variable whose values send varying ethnic signals but remain constant in their social class signal.
Using Directed Acyclic Graphs (DAGs), I illustrate this problem for the popular approach of using (first and/or family) names as ethnicity/race stimuli. I discuss different possible solutions but focus on the selection of names that hold the social class signal constant. I show that, and how, it is possible to select such a set of names for a survey experiment, but also that this approach has limitations. I will conclude with a brief discussion of whether the causal effect defined above is the effect we are really interested in and point to some other failures in identifying and estimating ethnic/racial discrimination with survey experiments (e.g., related to the sampling process).
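To make the confounding structure described above concrete, the following sketch simulates invented data in which the name treatment shifts both the ethnicity signal and the perceived class signal. All variable names and effect sizes are assumptions for illustration and do not come from the abstract.

```python
# Hypothetical simulation of the confounding problem: minority names that also signal
# lower social class mix ethnic and class discrimination in the naive contrast.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

ethnic_effect = -0.10   # assumed "true" effect of the ethnic signal on the rating
class_effect = -0.15    # assumed effect of a lower perceived class signal

# Design 1: the class signal moves together with the name (confounded design).
minority = rng.integers(0, 2, n)        # 0 = majority name, 1 = minority name
low_class = minority
rating = (0.6 + ethnic_effect * minority + class_effect * low_class
          + rng.normal(0, 0.2, n))
naive = rating[minority == 1].mean() - rating[minority == 0].mean()

# Design 2: minority names selected so that the class signal stays constant.
rating_fixed = 0.6 + ethnic_effect * minority + rng.normal(0, 0.2, n)
clean = rating_fixed[minority == 1].mean() - rating_fixed[minority == 0].mean()

print(f"confounded contrast:        {naive:.3f}")   # close to ethnic + class effect (-0.25)
print(f"class signal held constant: {clean:.3f}")   # close to the ethnic effect alone (-0.10)
```

The first contrast mixes the two effects, whereas the second design, with the class signal held constant across names, recovers the ethnic effect alone.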
Dr Sara Möser (University of Bern) - Presenting Author
Ms Madlaina Jost (University of Bern)
Using the evaluation of hypothetical job offers, we analyse gender-specific preferences for work arrangements.
In a discrete choice experiment implemented in the tenth survey wave of the DAB Panel Study, a sample of young adults is asked to evaluate four sets of job offers. In each set, they receive two offers in their occupational field that differ only in salary, working hours, flexibility, support for further training, opportunities for professional advancement, and working atmosphere. Respondents are asked to indicate, on the one hand, which job they find more attractive and, on the other hand, which they would choose if they also had the option of rejecting both.
Analysing the forced and the unforced choice with linear probability models yields similar results regarding the evaluation of work arrangements. However, when the analysis is differentiated by gender and individual context (family-formation intentions and employment situation), substantial differences between the forced and unforced models emerge, specifically in the valuation of part-time positions and of support for further training.
It is generally recommended to give participants the option to reject both alternatives, whereby the opt-out option is framed as a status quo alternative. In our experiment on job choice, however, we did not specify this status quo and did not elaborate on the consequences of rejecting both offers. The experimental condition asked respondents to imagine that they were currently looking for a new position; it is unclear, however, whether rejecting both offers would result in hypothetical unemployment, staying in the current job, or some other situation. Using information on the current employment and training situation of our respondents, we analyse these status quo effects.
We would be pleased to elaborate on our findings in Milan and thereby contribute to a deeper understanding of the implications of design choices in choice experiments.
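A minimal sketch of the kind of comparison described above, using invented data rather than the DAB Panel Study: forced and unforced choices are stacked at the alternative level and each is regressed on the job attributes with a linear probability model. The attribute names, utility weights, and opt-out rule are assumptions for illustration only.

```python
# Sketch: forced vs. unforced choice analysed with linear probability models
# on hypothetical alternative-level data (two job offers per choice set).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_sets = 2_000

df = pd.DataFrame({
    "set_id": np.repeat(np.arange(n_sets), 2),
    "parttime": rng.integers(0, 2, 2 * n_sets),
    "training": rng.integers(0, 2, 2 * n_sets),
    "salary_high": rng.integers(0, 2, 2 * n_sets),
})

# Hypothetical utilities: the forced choice picks the better offer within each set,
# the unforced choice additionally allows rejecting both offers (opt-out).
utility = (0.3 * df["salary_high"] + 0.2 * df["training"] - 0.1 * df["parttime"]
           + rng.normal(0, 0.5, len(df)))
df["forced_choice"] = (utility.groupby(df["set_id"]).rank(method="first") == 2).astype(int)
df["unforced_choice"] = df["forced_choice"] * (utility > 0.1).astype(int)

forced = smf.ols("forced_choice ~ parttime + training + salary_high", data=df).fit()
unforced = smf.ols("unforced_choice ~ parttime + training + salary_high", data=df).fit()
print(forced.params, unforced.params, sep="\n\n")
```

Comparing the two sets of coefficients, overall and within subgroups, mirrors the forced-versus-unforced comparison reported in the abstract.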
Professor Lena Hipp (WZB Berlin/University of Potsdam)
Ms Sandra Leumann (WZB Berlin Social Science Center) - Presenting Author
In order to test the effect of a randomized variable X on an outcome of interest Y, researchers have to ensure that the manipulated X indeed conveys the concept it is intended to convey. This presentation reports on a failed manipulation in a survey experiment about dating preferences.
The guiding research question of this project was: Do men and women in gender-atypical jobs have worse chances of finding a partner than individuals in gender-typical jobs? To answer this question, we conducted several studies, one of which was a factorial survey experiment. It failed because respondents did not react to the manipulation of our independent variable of interest (X), the gender-typicality of an occupation, in the way we expected. To generate male- vs. female-typed occupations, we relied on actual segregation figures for particular occupations, i.e., whether an occupation was female-dominated, male-dominated, or employed approximately equal numbers of men and women. Thanks to a manipulation check in which we asked respondents to assess how many men vs. women worked in a particular occupation, we found that respondents rarely knew which occupations were male- or female-dominated.
With our conference presentation, we pursue the following goals: First, we will show the degree to which the gender-typicality of the occupations used in our study (conducted in December 2019 on an online quota sample in Germany) was misperceived and analyze the factors that potentially lead to such misperceptions. Second, we discuss how researchers can deal with failed manipulations when they have pre-registered their experiments. Third, we present practical insights into what researchers should consider when they are interested in assessing the effects of gender-typical vs. gender-atypical occupations on any outcome of interest.
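A manipulation check of the kind described above can be summarised very simply. The sketch below uses invented occupations, shares, and 70%/30% gender-typing cut-offs purely for illustration; it does not reproduce the authors' data.

```python
# Sketch of a manipulation check: compare respondents' perceived share of women in
# each occupation with the administrative figure used to classify the occupation.
import pandas as pd

checks = pd.DataFrame({
    "occupation": ["childcare worker", "mechanic", "insurance clerk"],
    "actual_share_women": [0.92, 0.05, 0.51],
    "perceived_share_women": [0.60, 0.35, 0.50],   # invented mean respondent guesses
})
checks["misperception"] = checks["perceived_share_women"] - checks["actual_share_women"]

def gender_type(share: float) -> str:
    """Classify an occupation by its share of women (assumed 70%/30% cut-offs)."""
    if share >= 0.7:
        return "female-typed"
    if share <= 0.3:
        return "male-typed"
    return "mixed"

# Flag occupations whose intended gender-typing is not reproduced in respondents'
# perceptions, i.e. the failure mode reported in the abstract.
checks["intended_type"] = checks["actual_share_women"].map(gender_type)
checks["perceived_type"] = checks["perceived_share_women"].map(gender_type)
checks["manipulation_ok"] = checks["intended_type"] == checks["perceived_type"]
print(checks)
```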
Dr Sebastian Rinken (Institute for Advanced Social Studies, Spanish Research Council (IESA-CSIC)) - Presenting Author
Dr Sara Pasadas-del-Amo (Institute for Advanced Social Studies, Spanish Research Council (IESA-CSIC))
Mr Manuel Trujillo (Institute for Advanced Social Studies, Spanish Research Council (IESA-CSIC))
There is ample evidence that, with regard to sensitive issues, direct questions are prone to elicit incorrect scores. List experiments aim to avoid such distortions (“social desirability bias”) by randomly dividing the sample into two groups and presenting both with identical lists – except for the addition of the sensitive item as treatment. Respondents are asked just how many, but not which, items apply to them. When technical safeguards are implemented correctly, respondents perceive their anonymity to be guaranteed; hence, those considering the treatment item applicable are expected to report correct counts.
However, a growing number of studies have reported list experiments to yield unstable or, more intriguingly, even counter-intuitive results. Recent meta-analyses reveal the technique to fail precisely where it ought to excel, namely with regard to highly sensitive issues in general (Ehler, Wolter & Junkermann, 2021) and prejudiced attitudes in particular (Blair, Coppock & Moor, 2020). Scant differences vis-à-vis list-based prevalence estimates have led some observers to conclude that direct questions capture prejudiced attitudes rather well after all.
We report on a “failed” list experiment on anti-immigrant sentiment that was fielded in Spain (n=1,965): on aggregate, the list-based and an analogous direct estimator obtained similar results. However, in specific respondent categories, the list-based estimate was negative – i.e., the treatment group’s mean score was lower than the corresponding control group’s. After considering all possible explanations, we conclude that strategic response error is the most plausible one – i.e., that artificially low scores were triggered by confrontation with the sensitive item. Expanding on Zigerell’s (2011) work on deflation, we suggest specific scenarios that make such respondent behavior intelligible in the study context. The paper’s upshot is that “failed” experiments may convey crucial lessons, both methodologically and substantively.
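For readers unfamiliar with the design, the list-experiment prevalence estimate is simply the difference in mean item counts between treatment and control, and deflation concentrated in a subgroup can push that difference below zero. The sketch below is a hypothetical simulation with an invented prevalence and deflation rate, not the Spanish data.

```python
# Sketch of the standard list-experiment (item count) estimator, overall and by
# subgroup, with deflation built into one subgroup to show a negative estimate.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2_000

df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),            # 1 = list includes the sensitive item
    "subgroup": rng.choice(["A", "B"], n),
})
baseline = rng.binomial(4, 0.5, n)               # endorsed non-sensitive items (4-item list)
holds_view = rng.binomial(1, 0.15, n)            # assumed true prevalence of 15%

# Deflation scenario in subgroup B: some treated respondents lower their count when
# confronted with the sensitive item (strategic response error).
deflate = (df["treated"] == 1) & (df["subgroup"] == "B") & (rng.random(n) < 0.3)
df["count"] = (baseline + df["treated"] * holds_view - deflate.astype(int)).clip(lower=0)

def list_estimate(d: pd.DataFrame) -> float:
    """Difference-in-means estimator of the sensitive item's prevalence."""
    return d.loc[d["treated"] == 1, "count"].mean() - d.loc[d["treated"] == 0, "count"].mean()

print("overall estimate:", round(list_estimate(df), 3))
for name, d in df.groupby("subgroup"):
    print(f"subgroup {name}:", round(list_estimate(d), 3))   # subgroup B can turn negative
```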