I made a mistake. I equated a person with an experimental
condition.
In Study 1 of this article, we
introduced N = 54 men to both a White
and a Black female confederate in two separate face-to-face interactions. These
two confederates – we’ll call them “Hannah” and “Kiara” (not their real names)
– played their roles superbly and never forgot their lines. The study was a model
of experimental control.
But the inferences I drew from these data were incorrect
because of a statistical issue I did not appreciate at the time.
[Figure: How would you label this pair of "conditions"?]
What we found was this: The men in our study (all of them
White) tended to like the White confederate to the extent that they were
politically conservative, but the men liked the Black confederate to the extent
that they were liberal. I drew the inference that political orientation was
associated with whether the men were attracted to members of their racial ingroup
(i.e., the White partner) or outgroup (i.e., the Black partner).
Arguably, my sample size was not N = 54 (the number of participants) but N = 2 (the number of stimuli). This is why you sample stimuli as well as participants. [1]
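To see how easily this can happen, here is a minimal simulation sketch in Python (made-up numbers of my own, not the actual data): suppose race has no effect at all, and each confederate simply clicks with a different kind of man for idiosyncratic, non-racial reasons. The naive analysis still produces a clean politics-by-"race" crossover.

```python
import numpy as np

rng = np.random.default_rng(1)
n_men = 54

# Assume race truly has NO effect. Each confederate just happens to appeal to
# a different kind of man for non-racial reasons (humor, accent, shared
# interests...). These are person-level quirks, not race effects.
conservatism = rng.normal(size=n_men)
quirk = {"Hannah": 0.5, "Kiara": -0.5}  # idiosyncratic slopes (illustrative)

liking_hannah = quirk["Hannah"] * conservatism + rng.normal(size=n_men)
liking_kiara = quirk["Kiara"] * conservatism + rng.normal(size=n_men)

# The naive analysis nonetheless recovers the crossover pattern:
print(np.corrcoef(conservatism, liking_hannah)[0, 1])  # tends to be clearly positive
print(np.corrcoef(conservatism, liking_kiara)[0, 1])   # tends to be clearly negative
```

With only one confederate per "condition", the design simply cannot distinguish a race effect from a Hannah-versus-Kiara effect.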
===============
The above example may seem pretty straightforward to you,
but the same issue frequently turns up in subtler—but equally problematic—forms.
Let’s say I hypothesize that attractiveness inspires romantic desire more for
men than for women in a face-to-face, heterosexual interaction. This makes
intuitive sense…anecdotally, men seem to talk more about how hot women are than
vice versa. Perhaps surprisingly, then, this sex difference does not emerge in
speed-dating contexts where people meet a slew of opposite-sex partners who
naturally vary in attractiveness (see here and direct
replication here). But perhaps it would emerge with a manipulation of attractiveness: if men and women each met one attractive and one unattractive partner, would this within-subjects manipulation inspire romantic desire more strongly in men than in women?
[Figure: From Li et al. (2013). Each bar was generated by 42-51 raters but only 2 targets.]
Here's a study that used exactly this approach. It seems to find (and is frequently cited as showing) evidence for the hypothesized sex difference: in the figure on the right,
one can clearly see that men differentiated the attractive and unattractive
confederates much more strongly than women did.
But notice that this study has
the same serious flaw that I described above with my confederate study. To see
why, let’s once again use (fake) names: The men desired Rachel and Sally much
more than Amanda and Liz, whereas women desired Brian and Karl just a bit more
than James and Dan. The results certainly tell us something about the
desirability of these particular confederates. But with such a small N (only 2
confederates per condition), we cannot generalize these findings to say
anything meaningful about attractive and
unattractive targets in general.
[Figure: What is the N of this design: 93 or 8?]
What happens if you regularly equate confederate with
condition and use small samples of stimuli? Your effect size estimates will tend
to be extremely unstable. Consider this
study, which used N = 389 participants but only 10 male and 11 female
confederates. It found an enormous sex difference in the opposite direction from the study described above: confederate attractiveness affected women's romantic desire much more strongly than men's. If you were including this study in a meta-analysis, it would be more appropriate to assign it an N of 21 rather than 389, to reflect the imprecision of this particular sex-difference estimate.
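Here's a quick simulation sketch of that instability (the numbers below, like true_effect and stim_sd, are illustrative assumptions of mine, not estimates from these studies). The true sex difference is set to zero, yet with only 2 confederates per cell the estimated difference swings wildly in either direction:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.5  # assumed: the SAME attractiveness effect for men and women
stim_sd = 0.5      # assumed: idiosyncratic appeal of individual confederates

def estimated_sex_difference(k):
    """(Men's minus women's) attractiveness effect with k confederates per cell."""
    effects = {}
    for sex in ("men", "women"):
        attractive = true_effect + rng.normal(scale=stim_sd, size=k).mean()
        unattractive = rng.normal(scale=stim_sd, size=k).mean()
        effects[sex] = attractive - unattractive
    return effects["men"] - effects["women"]

for k in (2, 40):
    diffs = [estimated_sex_difference(k) for _ in range(10_000)]
    print(f"k={k:>2}: SD of estimated sex difference = {np.std(diffs):.2f}")

# k= 2: SD is about 0.71 -- the true difference is 0, yet individual studies
#       routinely find large "effects" in either direction.
# k=40: SD is about 0.16 -- estimates now cluster near the truth.
```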
So what to do? Power calculations with these designs are complex, but
a good start would be to use at least N = 40 or 50 stimuli per condition and
treat stimuli as a random factor. Then, any incidental differences between the experimental
stimuli would likely wash out, and we could be reasonably confident that any effects
of the “manipulation” were truly due to attractiveness. Yes, that’s probably
too many stimuli for a study involving live confederates, so you may need to
get creative—for example, many speed-dating studies provide this kind of
statistical power. [2]
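For the analysis side, the standard specification in R would be lme4's desire ~ attractive + (1|participant) + (1|stimulus). Here is a rough Python equivalent on simulated data (a sketch under my own assumptions, using statsmodels' variance-components workaround for crossed random effects; all variable names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data: every participant rates every stimulus, and
# stimuli are nested within the attractive/unattractive condition. Sizes are
# kept small so the sketch runs quickly; a real study should aim for the
# 40-50 stimuli per condition suggested above.
rng = np.random.default_rng(0)
n_part, n_stim = 50, 20
part_re = rng.normal(scale=0.5, size=n_part)  # participant random intercepts
stim_re = rng.normal(scale=0.5, size=n_stim)  # stimulus random intercepts

rows = []
for p in range(n_part):
    for s in range(n_stim):
        attractive = int(s < n_stim // 2)  # first half of stimuli = "attractive"
        desire = 0.5 * attractive + part_re[p] + stim_re[s] + rng.normal()
        rows.append((desire, attractive, p, s))
df = pd.DataFrame(rows, columns=["desire", "attractive", "participant", "stimulus"])

# statsmodels has no direct (1|participant) + (1|stimulus) syntax; the usual
# workaround treats the whole dataset as one group and declares each crossed
# random factor as a variance component.
df["all"] = 1
model = smf.mixedlm(
    "desire ~ attractive",
    df,
    groups="all",
    re_formula="0",  # no extra group-level intercept
    vc_formula={"participant": "0 + C(participant)",
                "stimulus": "0 + C(stimulus)"},
)
print(model.fit().summary())
```

The summary then reports the fixed effect of attractive alongside separate variance estimates for participants and stimuli, so idiosyncratic stimulus differences are treated as noise rather than inflating the condition effect.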
It’s easy to get tripped up by this issue, especially when
you have confederates that you’ve carefully selected to differ in an obvious
way. But don't make the mistake I made: if a confederate is nested within condition, you likely need to reconsider your design.
For more information about stimulus sampling challenges, see detailed discussions by Wells and Windschitl and by Westfall and colleagues, as well as this paper, which describes the stimulus sampling challenges of studying sex differences in particular.
[1]
Study 2 of the same paper replicated this interaction using N = 2,781 White participants
and N = 24,124 White and Black targets, which allows us to have more confidence
in the inference that this interaction is about race rather than peculiarities
of particular stimuli. Nevertheless, I assure you that at the time, I would
have tried to publish the two-confederate study on its own had I not had access
to this larger Study 2 sample.
[2]
Alternatively, you could manipulate the attractiveness of a single confederate
(e.g., using makeup and clothing); at least one study has successfully done so
(see Figure 1 here),
although we have found executing such a manipulation to be challenging in our
lab.
===============
Great blog, Paul. I will assign it to my grad ANOVA class. Three related questions:
1. Is the issue primarily with nesting, or is it with the fact that stimuli comprise a small sample from a population of potential stimuli? In the latter case, having a confederate in all conditions (so they're crossed rather than nested) doesn't go very far to solve the problem.
2. Do you think the issue is equally relevant to any study with stimuli? In our research on correlating sexual arousal patterns and sexual orientation, we use a small number of erotic stimuli that vary by type (within-subjects factor); we typically employ two of each type. Researchers in other labs typically use different exemplars of the stimuli, and we generally all get similar results. These are often large effects though.
3. If nesting isn't the only issue (#1), then what distinguishes the need to treat stimuli as a random effect versus manipulation (e.g., in a psychology experiment) as a random effect, from the population of manipulations one might employ to test a conceptual effect? I suppose that random effects meta-analysis has this idea built in, but one could do it at the study level too.
I'm sure you know this article, but it's relevant: http://jakewestfall.org/publications/JWK.pdf
Thanks! And great questions...
1. I think it's primarily "the fact that stimuli comprise a small sample from a population...". If you had the same confederate in all conditions, presumably your particular confederate isn't of substantive interest anymore. Rather, your manipulation of substantive interest would be something that confederate is doing or saying - the "doing" or "saying" is now your IV instead of "Jim vs. Bob". (But see #3 below.)
2. I think it *could* be the case that your findings are limited to the stimuli that you happened to use. In some cases, that's perfectly ok - the stimuli might be "Clinton" and "Trump", but your conceptual variable is linked to those two specific people, not politicians in general. But that's probably the exception that proves the rule?
3. Yeah, this is a great point - you are getting at the essence of conceptual replication. So yes, we know to operationalize our variables in different ways across studies to make sure a finding isn't restricted to one particular measure or manipulation. But does this mean that we should be imagining ourselves sampling from a "population" of different possible manipulations? Maybe? I'd need to think about this more, but it's a mind-bend-y possibility (and it is certainly one reason that a random effects meta-analysis would be appropriate).