I made a mistake. I equated a person with an experimental condition.
In Study 1 of this article, we introduced N = 54 men to both a White and a Black female confederate in two separate face-to-face interactions. These two confederates – we’ll call them “Hannah” and “Kiara” (not their real names) – played their roles superbly and never forgot their lines. The study was a model of experimental control.
But the inferences I drew from these data were incorrect because of a statistical issue I did not appreciate at the time.
|How would you label this pair of "conditions"?|
What we found was this: The men in our study (all of them White) tended to like the White confederate to the extent that they were politically conservative, but the men liked the Black confederate to the extent that they were liberal. I drew the inference that political orientation was associated with whether the men were attracted to members of their racial ingroup (i.e., the White partner) or outgroup (i.e., the Black partner).
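But Hannah and Kiara surely differed in any number of ways besides race, and with only one confederate per race, I cannot separate the effect of race from the effect of those other differences. This is why you sample stimuli as well as participants. Arguably, my sample size was not N = 54 (the number of participants), but N = 2 (the number of stimuli).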
The above example may seem pretty straightforward to you, but the same issue frequently turns up in subtler (but equally problematic) forms. Let’s say I hypothesize that attractiveness inspires romantic desire more for men than for women in a face-to-face, heterosexual interaction. This makes intuitive sense: anecdotally, men seem to talk more about how hot women are than vice versa. Perhaps surprisingly, then, this sex difference does not emerge in speed-dating contexts where people meet a slew of opposite-sex partners who naturally vary in attractiveness (see here and direct replication here). But maybe it would emerge with a manipulation of attractiveness: If men and women each met an attractive and an unattractive partner, maybe this within-subjects attractiveness manipulation would inspire romantic desire more for men than for women?
|From Li et al. (2013). Each bar was generated by 42-51 raters but only 2 targets.|
Here’s a study that used exactly this approach to test the hypothesis that attractiveness will matter more for inspiring romantic desire in men than in women. It seems to find—and is frequently cited as showing—evidence for the hypothesized sex difference: In the figure on the right, one can clearly see that men differentiated the attractive and unattractive confederates much more strongly than women did.
But notice that this study has the same serious flaw that I described above with my confederate study. To see why, let’s once again use (fake) names: The men desired Rachel and Sally much more than Amanda and Liz, whereas women desired Brian and Karl just a bit more than James and Dan. The results certainly tell us something about the desirability of these particular confederates. But with such a small N (only 2 confederates per condition), we cannot generalize these findings to say anything meaningful about attractive and unattractive targets in general.
|What is the N of this design: 93 or 8?|
What happens if you regularly equate confederate with condition and use small samples of stimuli? Your effect size estimates will tend to be extremely unstable. Consider this study, which used N = 389 participants but only 10 male and 11 female confederates. The authors found an enormous sex difference in the opposite direction from the study described above: Confederate attractiveness affected women’s romantic desire much more strongly than men’s. If you were including this study in a meta-analysis, it would be more appropriate to assign it an N of 21 rather than 389 to reflect the imprecision of this particular sex-difference estimate.
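To make the instability concrete, here is a minimal simulation sketch in Python. It is a toy model, not a reanalysis of any study mentioned here: the function name, the default standard deviations, and the assumption of normally distributed stimulus effects are all mine. The true effect of "condition" is set to zero, each stimulus has its own idiosyncratic appeal, and every participant rates every stimulus.

```python
import random
from statistics import fmean, stdev


def effect_estimate_sd(n_participants, n_stimuli_per_condition,
                       stimulus_sd=1.0, noise_sd=1.0,
                       n_replications=200, seed=0):
    """Spread (SD) of the estimated condition difference across simulated
    studies in which the TRUE condition difference is zero.

    Hypothetical toy model: rating = stimulus appeal + random noise.
    Participant intercepts are omitted because they cancel out of a
    within-subjects difference.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_replications):
        condition_means = []
        for _condition in range(2):
            stimulus_means = []
            for _stimulus in range(n_stimuli_per_condition):
                # Idiosyncratic appeal of this stimulus, unrelated to condition.
                appeal = rng.gauss(0.0, stimulus_sd)
                ratings = [appeal + rng.gauss(0.0, noise_sd)
                           for _ in range(n_participants)]
                stimulus_means.append(fmean(ratings))
            condition_means.append(fmean(stimulus_means))
        estimates.append(condition_means[0] - condition_means[1])
    return stdev(estimates)


if __name__ == "__main__":
    for n_p, n_s in [(54, 2), (540, 2), (54, 50)]:
        sd = effect_estimate_sd(n_p, n_s)
        print(f"{n_p} participants, {n_s} stimuli per condition: SD = {sd:.2f}")
```

Under these assumptions, the variance of the estimate is roughly 2·stimulus_sd²/m + 2·noise_sd²/(n·m), where m is the number of stimuli per condition and n the number of participants. Adding participants shrinks only the second term, so with two stimuli per condition the estimate stays noisy no matter how many people you run.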
So what to do? Power calculations with these designs are complex, but a good start would be to use at least N = 40 or 50 stimuli per condition and treat stimuli as a random factor. Then, any incidental differences between the experimental stimuli would likely wash out, and we could be reasonably confident that any effects of the “manipulation” were truly due to attractiveness. Yes, that’s probably too many stimuli for a study involving live confederates, so you may need to get creative—for example, many speed-dating studies provide this kind of statistical power. 
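One simple way to treat stimuli as the unit of analysis is sketched below: average over participants so that each stimulus contributes one score, then compare conditions across stimuli with a Welch t-test. This is only an approximation (it ignores participant-level variability; a fuller treatment would model participants and stimuli as crossed random effects), and the function name and data layout are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def by_stimulus_welch_t(ratings):
    """Compare two conditions with stimuli, not participants, as the unit.

    ratings: dict with exactly two conditions, each mapping a stimulus id
             to the list of ratings that stimulus received.
    Returns (mean difference, Welch t statistic, Welch-Satterthwaite df).
    """
    stims_a, stims_b = ratings.values()
    # Collapse over participants: one mean score per stimulus.
    means_a = [mean(r) for r in stims_a.values()]
    means_b = [mean(r) for r in stims_b.values()]
    n_a, n_b = len(means_a), len(means_b)
    if n_a < 2 or n_b < 2:
        # With one stimulus per condition, stimulus variance cannot be estimated.
        raise ValueError("need at least 2 stimuli per condition")
    # Squared standard error of each condition mean, based on stimulus variance.
    se2_a = variance(means_a) / n_a
    se2_b = variance(means_b) / n_b
    diff = mean(means_a) - mean(means_b)
    t = diff / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return diff, t, df


if __name__ == "__main__":
    # Made-up ratings for illustration only.
    example = {
        "attractive": {"s1": [6, 7], "s2": [5, 6], "s3": [7, 8]},
        "unattractive": {"s4": [3, 4], "s5": [4, 5], "s6": [2, 3]},
    }
    print(by_stimulus_welch_t(example))
```

Note that the degrees of freedom depend on the number of stimuli, not the number of raters, and that a design with a single confederate per condition raises an error because there is nothing from which to estimate stimulus-to-stimulus variability.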
It’s easy to get tripped up by this issue, especially when you have confederates that you’ve carefully selected to differ in an obvious way. But don’t make the mistake. If a confederate is nested within condition in your design, you likely need to reconsider your design.
For more information about stimulus sampling challenges, see detailed discussions by Wells and Windschitl and Westfall and colleagues, as well as this paper that describes stimulus sampling challenges when studying sex differences in particular.
 Study 2 of the same paper replicated this interaction using N = 2,781 White participants and N = 24,124 White and Black targets, which allows us to have more confidence in the inference that this interaction is about race rather than peculiarities of particular stimuli. Nevertheless, I assure you that at the time, I would have tried to publish the two-confederate study on its own had I not had access to this larger Study 2 sample.
 Alternatively, you could manipulate the attractiveness of a single confederate (e.g., using makeup and clothing); at least one study has successfully done so (see Figure 1 here), although we have found executing such a manipulation to be challenging in our lab.