Monday, January 22, 2018

A Confederate is not a Condition

I made a mistake. I equated a person with an experimental condition.

In Study 1 of this article, we introduced N = 54 men to both a White and a Black female confederate in two separate face-to-face interactions. These two confederates – we’ll call them “Hannah” and “Kiara” (not their real names) – played their roles superbly and never forgot their lines. The study was a model of experimental control.

But the inferences I drew from these data were incorrect because of a statistical issue I did not appreciate at the time.

[Figure caption: How would you label this pair of "conditions"?]

What we found was this: The men in our study (all of them White) tended to like the White confederate to the extent that they were politically conservative, but the men liked the Black confederate to the extent that they were liberal. I drew the inference that political orientation was associated with whether the men were attracted to members of their racial ingroup (i.e., the White partner) or outgroup (i.e., the Black partner).

But a logically equivalent description of these results reveals my inferential overreach: The men in our study liked Hannah more to the extent that they were politically conservative, but they liked Kiara more to the extent that they were liberal. The results might have been attributable to the women’s race…or to any of the other myriad differences between these two particular women.[1]

This is why you sample stimuli as well as participants. Arguably, my sample size was not N = 54 (the number of participants), but N = 2 (the number of stimuli).


The above example may seem pretty straightforward to you, but the same issue frequently turns up in subtler—but equally problematic—forms. Let’s say I hypothesize that attractiveness inspires romantic desire more for men than for women in a face-to-face, heterosexual interaction. This makes intuitive sense…anecdotally, men seem to talk more about how hot women are than vice versa. Perhaps surprisingly, then, this sex difference does not emerge in speed-dating contexts where people meet a slew of opposite-sex partners who naturally vary in attractiveness (see here and direct replication here). But maybe it would emerge with a manipulation of attractiveness: If men and women each met an attractive and an unattractive partner, maybe this within-subjects attractiveness manipulation would inspire romantic desire more for men than for women?

[Figure caption: From Li et al. (2013). Each bar was generated by 42-51 raters but only 2 targets.]

Here’s a study that used exactly this approach. It seems to find—and is frequently cited as showing—evidence for the hypothesized sex difference: In the figure on the right, one can clearly see that men differentiated the attractive and unattractive confederates much more strongly than women did.

But notice that this study has the same serious flaw that I described above with my confederate study. To see why, let’s once again use (fake) names: The men desired Rachel and Sally much more than Amanda and Liz, whereas women desired Brian and Karl just a bit more than James and Dan. The results certainly tell us something about the desirability of these particular confederates. But with such a small N (only 2 confederates per condition), we cannot generalize these findings to say anything meaningful about attractive and unattractive targets in general.

[Figure caption: What is the N of this design: 93 or 8?]

The problem here is that stimuli (in this case, confederates) are nested within condition, just as participants are nested within condition in a between-subjects design. To generalize our results beyond the specific people who happen to be in our sample, we have to treat participant as a random factor in our designs. The same logic applies to stimuli: When they are nested within condition, we need to treat stimuli (e.g., confederates) as random factors, because we want to generalize beyond the 2 or 4 or 8 confederates who happened to be part of our study.
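To see the consequence of this nesting concretely, here is a minimal simulation sketch (in Python with NumPy; all parameter values are illustrative assumptions, not taken from the studies above). Each simulated study has two conditions, two confederates nested in each, and no true condition effect at all—only confederate-level idiosyncrasies. Analyzing the data as if confederate differences were condition differences produces far more than the nominal 5% false positives:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_study(n_participants=50, n_stim_per_cond=2, stim_sd=0.5, noise_sd=1.0):
    """One study: two conditions, stimuli nested within condition,
    NO true condition effect. Returns the paired t statistic."""
    # Each confederate has her own idiosyncratic likability (a random intercept).
    stim_effects = rng.normal(0, stim_sd, size=(2, n_stim_per_cond))
    # Each participant rates every confederate; ratings = stimulus effect + noise.
    ratings = stim_effects[None, :, :] + rng.normal(
        0, noise_sd, size=(n_participants, 2, n_stim_per_cond))
    cond_means = ratings.mean(axis=2)            # (participants, 2 conditions)
    diffs = cond_means[:, 0] - cond_means[:, 1]  # per-participant condition difference
    return diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n_participants))

t_stats = np.array([simulate_study() for _ in range(2000)])
false_positive_rate = np.mean(np.abs(t_stats) > 2.01)  # approx. critical t, df = 49
print(f"False-positive rate with 2 stimuli per condition: {false_positive_rate:.2f}")
```

The point of the sketch: adding more participants shrinks the participant noise but does nothing to the confederate-level differences, so the test rejects the (true) null far more than 5% of the time.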

What happens if you regularly equate confederate with condition and use small samples of stimuli? Your effect size estimates will tend to be extremely unstable. Consider this study, which used N = 389 participants but only 10 male and 11 female confederates. It found an enormous sex difference in the opposite direction from the study described above: Confederate attractiveness affected women’s romantic desire much more strongly than men’s. If you were including this study in a meta-analysis, it would be more appropriate to assign it an N of 21 rather than 389 to reflect the imprecision of this particular sex-difference estimate.
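The instability is easy to demonstrate by simulation (again a hedged sketch with illustrative values, not the actual studies' data). Here each simulated study has a true condition effect of 0.5, but the estimated effect carries confederate-level noise that only washes out when many stimuli are sampled:

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_effect(n_stim_per_cond, true_effect=0.5, stim_sd=0.5,
                    noise_sd=1.0, n_participants=100):
    """One study's estimated condition effect (e.g., attractive minus
    unattractive). Stimulus idiosyncrasies add error that does NOT
    shrink as more participants are added."""
    stim = rng.normal(0, stim_sd, size=(2, n_stim_per_cond))
    stim[0] += true_effect  # condition 0 = "attractive"
    ratings = stim[None] + rng.normal(
        0, noise_sd, size=(n_participants, 2, n_stim_per_cond))
    means = ratings.mean(axis=(0, 2))  # grand mean per condition
    return means[0] - means[1]

few = np.array([estimate_effect(2) for _ in range(1000)])
many = np.array([estimate_effect(50) for _ in range(1000)])
print(f"SD of effect estimates, 2 stimuli/condition:  {few.std():.2f}")
print(f"SD of effect estimates, 50 stimuli/condition: {many.std():.2f}")
```

With 2 stimuli per condition, estimates swing wildly from study to study around the true value; with 50, they cluster tightly—which is the intuition behind assigning the small-stimulus study an effective N closer to its stimulus count.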

So what to do? Power calculations with these designs are complex, but a good start would be to use at least N = 40 or 50 stimuli per condition and treat stimuli as a random factor. Then, any incidental differences between the experimental stimuli would likely wash out, and we could be reasonably confident that any effects of the “manipulation” were truly due to attractiveness. Yes, that’s probably too many stimuli for a study involving live confederates, so you may need to get creative—for example, many speed-dating studies provide this kind of statistical power. [2]
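Mechanically, "treat stimuli as a random factor" means fitting a mixed model with crossed random effects for participants and stimuli. One way to sketch this in Python is via statsmodels' variance-components interface (the data, variable names, and parameter values here are all illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Illustrative data: 50 participants each rate 4 stimuli (2 per condition).
n_part, n_stim = 50, 4
part_fx = rng.normal(0, 0.5, n_part)  # participant random intercepts
stim_fx = rng.normal(0, 0.5, n_stim)  # stimulus random intercepts
rows = []
for p in range(n_part):
    for s in range(n_stim):
        cond = 1 if s < 2 else 0      # stimuli 0-1 form the "attractive" condition
        liking = 0.5 * cond + part_fx[p] + stim_fx[s] + rng.normal(0, 1)
        rows.append({"pid": p, "stim": s, "cond": cond, "liking": liking})
df = pd.DataFrame(rows)

# Crossed random effects in statsmodels: one dummy group spanning the data,
# with participant and stimulus entered as variance components.
df["group"] = 1
vc = {"pid": "0 + C(pid)", "stim": "0 + C(stim)"}
res = smf.mixedlm("liking ~ cond", df, groups="group",
                  vc_formula=vc, re_formula="0").fit()
print(res.summary())
```

The model specification is the same whether you have 2 or 50 stimuli per condition; the difference is that with only 2, the stimulus variance component (and hence the condition effect) is estimated with almost no precision—which is why the recommendation above is to sample many more stimuli.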

It’s easy to get tripped up by this issue, especially when you have confederates that you’ve carefully selected to differ in an obvious way. But don’t make the same mistake I did: If a confederate is nested within condition in your design, you likely need to reconsider that design.

For more information about stimulus sampling challenges, see detailed discussions by Wells and Windschitl and Westfall and colleagues, as well as this paper that describes stimulus sampling challenges when studying sex differences in particular.

[1] Study 2 of the same paper replicated this interaction using N = 2,781 White participants and N = 24,124 White and Black targets, which allows us to have more confidence in the inference that this interaction is about race rather than peculiarities of particular stimuli. Nevertheless, I assure you that at the time, I would have tried to publish the two-confederate study on its own had I not had access to this larger Study 2 sample.

[2] Alternatively, you could manipulate the attractiveness of a single confederate (e.g., using makeup and clothing); at least one study has successfully done so (see Figure 1 here), although we have found executing such a manipulation to be challenging in our lab.


  1. Great blog, Paul. I will assign it to my grad ANOVA class. Three related questions:

    1. Is the issue primarily with nesting or is it with the fact that stimuli comprise a small sample from a population of potential stimuli? In the latter case, having a confederate in all conditions (so they're crossed rather than nested) doesn't go very far to solve the problem.

    2. Do you think the issue is equally relevant to any study with stimuli? In our research on correlating sexual arousal patterns and sexual orientation, we use a small number of erotic stimuli that vary by type (within-subjects factor); we typically employ two of each type. Researchers in other labs typically use different exemplars of the stimuli, and we generally all get similar results. These are often large effects though.

    3. If nesting isn't the only issue (#1), then what distinguishes the need to treat stimuli as a random effect versus manipulation (e.g., in a psychology experiment) as a random effect, from the population of manipulations one might employ to test a conceptual effect? I suppose that random effects meta-analysis has this idea built in, but one could do it at the study level too.

    I'm sure you know this article, but it's relevant:

    1. Thanks! And great questions...

      1. I think it's primarily "the fact that stimuli comprise a small sample from a population...". If you had the same confederate in all conditions, presumably your particular confederate isn't of substantive interest anymore. Rather, your manipulation of substantive interest would be something that confederate is doing or saying - the "doing" or "saying" is now your IV instead of "Jim vs. Bob". (But see #3 below.)

      2. I think it *could* be the case that your findings are limited to the stimuli that you happened to use. In some cases, that's perfectly ok - the stimuli might be "Clinton" and "Trump", but your conceptual variable is linked to those two specific people, not politicians in general. But that's probably the exception that proves the rule?

      3. Yeah, this is a great point - you are getting at the essence of conceptual replication. So yes, we know to operationalize our variables in different ways across studies to make sure a finding isn't restricted to one particular measure or manipulation. But does this mean that we should be imagining ourselves sampling from a "population" of different possible manipulations? Maybe? I'd need to think about this more, but it's a mind-bend-y possibility (and it is certainly one reason that a random effects meta-analysis would be appropriate).