Wednesday, March 7, 2018

Going on the record via preregistration

A public commitment to update my own beliefs in response to a planned analysis I haven’t seen yet (Part 1)

In my previous post, I talked about how essential it is that we, as scientists, remain open to the possibility of having our intuitions disconfirmed.

Now let’s see if I can put my money where my mouth is.

If I take my own admonishment seriously, I need to be willing to have my own intuitions and beliefs disconfirmed—even when those beliefs have developed through years of researching a particular topic.

Here’s one of my own findings in which I have a high degree of confidence. In a meta-analysis I conducted about five years ago, we examined whether a partner’s attractiveness was more romantically appealing to men than to women. We acquired a large collection of published and unpublished datasets (k = 97, N = 29,780) that spanned a variety of paradigms in which men and women reported on partners they had (at a minimum) met face-to-face. Overall, we found that the sex difference in the appeal of attractiveness was not significantly different from zero, and it did not matter whether the study examined initial attraction (e.g., speed-dating, confederate designs) or established relationships (e.g., dating couples, married couples).

Here is a hypothetical illustration of this finding: If a man’s satisfaction in a given relationship is predicted by his female partner’s attractiveness at r = .08, we might find that a woman’s satisfaction is predicted by her male partner’s attractiveness at about r = .03. Meta-analytically, the sex difference is about this size: r(difference) = .05 or smaller. You can interpret this r(difference) like you would interpret r = .05 in any other context—really small, hard to detect, and probably not practically different from zero.

[Table caption: However you slice the meta-analytic data, it is hard to find a sex difference in the appeal of attractiveness in paradigms where participants have met partners face-to-face. p refers to the p value of the sex-difference test statistic Q_sex. From here.]
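
To make concrete just how hard an r(difference) of .05 is to detect, here is a minimal sketch in Python (not the meta-analytic machinery we actually used) that treats the men’s and women’s correlations as coming from two independent samples, converts them with Fisher’s z, and asks how many people per sex would be needed for 80% power. The .08 and .03 values are the hypothetical ones from the illustration above; the function names are my own.

```python
import math
from scipy import stats

def fisher_z(r):
    """Fisher r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def n_per_group(q, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect a difference q between
    two independent correlations (on the Fisher z scale), two-tailed."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # SE of z1 - z2 with equal group sizes is sqrt(2 / (n - 3))
    return math.ceil(2 * ((z_alpha + z_beta) / q) ** 2 + 3)

q = fisher_z(0.08) - fisher_z(0.03)  # ~ .05 on the z scale
print(round(q, 3))                   # 0.05
print(n_per_group(q))                # roughly 6,000+ men and 6,000+ women
```

Under these simplifying assumptions, a single study would need something on the order of 12,000 participants to detect a difference this small reliably, which is part of why the meta-analytic lens (k = 97, N = 29,780) matters here.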

Interestingly, the sex difference in attractiveness is much larger when you ask men and women to use a rating scale to indicate how much they think they like physical attractiveness in a partner. The size of this “stated preference” sex difference is about r = .25 (see Table 1 in this paper). [1]

In other words, an r = .25 effect when people make judgments about what they think they like drops to r = .05 when people are responding to partners who they have actually met in real life. 

I find this “effect size drop” deeply fascinating. It opens two interesting questions that have guided much of my research:

1. If men and women truly differ in the extent to which they believe attractiveness to be important in a partner, what factors interfere with the application of these ideals when they evaluate partners in real life?

2. If there is essentially no difference between men and women in how much they actually prefer attractiveness in a real life partner, what sorts of social-cognitive biases might produce the sex difference in how much people think they prefer attractiveness in a partner?

I have spent considerable time and effort in the last decade examining these two questions in my research. We’ve found some answers, and yet there’s still a long way to go in this topic area.

[Table caption: All effect sizes are coded so positive values mean that attractiveness receives higher ratings/is a larger predictor for men than for women. I am prepared to update the table after I examine the new McNulty et al. (in press) data according to my preregistered analysis plan.]

But back to my belief that I am putting on the line in this blog post: I believe that the sex difference is about r = .05 (or smaller) when people evaluate real-life partners. I feel pretty confident about this belief, given all the evidence I have seen. But there are other scholars who believe something entirely different.


Since we published the meta-analysis, two empirical articles have taken a strong stance against our conclusion that the sex difference in the appeal of attractiveness is small or nonexistent. I discussed one of them (Li et al., 2013) in an earlier post; given the tiny effective sample size of that study, I won’t discuss it further here. Instead, let’s talk about the second one: Meltzer, McNulty, Jackson, & Karney (2014).

This paper found the expected sex difference in a sample of N = 458 married couples. In brief, they found that women’s attractiveness predicted men’s satisfaction at r = .10, whereas men’s attractiveness predicted women’s satisfaction at r = -.05. That’s an r(difference) of .15—still pretty small, but not zero (p = .046).

One unusual element of this paper is that the authors only present this sex difference in one analysis, and it included a large number of covariates. Twenty-eight of them, to be exact. Another element worth noting is that there were actually two ways that the sex difference could have emerged—on the intercept of satisfaction or the slope of satisfaction. The effect that the authors focused on was the intercept; slope effects did not differ for men and women, r(difference) = .02.
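
For readers who want to see where an “intercept effect” and a “slope effect” live in this kind of analysis, here is a stripped-down growth-curve sketch in Python. It is not the authors’ actual dyadic model: the data file, variable names, and coding are hypothetical, and the 28 covariates are omitted entirely.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per spouse per wave of satisfaction
# reports. Assumed columns: person (ID), wave (0, 1, 2, ...), satisfaction,
# partner_attr (partner's coded attractiveness), sex (0 = women, 1 = men).
df = pd.read_csv("marital_waves.csv")

# Does partner attractiveness predict the intercept (satisfaction at wave 0)
# and/or the slope (change per wave), and does either effect differ by sex?
model = smf.mixedlm(
    "satisfaction ~ partner_attr * sex * wave",
    data=df,
    groups="person",
    re_formula="~wave",  # random intercept and slope for each spouse
)
result = model.fit()
print(result.summary())
# partner_attr:sex       -> sex difference in the intercept effect
# partner_attr:sex:wave  -> sex difference in the slope effect
```

The sketch only shows where the two effects sit in such a model; the published analysis also modeled husbands and wives as nested within couples and included the full covariate set.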

Personally, I don’t believe that this analysis provides an accurate depiction of the sex difference. It’s hard for me to buy into the idea that you need twenty-eight covariates in this analysis, and even then, the sex difference only emerges in one place and not the other. In fact, we conducted an identical analysis on some of our own data that had the same variables, and we didn’t find a hint of the sex difference (if anything, the slope effect trended in the opposite direction).

Nevertheless, for the past five years, this debate has been distilled to “Team X says no sex difference, but Team Y says yes.” If someone wants to cite evidence for the absence of the sex difference, they have it; if someone wants to cite evidence for the presence of the sex difference, they can do that, too. This does not seem to be a good scientific recipe for getting closer to the truth.

I’m pretty confident in my belief that the sex difference here is tiny or nonexistent. But you know what? Maybe I’m wrong. If I want to call myself a scientist, I have to be open to that possibility. I have to be willing to say: Here are the data that would convince me to change my belief.

So here it is: I will update my belief if a preregistered test, using the same 28-covariate analysis in a new dataset, replicates the sex difference on the intercept found in Meltzer et al. (2014).

You may be thinking, it’s easy for me to say that, so long as no such dataset exists. But in fact, just the other day, I saw this new published paper (McNulty, Meltzer, Makhanova, & Maner, in press). It primarily examines a different (and totally fascinating!) research question, and it uses a new sample of N = 233 couples. But buried in the descriptions of the covariates in that paper are all of the key variables and all but one of the covariates required to directly replicate the earlier sex difference analysis reported in Meltzer et al. (2014).

Here is what I am committing to, publicly, right now: I have written up a preregistered analysis plan that provides the test I outline above. I will email Jim McNulty for the data they used in this new published manuscript, which I am confident that he will share with me. I will run the preregistered analysis on these data, and I will describe the results as a “Part 2” of this blog post. If the key finding from Meltzer et al. (2014) replicates—that is, if the sex difference on the intercept is significant—then I need to seriously consider the possibility that I am wrong, and I need to update my beliefs accordingly. If it is not, I hope that those scholars who believe in this particular sex difference will be willing to update their beliefs and/or conduct a highly powered test of their prediction.  

Either way, we’ll be getting closer to the truth rather than being stuck in an endless circle around it.

[1] When people talk about the “robust literature” showing that attractiveness matters more to men than to women, they could be talking about one of two things. First, they could be talking about this stated preference sex difference. Second, they might be talking about findings showing that, in hypothetical settings (e.g., viewing photographs), attractiveness tends to matter more to men than to women. In fact, we preregistered a study examining this context and found the sex difference! As I described in this earlier post, the size of the sex difference that we found in a very highly powered design was r = .13. 

Tuesday, February 20, 2018

Intuitive and Impossible: What do Short-Term and Long-Term Relationships Look Like?

People have long-term relationships and short-term relationships. In what ways do these two kinds of relationships differ?

You may find the answer to be extremely intuitive—or extremely counterintuitive—depending on your lay theories about relationships, or depending on which segment of the literature on human mating is more familiar to you.

[Figure caption: The ReCAST Model. Double lines are long-term relationships, and the single line is a short-term relationship.]

In a recent paper, we collected data on people’s real-life relationships over time—beginning at the first moment they met a partner—to compare the relationships that people think of as “long-term” and “short-term.” There is a vast literature that asks people what they want in these kinds of relationships, but there is far less data on people’s real-life experiences with short-term and long-term relationships and partners. We wanted to know: How exactly do these types of relationships differ, and when do these differences become apparent? It took us about 4 years to collect and publish these data, and they informed the development of something we call the ReCAST model.

Perhaps the most important finding was this one: Differences did not emerge right away. That is, it took a considerable period of time—typically weeks or months—for short-term and long-term relationships to diverge. Put another way: You can’t tell, early on, whether a relationship is short-term or long-term; the trajectories only pull apart once you’ve known someone for quite a while.

We have a high degree of confidence in these findings.[1] But here is today’s question: Are these findings intuitive and obvious? 

According to one type of reviewer (we had two reviewers like this), these data are extremely intuitive. These reviewers said: Researchers studying close relationships already know that relationships unfold gradually over time. Of course you cannot predict how long a relationship will last until two people have a chance to interact, assess interpersonal chemistry, and (preferably) have a few make-out sessions. These assumptions are built into the fabric of everything we have done for the past 30 years. Why would you try to test or publish something so obvious?

To another type of reviewer (we had four reviewers like this), these results were highly implausible. These reviewers said: Researchers studying evolved strategies know that people approach relationships very differently depending on whether that relationship is short-term or long-term. For example, women can view a photograph of a man and know from his chiseled features that he is good for a short-term but not a long-term relationship. Your data are at odds with the assumptions that are built into the fabric of everything we have done for the past 30 years. You can’t possibly be testing these predictions correctly—if your methods were right, you would have gotten different results. Therefore, these data shouldn’t be published.

Together, these reviews characterized our data as simultaneously obvious and implausible. And this juxtaposition highlights the risk of drawing on intuition when making scientific critiques.


Here is a short history of the Pendulum of Intuitiveness in psychological journals.

When I was in graduate school in the early-mid 2000s, the easiest way to get rejected from a journal was to try to publish something that felt obvious and familiar. One way that people would try to combat this pressure: Find a result that was counterintuitive. Hopefully, very counterintuitive. Like “wow, can you believe it?!” counterintuitive.

Sometimes, though, that counterintuitive finding didn’t emerge from a deep dive into two theories to discover where they made divergent predictions. Rather, the finding was something flashy—something a lay person wouldn’t have expected. Conducting data analysis felt more like gambling than detective work; ten obvious p < .05s were worth a lot less than one shocking (and perhaps “lucky”) p < .05. These pressures and strategies probably led to the publication of some counterintuitive findings that would be tough to replicate over some intuitive but easily replicable ones.

But within the last few years, terms like “counterintuitive” have become radioactive in the wake of recent methodological advances in our field. In other words, if a result seems surprising to you, now there is reason to suspect that it might be “too good to be true.”

The counterintuitive backlash makes sense. But it’s not a sufficient place to stop: Unless we want to keep swinging with the pendulum, we have to remember to continually question our intuitions at the same time. If we’re not willing to test our intuitions and publish the results—whether those results are themselves intuitive or counterintuitive—we sound more like advocates for “stuff we already know” than scientists asking questions about the world.

So intuition may be great for inspiring study ideas and informing your own personal Bayesian priors about whether a study is likely to work or replicate. But it is not a substitute for actual empirical research. And if that research is appropriately powered, theoretically grounded, and well conducted, the findings have value regardless of whether they happened to confirm or disconfirm your intuitions. After all, one scholar’s intuitive may be another scholar’s impossible.


[1] Please, please replicate us! The materials and preregistration can be found here. And don’t hesitate to email me if you have questions.

Monday, January 22, 2018

A Confederate is not a Condition

I made a mistake. I equated a person with an experimental condition.

In Study 1 of this article, we introduced N = 54 men to both a White and a Black female confederate in two separate face-to-face interactions. These two confederates – we’ll call them “Hannah” and “Kiara” (not their real names) – played their roles superbly and never forgot their lines. The study was a model of experimental control.

But the inferences I drew from these data were incorrect because of a statistical issue I did not appreciate at the time.

[Image caption: How would you label this pair of "conditions"?]

What we found was this: The men in our study (all of them White) tended to like the White confederate to the extent that they were politically conservative, but the men liked the Black confederate to the extent that they were liberal. I drew the inference that political orientation was associated with whether the men were attracted to members of their racial ingroup (i.e., the White partner) or outgroup (i.e., the Black partner).

But a logically equivalent description of these results reveals my inferential overreach: The men in our study liked Hannah more to the extent that they were politically conservative, but they liked Kiara more to the extent that they were liberal. The results might have been attributable to the women’s race…or to any of the other myriad differences between these two particular women.[1]

This is why you sample stimuli as well as participants.  Arguably, my sample size was not N = 54 (the number of participants), but N = 2 (the number of stimuli).


The above example may seem pretty straightforward to you, but the same issue frequently turns up in subtler—but equally problematic—forms. Let’s say I hypothesize that attractiveness inspires romantic desire more for men than for women in a face-to-face, heterosexual interaction. This makes intuitive sense…anecdotally, men seem to talk more about how hot women are than vice versa. Perhaps surprisingly, then, this sex difference does not emerge in speed-dating contexts where people meet a slew of opposite-sex partners who naturally vary in attractiveness (see here and direct replication here). But maybe it would emerge with a manipulation of attractiveness: If men and women each met an attractive and an unattractive partner, maybe this within-subjects attractiveness manipulation would inspire romantic desire more for men than for women?

[Figure caption: From Li et al. (2013). Each bar was generated by 42-51 raters but only 2 targets.]

Here’s a study that used exactly this approach to test the hypothesis that attractiveness will matter more for inspiring romantic desire in men than in women. It seems to find—and is frequently cited as showing—evidence for the hypothesized sex difference: In the figure above, one can clearly see that men differentiated the attractive and unattractive confederates much more strongly than women did.

But notice that this study has the same serious flaw that I described above with my confederate study. To see why, let’s once again use (fake) names: The men desired Rachel and Sally much more than Amanda and Liz, whereas women desired Brian and Karl just a bit more than James and Dan. The results certainly tell us something about the desirability of these particular confederates. But with such a small N (only 2 confederates per condition), we cannot generalize these findings to say anything meaningful about attractive and unattractive targets in general.

[Image caption: What is the N of this design: 93 or 8?]

The problem here is that stimuli (in this case, confederates) are nested within condition, just like participants are nested within condition in a between-subjects design. In order to generalize our results beyond the specific people who happen to be in our sample, we have to treat participant as a random factor in our designs. The same logic applies to stimuli: When they are nested within condition, we need to treat stimuli (e.g., confederates) as random factors because we want to generalize beyond the 2 or 4 or 8 confederates who happened to be part of our study.
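
As one concrete way to “treat stimuli as a random factor,” here is a minimal sketch of a crossed random-effects model in Python, assuming a design in which every participant rates several confederates. The data file and column names are hypothetical; statsmodels handles crossed random effects by treating the whole dataset as a single group and specifying variance components.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per participant-by-confederate rating.
# Assumed columns: desire, attractive (0/1 condition), participant_sex (0/1),
# participant (ID), stimulus (confederate ID).
df = pd.read_csv("confederate_ratings.csv")
df["all"] = 1  # single dummy group so both random factors can be crossed

model = smf.mixedlm(
    "desire ~ attractive * participant_sex",
    data=df,
    groups="all",
    re_formula="0",  # no extra random intercept for the single dummy group
    vc_formula={
        "participant": "0 + C(participant)",  # random intercepts for participants
        "stimulus": "0 + C(stimulus)",        # random intercepts for confederates
    },
)
result = model.fit()
print(result.summary())
# attractive:participant_sex tests whether the attractiveness manipulation
# matters more for one sex than the other, generalizing across BOTH the
# sampled participants and the sampled stimuli.
```

With only 2 confederates per condition, there is almost no information with which to estimate the stimulus variance component; that is the statistical face of the problem described above, and it is why dozens of stimuli per condition are needed.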

What happens if you regularly equate confederate with condition and use small samples of stimuli? Your effect size estimates will tend to be extremely unstable. Consider this study, which used N = 389 participants but only 10 male and 11 female confederates. They found an enormous sex difference in the opposite direction from the study described above: Confederate attractiveness affected women’s romantic desire much more strongly than men’s. If you were including this study in a meta-analysis, it would be more appropriate to assign it an N of 21 rather than 389 to reflect the imprecision of this particular sex-difference estimate.

So what to do? Power calculations with these designs are complex, but a good start would be to use at least N = 40 or 50 stimuli per condition and treat stimuli as a random factor. Then, any incidental differences between the experimental stimuli would likely wash out, and we could be reasonably confident that any effects of the “manipulation” were truly due to attractiveness. Yes, that’s probably too many stimuli for a study involving live confederates, so you may need to get creative—for example, many speed-dating studies provide this kind of statistical power. [2]

It’s easy to get tripped up by this issue, especially when you have confederates that you’ve carefully selected to differ in an obvious way. But don’t make the mistake. If a confederate is nested within condition in your design, you likely need to reconsider your design.

For more information about stimulus sampling challenges, see detailed discussions by Wells and Windschitl and Westfall and colleagues, as well as this paper that describes stimulus sampling challenges when studying sex differences in particular.

[1] Study 2 of the same paper replicated this interaction using N = 2,781 White participants and N = 24,124 White and Black targets, which allows us to have more confidence in the inference that this interaction is about race rather than peculiarities of particular stimuli. Nevertheless, I assure you that at the time, I would have tried to publish the two-confederate study on its own had I not had access to this larger Study 2 sample.

[2] Alternatively, you could manipulate the attractiveness of a single confederate (e.g., using makeup and clothing); at least one study has successfully done so (see Figure 1 here), although we have found executing such a manipulation to be challenging in our lab.

Tuesday, January 9, 2018

Two Lessons from a Registered Report

Long ago and far away, in Chicago, in 2006, I submitted one of my first papers as a graduate student. The topic was controversial, and so we were not particularly surprised, when the reviews came back, to see that the reviewers were skeptical of the conclusions we drew from our findings. They wanted more (as JPSP reviewers often do). They thought maybe we had overlooked a moderator or two…in fact, they could think of a whole laundry list of moderators that might produce the effect they thought we should have found in our data. So we ran 1,497 additional tests.

No, seriously. We counted. 1,497 post-hoc analyses to make sure that we hadn’t somehow overlooked the tests that would support Perspective X. We conducted them all and described them in the article (but there was still no systematic evidence for Perspective X).

If your work involves controversy, you’ve probably experienced something like this. It’s been standard operating procedure, at least in some areas of psychology.

Now, fast forward to 2017. My student Leigh Smith and I are about to launch a new study in the same controversial topic area, and it’s likely that we’ll get results that someone doesn’t like, one way or another. But this time, before we start conducting the study, we write up an analysis plan and submit it to Comprehensive Results in Social Psychology (CRSP), which specializes in registered reports. The analysis plan goes out for review, and reviewers—who have the luxury of not knowing whether the data will support Perspective X or Y or Z—thoughtfully recommend a small handful of additional analyses that could shed better light on the research question.

The analysis plan that emerges is one that everyone agrees should offer the best test of the hypotheses; importantly, the tests will be meaningful however they turn out. We run the study and report the tests. We submit the paper.

And then, instead of getting a decision letter back asking for 1,497 additional suggestions that someone thought would surely show support for Perspective X…the paper is simply published. The data get to stand as they are, with no poking and prodding to try to make them say something else.

There’s a lot to like about this brave new world.

Our new paper in CRSP addresses whether attractiveness (as depicted in photographs of opposite-sex partners) is more appealing to men than to women. I, like most other evolutionary psychologists, had always assumed that the answer to this question was “yes.”

But you know what? Those prior studies finding that sex difference in photograph contexts? Most of them were badly underpowered by today’s standards. Our CRSP paper was powered to detect a sex difference of q = .10 (i.e., a small effect) or larger, using a sample of approximately 1,200 participants and 600 photographs. These photographs came from the Chicago Face Database, and we used the ratings in the database of the attractiveness of each face (based on a sample of independent raters).

The paper has two take-home lessons that are relevant to the broader discussion of best practices:

[Figure caption: Is attractiveness more appealing to men than to women when people look at photographs? Yes, although the effect is quite small, and there's little evidence of hidden moderators.]

1. Even though prior studies of this sex difference were underpowered, the sex difference was there in our new study: r(Men) = .41, r(Women) = .28, q = .13, 95% CI (.08, .18). There is no chance that the prior studies were powered to find a sex difference as small as what we found. But it was hiding in there, nevertheless.[1]

Lesson #1: Perhaps weakly powered studies in the published literature can still manage to converge on truth. At least, perhaps this happens in cases where the presence or absence of p < .05 is/was not a hard criterion for publication. Sex differences might be one such example. (Still no substitute for a high-powered, direct test, of course.)

2. In this literature, scholars have posited many moderators in an attempt to explain why some studies show sex differences and some do not. For example, sex differences in the appeal of attractiveness are supposed to be bigger when people imagine a serious relationship, or when people evaluate potential partners in the low-to-moderate range of attractiveness. Sometimes, sex differences are only supposed to emerge when 2 or 3 or 4 moderators combine, like the Moderator Avengers or something. That wasn’t the case here: These purported moderators did not alter the size of the sex difference in the predicted manner, whether alone or in Avenger-mode combination.

Lesson #2: Perhaps we should be extremely skeptical of moderators that are hypothesized, frequently post hoc, to explain why Study X shows a significant finding but Study Y does not. Moderators within study? I’m on board. Moderators across studies? I’ll believe it when I see it meta-analytically.

For every single research question I dream up going forward, I will consider whether it could be a good candidate for a registered report. When I think about an idealized, all-caps form of SCIENCE that stays untethered from prior perspectives or ideology, that CRSP experience pretty much captures it. [2]


[1] This statement may shock some who think of me as some sort of sex-differences naysayer. Rather, my perspective is that this sex difference is larger in photograph contexts than live face-to-face contexts. Indeed, q = .13 is about 2-4 times larger than meta-analytic estimates of the same sex difference in initial attraction contexts or established close relationships (which are q = .05 or smaller). (Does it make me a naysayer to suggest that the sex differences here are extremely small, and that prior single studies are unlikely to have been powered to detect them?)

[2] And did I mention fast? This project went from “vague idea” to “in press” in less than 11 months. My prior best time for an empirical piece was probably twice as long.