Monday, June 18, 2018

How Critical Are We?

One perennial issue in the best-practices discussion is whether our discipline is overly critical or not critical enough. When we evaluate other people’s research, should we be increasing our focus on the positive aspects or the negative aspects?

My current view is “yes, both,” and the moderator [1] is whether we are talking about criticism that takes place pre- or post-publication. Pre-publication, I think we need to dial up the positivity; post-publication, I think we need to dial up the criticism.

=================

Pre-publication peer-review: We can afford to emphasize the positives

Lopsided criticism in peer-review
Here are two ways of envisioning the reviewer’s job. One way: The reviewer is the firewall that protects the world from weak manuscripts by pointing out all their flaws. A second way: The reviewer is a knowledgeable colleague who has been asked to offer input on ways to strengthen the manuscript.

As an editor, I find that—nine times out of ten—the latter approach ultimately makes for a better published literature. Here are two specific reviewer tactics that help tip the pre-publication peer review balance in a more constructive direction:

1. Reviews are especially helpful when they explain what a particular subfield can gain from the manuscript. Given that the manuscripts I handle are rarely (if ever) examining a topic that I myself study, I need reviewers to tell me about the value that the manuscript will have for them and their colleagues. Does the manuscript help to define or organize a problem? Will the findings be useful for other scholars when planning their own studies? Does the manuscript properly situate the findings in the literature they are trying to inform? It is extraordinarily informative when a reviewer says “Wow, my subfield really needs an article that does what this manuscript is trying to do.”

2. Reviews are especially helpful when they avoid getting hung up on small imperfections and inconsistencies that make a story less pretty and glossy. These small imperfections (e.g., a simple main effect was not significant on one of the five tests, different meta-analytic publication bias analyses reveal different conclusions) are very real parts of science, and all articles have some. Indeed, I would argue that the picture-perfect (but impossible) articles of the past emerged because authors and reviewers pushed each other to scrub away the imperfections. [2] As an editor, I prefer reviews that focus in depth on a few big picture concerns, if they exist (e.g., missing a large segment of the literature in a review, using the incorrect statistical test, drawing a conclusion that is not supported by the data). And if there are no big picture concerns, the reviews should say so.

Post-publication peer-review: We can afford to be more critical

I think we have a natural tendency to assume that findings enter an “official canon” when they are accepted for publication. Canon is for fiction, like Star Wars and Marvel. As scientists, we must fight the urge to canonize.

Fictional scientist Bruce Banner fights the urge to transform into the Hulk in two eponymous movies, but only the second one is canon and "counts."

Real scientists have to fight the urge to canonize articles just because they have crossed the threshold from unpublished to published.

After all, good science should spark debates. I study the psychology of mating and relationships because I think much of this literature is debatable and debate-worthy. I criticize others’ approaches; others criticize mine. And I have served as a reviewer of productive back-and-forth debates between other scholars. These experiences were sometimes stressful, but in my view, these criticisms all served to advance the science.

I find it bewildering that some journals and editors are reluctant to devote page space to debates and criticism of previously published work. I have heard people express the opinion that criticism belongs only in the review process; if an article survives this “due process,” it earns a shield against any further published criticism. This attitude has a perverse effect: It prevents debates from moving forward openly for all to evaluate and confines them to a closed review process. I would posit that blogs, Facebook, and Twitter have become popular means of scientific criticism and debate in part because journals do not commonly offer opportunities for the ongoing, post-publication peer-review that is an essential part of science.

I would love to see journals embrace post-publication criticism—especially the thoughtful and productive kind of criticism that could even merit publication on its own. Indeed, I would love to see all journals operate like Behavioral and Brain Sciences or PNAS, where post-publication criticism is encouraged (or even solicited) shortly after the initial release of an article. If we create additional avenues for post-publication peer-review, I think we will see a much needed shift in the balance of criticism in our field.




[1] Hidden?
[2] I have always liked this piece about how changes in our scientific practices require changes on the part of reviewers, too. 


Wednesday, May 16, 2018

Improvements in Research Practices: A Personal Power Ranking


Science is about shifting consensus (Grene, 1985; Longino, 1990). At a given point in time, scientists in a field might believe one thing, and later, they believe something else. For this reason, persuasion is the fuel that powers the scientific enterprise.  

The conversation about best practices in our field over the last decade is no different. It is a persuasion process: Scientists who believe that direct replications or statistical power or preregistration will improve the quality of our science attempt to convince scientists who do not hold this belief to change their views. When confronted with strong evidence, argumentation, and logic, skeptics should be willing to change their beliefs (or else they aren’t really practicing science).

I have been persuaded about many things. Sometimes I was persuaded when I simply learned more about a topic. Other times, I was persuaded because I learned that my previous views were incorrect in some way.

In celebration of scientific persuasion, I thought I would offer my own personal Top-5 power ranking. Relative to ten years ago, I have been persuaded about the value of all of these practices. What follows is a list of the top 5 improvements in research practices, ranked in the order that I have found them valuable for my own research. [1]

==================

Improving my own Research Practices: Top 5 Power Ranking

5. Use social media for scientific conversation: It is remarkable that scholars of all ranks can turn to social media to learn about research practices, share their knowledge, and debate scientific issues. When I was in graduate school, debates and critical discussions were largely confined to conferences and took place once or twice a year. Now, these conversations happen multiple times a day, with contributions from a diverse set of voices. In this way, social media has made civil scientific critique and debate a normal, everyday activity.

Why not higher? I still think that editors serve an extremely important role in curating scientific criticism and keeping the debate focused on the substantive issues. For reasons I can’t quite fathom, some journals are reluctant to give page space to debates about previously published work, so naturally social media stepped in to fill this void. Nevertheless, I would love to see journals play a larger role in post-publication peer review, perhaps by offering something like the PNAS “letters” format.

4. Conduct direct replications: I now routinely build direct replications into my work. For example, if we want to see whether an effect of Study 1 is moderated in Study 2, I might ensure that the effect in the Study 2 control condition ALSO functions as a direct replication of Study 1. I continue to conduct conceptual replications, of course, but I have certainly shifted my emphasis over the past few years. I now routinely assess the direct replicability of my findings before building on them, especially when I’m doing something new, and I no longer assume that other findings are directly replicable if they have only been demonstrated once.

Why not higher? If we were to over-prioritize direct replications, we could be at risk for enshrining particular operationalizations in lieu of the conceptual variables we really care about. For example, in my home topic area, many findings in the literature on stated mate preferences for traits are directly replicable, but they have ambiguous connections to the conceptual variables of interest: It’s very easy to replicate the finding that men and women say they want different things in a partner, but it’s not clear to what extent what people SAY they want maps onto what they ACTUALLY want when interacting with real potential partners (see this earlier post). We should not become so focused on direct replications that we forget to care about what our variables are actually measuring.

3. Focus on effect sizes (rather than significance): In graduate school, my programs of research often lived and died by p < .05. I am overjoyed that this trend is shifting; when I focus on effect sizes and confidence intervals rather than an arbitrary black-and-white decision rule, I learn much more from my data. This is especially true when comparing across studies: We used to think “This study was significant but this one was not…what happened?” When we focus on effect sizes, these comparisons take place on a continuum and do not rely on arbitrary cut-offs, and our attention shifts instead to the extent to which effect size estimates are consistent across studies.

Why not higher? I am receptive to the argument that, in many experimental contexts, the effect size “doesn’t really matter” in the sense that the manipulation is not intended for use in an applied context. Nevertheless, even when I run experiments, I still find it extremely useful to compare effect sizes across similar operationalizations, so that I can develop a sense of how confident I should be in a set of results (more confident if the effect sizes are similar across experiments using similar manipulations; less confident if the effect sizes seem to be all over the place).

2. Promote and participate in registered reports: As I noted in a prior post, I am a big fan of registered reports. I love how they function to get both reviewers and authors alike to agree that the results of a particular study will be informative however they turn out. I now think that our studies are generally stronger when we design them with this kind of informative potential from the beginning. I have largely stopped conducting the “shoot the moon” studies that are counterintuitive and cool if they “work” but wouldn’t really change my mind if they don’t.

Why not higher? If registered reports became the norm, what would happen to large pre-existing datasets that are not eligible? Would people stop investing in large-scale efforts going forward? I hope that we develop a registered report format that can make use of pre-existing data (e.g., perhaps in combination with meta-analytic approaches).

1. Improve power: My studies are more highly powered than they once were. And as a result, I feel as though I have been going on fewer wild goose chases: If I see a medium effect size with several hundred participants in Study 1, I would bet money that I am going to see it again in my direct replication in Study 2. In cases where I do make the decision to chase a small effect, that decision is now conscious and careful (i.e., I will decide if it’s really worth it to invest the resources to have adequate power to detect the effect if it is there), and if I decide I do want to chase it with a highly powered study, I learn something from my data no matter what happens.

Even though I ranked this #1, I still see a potential downside. For example, I am still running labor-intensive designs (e.g., confederate studies involving one participant at a time), but they take much longer, and so I am running fewer of them. But I have considered this tradeoff, and my assessment is that I am better off running a few highly powered versions of these studies than many underpowered ones.
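
To make the power tradeoff above concrete, here is a minimal Python sketch of an a priori sample-size calculation for a simple correlation. It relies on the standard Fisher z approximation; the function name `n_for_correlation` and the defaults (two-sided alpha of .05, 80% power) are illustrative assumptions rather than anything specified in this post, and real designs (e.g., dyadic or multilevel data) would call for a more tailored analysis.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a correlation of size r.

    Uses the Fisher z approximation for a two-sided test.
    The function name and defaults are illustrative assumptions.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be nonzero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    effect_z = math.atanh(abs(r))                  # Fisher z transform of r
    return math.ceil(((z_alpha + z_power) / effect_z) ** 2 + 3)


# Compare a medium, a small, and a very small correlation.
for r in (0.30, 0.10, 0.05):
    print(f"r = {r:.2f}: about {n_for_correlation(r)} participants")
```

Under these assumptions, a medium effect needs fewer than a hundred participants while a very small one needs thousands, which is why chasing a small effect has to be a deliberate decision.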

Will this be my top 5 power ranking forever? Probably not. [2] I look forward to future research practice improvements, and to having my mind changed yet again.




Grene, M. (1985). Perception, interpretation, and the sciences: toward a new philosophy of science. In Evolution at a crossroads: The new biology and the new philosophy of science.

Longino, H. E. (1990). Science as social knowledge: Values and objectivity in scientific inquiry. Princeton University Press.


[1] Note that this is not a Top-5 list of what developments convinced me that the field as it existed circa 2010 “had a problem” or “was in crisis.” I have been persuaded on that front, too, but that would be a different list.


[2] If you’re curious, here were four honorable mentions that did not quite make the top 5 for me, in no particular order: Preregistered analysis plans, transparency in reporting methods, selection methods for assessing publication bias, open data.

Monday, March 26, 2018

Testing the replicability of claims about a sex difference: A regrettable delay

A public commitment to update my own beliefs in response to a planned analysis I haven’t seen yet (Part 2)

In Part 1 of this series, I tried to make some headway in the debate over sex differences in the appeal of attractiveness in established relationships by putting my own beliefs on the line, pre-registering an analysis plan to see if a prior result would replicate, and publicly committing to update my beliefs regardless of how the results turned out. Unfortunately, this test will have to wait.

Although I assumed that it would be easy to obtain the data from a just-published manuscript, I was incorrect: Dr. McNulty has informed me that there will be a “regrettable delay” of unknown duration in sharing the data underlying the published manuscript until his team finishes working on and successfully publishes a second manuscript analyzing the same columns of data. Once the second manuscript is successfully published, he will be happy to share the data associated with the first manuscript, but he has no guess about how long that might take. Our full email exchange is included below, with Dr. McNulty’s permission.

I think it is fair to say that he and I are reading the APA ethical principle on data-sharing differently. In light of the field’s growing appreciation of the importance of openly and transparently sharing the data that is used in published manuscripts, I wonder if the language in the APA principle needs to be clarified or updated to reflect current standards in the field. (Indeed, the most surprising element to me of our whole exchange was Dr. McNulty noting that one of his colleagues had advised him against ever sharing the data associated with his published manuscript. Clearly, scholars have very different views about whether and when the data behind published papers should be shared with other researchers, and it seems crucial that our societies and journals provide clear guidance to authors going forward.)

In light of the indefinite and regrettable delay, any claims that this particular sex difference is robust seem premature. I have posted below the results of the Meltzer et al. (2014) 28-covariate analysis, as well as the Eastwick et al. (2014) unsuccessful replication attempt, so that readers can get a sense of the existing evidence for this sex difference. I have also left a blank space for the eventual inclusion of a direct replication from the new McNulty et al. (2018 online publication) dataset. I will fill it in once the data from those N = 233 couples are shared with me and I can conduct the preregistered analyses. 

I’ll close with an exhortation to other scholars: Future tests of this idea should examine it in a confirmatory way (i.e., with a detailed analysis plan that is written ahead of time, before seeing the data). My post did not end the debate, but I do hope that this approach will set a standard that helps researchers come together to address this question with strong methods going forward. 

Results of the 28-covariate analysis proposed by Meltzer et al. (2014) and the one direct replication to date (Eastwick et al., 2014). Meltzer et al. (2014) concluded that the association of coder-rated attractiveness with relationship satisfaction is stronger for men than for women (see first Intercept test). I will update the figure when the data for McNulty et al. (2018 online publication) are made available.
Bars indicate 95% CIs. Y axis is effect size q (interpretable like r).

My preferred approach to testing this sex difference is as follows: a random effects meta-analysis examining the effect of coder-rated attractiveness on relationship evaluations (e.g., satisfaction) in established (i.e., dating and/or married) relationships. That meta-analytic effect (k = 11, N = 2,976), which includes both the Meltzer et al. (2014) and Eastwick et al. (2014) data analyzed above, is shown here:



Bar indicates 95% CI. Y axis is effect size q (interpretable like r).
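
For readers unfamiliar with random effects meta-analysis, the following Python sketch shows one common way to pool study-level effect sizes, the DerSimonian-Laird estimator. It is a generic illustration and not necessarily the procedure used for the figure above; the function name `random_effects_pool` and all numbers in the example are hypothetical and are not the data from this meta-analysis.

```python
import math


def random_effects_pool(effects, variances):
    """Pool effect sizes with the DerSimonian-Laird random-effects estimator.

    effects:   per-study effect sizes (e.g., q values)
    variances: per-study sampling variances
    Returns the pooled estimate, a 95% CI, and the between-study variance.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q_stat = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q_stat - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "estimate": pooled,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
    }


# Hypothetical effect sizes and variances, invented purely for illustration.
example_effects = [0.12, -0.03, 0.05, 0.01]
example_variances = [0.004, 0.010, 0.006, 0.008]
print(random_effects_pool(example_effects, example_variances))
```

The key design choice is that each study's weight shrinks as the between-study variance grows, so no single large study dominates the pooled estimate when the studies disagree.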




Emails reprinted here, with permission:

March 7, 2018

Hello Jim,

I hope you enjoyed SPSP this year – it was good to run into you briefly. I am writing to request the data associated with your new paper, which looks really interesting: http://psycnet.apa.org/record/2018-05467-001?doi=1

In addition to the covariates in Table 2 and income (mentioned on p. 4), I would be very appreciative if you would also include extraversion if you have it. But I also recognize that, technically speaking, you are under no obligation to share extraversion given that it wasn’t mentioned in the published article.

My intention is simply to conduct this preregistered analysis plan. If you are curious, I also have written a blog post about the relevant interpretive issues – if you and/or Andrea would like to comment on the second part (once I write it), I would be happy to include your response on the blog.

Regards,

Paul

===========

March 8, 2018

Hi, Paul.

I enjoyed SPSP and it was good to run into you. It was astute of you to realize we have some more data to address our debate. I would be happy to share them with you eventually, but one of Andrea’s doctoral students is currently working on a manuscript that addresses this exact effect. They have been working on it off and on for some time now, but, as is typical, other priorities keep interfering. I fear it could undermine her project to share these valuable data with you and the world right now. That said, I do appreciate complete transparency, as well as your attempts to shed more light on this issue, and I would be happy to share all the data with you once her project is complete. Does that sound okay? I wish I had a good guess as to when that would be, but for some reason I still haven’t figured out how to predict how reviewers will feel about a particular paper. Haha.

Best,
Jim

===========

March 9, 2018

Hi Jim,

I totally understand wanting to make sure that your student will be able to publish his/her paper. And I realize that my email might not have been clear: I was only suggesting that I would report the results on the blog, not a journal article. You should of course be able to carve up the remaining dataset for journal articles as you see fit – I’m only requesting the data that were used in the in press publication (plus extraversion if you had it and were willing to share it -- but of course, I understand that you are under no obligation to do so since it’s not in the published article). I wouldn’t anticipate that a blog post on this particular analysis would interfere with your student’s ability to report and build off of it in a future article.

Regards,

Paul

===========

March 16, 2018

Hi Jim,

I just wanted to follow up with you on the message I sent last week requesting the data from your in press JPSP. I’m still excited to take a look, and I want to reiterate that my plan is only to share the results of the preregistered analyses on a blog (i.e., not a journal publication). In case it helps mitigate the concerns you articulated about wanting to publish analyses based on these data in a separate article, I had an idea: What if I only post the effect sizes and confidence intervals associated with the three sex difference tests that I preregistered (i.e., no other statistical information or detailed descriptives)?

I really hope that we can navigate these data sharing complexities ourselves in a friendly way – I am committed to making some progress on the sex difference question by conducting and reporting the analyses I preregistered on my blog however they turn out, and you of course should be able to publish additional analyses in the future off of these published data. I do think it’s important to keep in mind that the data I am requesting are now published, and that this means that ethically, they must be made available to “other competent professionals” (APA, 8.14, 2010). But I’d much rather do this in a friendly and informal way over email rather than going through the journal or APA or something.

If I don’t hear from you by next Friday (the 23rd), I’ll go ahead and update my blog to indicate that you declined to share the data, and we’ll go from there.

Regards,

Paul

===========

March 20, 2018

Paul,

I understand that you do not plan to pursue publication of the data you requested. And I believe you are probably correct that a blog will not interfere with a future publication. However, I must admit that the blogosphere is extremely foreign to me and I perceive that it seems to have some traction. I also have no idea what the future holds. I see no reason to risk even an unlikely negative outcome for one of our students. I’m not sure I was clear in my original email, but the student is not simply working with these data; she is working on a manuscript describing the sex difference in the association between partner attractiveness and marital satisfaction—the precise effect in question. I have received advice from two colleagues who are unattached to this debate and they tell me not to share the data yet (one says don’t share it at all).

Regarding any ethical obligation to share the data with you, my read of the APA ethics statement on this issue is that I am only obligated to share with “other competent professionals” who intend to replicate the result in question. APA Ethical Principles specify that "after research results are published, psychologists do not withhold the data on which their conclusions are based from other competent professionals who seek to verify the substantive claims through reanalysis and who intend to use such data only for that purpose, provided that the confidentiality of the participants can be protected and unless legal rights concerning proprietary data preclude their release" (Standard 8.14).”Retrieved: http://www.apa.org/pubs/journals/psp/index.aspx?tab=4. You left out of your email the critical qualifier that I bolded above. It is quite clear from your email, and from the fact that you preregistered a completely unrelated analysis of my covariates, that you have no intentions to verify our substantive claims but instead want to capitalize on our covariates to address your own research goals.

To be honest with you, Paul, what is frustrating to me about your latest email that threatens to post on your blog that I declined your request and potentially take up this issue with APA is that I did not decline your request. As I said in my original email, I will give you the data after the student working on this exact effect is finished, even though I do not believe I am obligated to do so, because I too am committed to science and understanding this sex difference. If you post anything on your blog about this other than the fact that there will be a regrettable delay in getting the data from us, please also post this entire string of emails so people can decide for themselves if I am being unethical.

Jim

===========

March 22, 2018

Hi Jim,

Thanks for your reply. It seems like we have different interpretations of the APA data-sharing principle (at least as it applies in this case). I thought it was self-evident that my proposed analysis was addressing a “substantive claim” of your published manuscript: You tested and reported a sex difference in the partner attractiveness-infidelity association, and concluded the following on pp. 15-16: “This latter sex difference is consistent with evidence that partner attractiveness is more important to men than it is to women (Li et al., 2013; McNulty, Neff, & Karney, 2008; Meltzer et al., 2014a, 2014b), and thereby challenges the idea that the importance of partner attractiveness is equivalent across men and women (see Eastwick & Finkel, 2008).” You had the opportunity to conduct the same analysis that you and your colleagues have argued is the best test of this sex difference (Meltzer et al., 2014a; this is the analysis I proposed in my blog post) to see if the Meltzer et al. (2014a) findings would replicate in this new dataset. Although you did not report this analysis, you claimed in the Discussion of your paper to have supported those findings anyway.

In my blog post, I proposed to reanalyze the data from your published paper in order to test the claim that “partner attractiveness is more important to men than it is to women” (p. 16). To me, it seems like the APA data-sharing principle (as well as the field’s current norms about the importance of openness and transparency) applies here. Nevertheless, I agree that multiple interpretations of the APA principle are possible and I appreciate your willingness to engage with me on this issue.

I’m disappointed that there will be a regrettable delay (as you note) in your sharing of these data. I’m also sad to hear that, in this day and age, your colleagues are advising you to delay or avoid sharing the data behind a published paper. I appreciate your willingness to allow me to post our email exchange, and I apologize if you worried that I would misrepresent you – that was definitely not my intention, and I agree with you that it is important to post the exchange for transparency’s sake.

Regards,

Paul

PS: Despite all this, I really do think the new paper is cool. One of the questions it addresses had come up a few days beforehand in my grad class.


Wednesday, March 7, 2018

Going on the record via preregistration

A public commitment to update my own beliefs in response to a planned analysis I haven’t seen yet (Part 1)

Update, 3/26/18: Unfortunately, my request for the data behind this recently published JPSP paper (McNulty, Meltzer, Makhanova, and Maner, 2018 online publication) was unsuccessful. Dr. McNulty has informed me that there will be a “regrettable delay” of unknown duration in sharing these now published data until his team writes up and successfully publishes a second manuscript on these same data columns. Part 2 of this blog post is here, along with our email exchange about the data sharing question. 

In my previous post, I talked about how essential it is that we, as scientists, remain open to the possibility of having our intuitions disconfirmed.

Now let’s see if I can put my money where my mouth is.

If I take my own admonishment seriously, I need to be willing to have my own intuitions and beliefs disconfirmed—even when those beliefs have developed through years of researching a particular topic.

Here’s one of my own findings in which I have a high degree of confidence. In a meta-analysis I conducted about five years ago, we examined whether a partner’s attractiveness was more romantically appealing to men than to women. We acquired a large collection of published and unpublished datasets (k = 97, N = 29,780) that spanned a variety of paradigms in which men and women reported on partners they had (at a minimum) met face-to-face. Overall, we found that the sex difference in the appeal of attractiveness was not significantly different from zero, and it did not matter whether the study examined initial attraction (e.g., speed-dating, confederate designs) or established relationships (e.g., dating couples, married couples).

Here is a hypothetical illustration of this finding: If a man’s satisfaction in a given relationship is predicted by his female partner’s attractiveness at r = .08, we might find that a woman’s satisfaction is predicted by her male partner’s attractiveness at about r = .03. Meta-analytically, the sex difference is about this size: r(difference) = .05 or smaller. You can interpret this r(difference) like you would interpret r = .05 in any other context—really small, hard to detect, and probably not practically different from zero.
However you slice the meta-analytic data, it is hard to find a sex difference in the appeal
of attractiveness in paradigms where participants have met partners face-to-face.
(p refers to the p value of the sex difference test statistic Qsex.) From here.

Interestingly, the sex difference in attractiveness is much larger when you ask men and women to use a rating scale to indicate how much they think they like physical attractiveness in a partner. The size of this “stated preference” sex difference is about r = .25 (see Table 1 in this paper). [1]

In other words, an r = .25 effect when people make judgments about what they think they like drops to r = .05 when people are responding to partners who they have actually met in real life. 

I find this “effect size drop” deeply fascinating. It opens two interesting questions that have guided much of my research:

1. If men and women truly differ in the extent to which they believe attractiveness to be important in a partner, what factors interfere with the application of these ideals when they evaluate partners in real life?

2. If there is essentially no difference between men and women in how much they actually prefer attractiveness in a real life partner, what sorts of social-cognitive biases might produce the sex difference in how much people think they prefer attractiveness in a partner?

I have spent considerable time and effort in the last decade examining these two questions in my research. We’ve found some answers, and yet there’s still a long way to go in this topic area.

All effect sizes are coded so positive values mean that attractiveness receives higher ratings/is a larger predictor for men than for women. I am prepared to update the table after I examine the new McNulty et al. (in press) data according to my preregistered analysis plan.

But back to my belief that I am putting on the line in this blog post: I believe that the sex difference is about r = .05 (or smaller) when people evaluate real-life partners. I feel pretty confident about this belief, given all the evidence I have seen. But there are other scholars who believe something entirely different.

================

Since we published the meta-analysis, two empirical articles have taken a strong stance against our conclusion that the sex difference in the appeal of attractiveness is small or nonexistent. I discussed one of them (Li et al., 2013) in an earlier post; given the tiny effective sample size of that study, I won’t discuss it further here. Instead, let’s talk about the second one: Meltzer, McNulty, Jackson, & Karney (2014).

This paper found the expected sex difference in a sample of N = 458 married couples. In brief, they found that women’s attractiveness predicted men’s satisfaction at r = .10, whereas men’s attractiveness predicted women’s satisfaction at r = -.05. That’s an r(difference) of .15, still pretty small but not zero (p = .046).
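
One common way to express the difference between two correlations is Cohen's q, the difference between their Fisher z-transformed values. The short Python sketch below applies it to the two correlations quoted above; treating r(difference) as Cohen's q is an assumption made here for illustration, and the function name `cohens_q` is hypothetical.

```python
import math


def cohens_q(r_men, r_women):
    """Difference between two correlations on the Fisher z scale (Cohen's q)."""
    return math.atanh(r_men) - math.atanh(r_women)


# Correlations quoted above: r = .10 for men, r = -.05 for women.
q = cohens_q(0.10, -0.05)
print(f"q = {q:.3f}")
```

For correlations this small, the Fisher transform barely changes the values, so q comes out almost identical to the raw difference of .15.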

One unusual element of this paper is that the authors only present this sex difference in one analysis, and it included a large number of covariates. Twenty-eight of them, to be exact. Another element worth noting is that there were actually two ways that the sex difference could have emerged—on the intercept of satisfaction or the slope of satisfaction. The effect that the authors focused on was the intercept; slope effects did not differ for men and women, r(difference) = .02.

Personally, I don’t believe that this analysis provides an accurate depiction of the sex difference. It’s hard for me to buy into the idea that you need twenty-eight covariates in this analysis, and even then, the sex difference only emerges in one place and not the other. In fact, we conducted an identical analysis on some of our own data that had the same variables, and we didn’t find a hint of the sex difference (if anything, the slope effect trended in the opposite direction).

Nevertheless, for the past five years, this debate has been distilled to “Team X says no sex difference, but Team Y says yes.” If someone wants to cite evidence for the absence of the sex difference, they have it; if someone wants to cite evidence for the presence of the sex difference, they can do that, too. This does not seem to be a good scientific recipe for getting closer to the truth.

I’m pretty confident in my belief that the sex difference here is tiny or nonexistent. But you know what? Maybe I’m wrong. If I want to call myself a scientist, I have to be open to that possibility. I have to be willing to say: Here are the data that would convince me to change my belief.

So here it is: I will update my belief if a preregistered test, using the same 28-covariate analysis in a new dataset, replicates the sex difference on the intercept found in Meltzer et al. (2014).

You may be thinking, it’s easy for me to say that, so long as no dataset of the kind exists. But in fact, just the other day, I saw this new published paper (McNulty, Meltzer, Makhanova, & Maner, in press). It primarily examines a different (and totally fascinating!) research question, and it uses a new sample of N = 233 couples. But buried in the descriptions of the covariates in that paper are all of the key variables and all but one of the covariates required to directly replicate the earlier sex difference analysis reported in Meltzer et al. (2014).

Here is what I am committing to, publicly, right now: I have written up a preregistered analysis plan that provides the test I outline above. I will email Jim McNulty for the data they used in this new published manuscript, which I am confident that he will share with me. I will run the preregistered analysis on these data, and I will describe the results as a “Part 2” of this blog post. If the key finding from Meltzer et al. (2014) replicates—that is, if the sex difference on the intercept is significant—then I need to seriously consider the possibility that I am wrong, and I need to update my beliefs accordingly. If it is not, I hope that those scholars who believe in this particular sex difference will be willing to update their beliefs and/or conduct a highly powered test of their prediction.  

Either way, we’ll be getting closer to the truth rather than being stuck in an endless circle around it.




[1] When people talk about the “robust literature” showing that attractiveness matters more to men than to women, they could be talking about one of two things. First, they could be talking about this stated preference sex difference. Second, they might be talking about findings showing that, in hypothetical settings (e.g., viewing photographs), attractiveness tends to matter more to men than to women. In fact, we preregistered a study examining this context and found the sex difference! As I described in this earlier post, the size of the sex difference that we found in a very highly powered design was r = .13. 

Tuesday, February 20, 2018

Intuitive and Impossible: What do Short-Term and Long-Term Relationships Look Like?

People have long-term relationships and short-term relationships. In what ways do these two kinds of relationships differ?

You may find the answer to be extremely intuitive—or extremely counterintuitive—depending on your lay theories about relationships, or depending on which segment of the literature on human mating is more familiar to you.

[Figure: The ReCAST Model. Double lines are long-term relationships, and the single line is a short-term relationship.]

In a recent paper, we collected data on people’s real-life relationships over time—beginning at the first moment they met a partner—to compare the relationships that people think of as “long-term” and “short-term.” There is a vast literature that asks people what they want in these kinds of relationships, but there is far less data on people’s real life experiences with short-term and long-term relationships and partners. We wanted to know: How exactly do these types of relationships differ, and when do these differences become apparent? It took us about 4 years to collect and publish these data, and they helped us inform and develop something we call the ReCAST model.

Perhaps the most important finding was this one: Differences did not emerge right away. That is, it took a considerable period of time (typically weeks or months) for short-term and long-term relationships to diverge. Put another way: You can’t tell, early on, whether a relationship is short-term or long-term; the trajectories only pull apart once you’ve known someone for quite a while.

We have a high degree of confidence in these findings.[1] But here is today’s question: Are these findings intuitive and obvious? 

According to one type of reviewer (we had two reviewers like this), these data are extremely intuitive. These reviewers said: Researchers studying close relationships already know that relationships unfold gradually over time. Of course you cannot predict how long a relationship will last until two people have a chance to interact, assess interpersonal chemistry, and (preferably) have a few make-out sessions. These assumptions are built into the fabric of everything we have done for the past 30 years. Why would you try to test or publish something so obvious?

To another type of reviewer (we had four reviewers like this), these results were highly implausible. These reviewers said: Researchers studying evolved strategies know that people approach relationships very differently depending on whether that relationship is short-term or long-term. For example, women can view a photograph of a man and know from his chiseled features that he is good for a short-term but not a long-term relationship. Your data are at odds with the assumptions that are built into the fabric of everything we have done for the past 30 years. You can’t possibly be testing these predictions correctly—if your methods were right, you would have gotten different results. Therefore, these data shouldn’t be published.

Together, these reviews characterized our data as simultaneously obvious and implausible. And this juxtaposition highlights the risk of drawing on intuition when making scientific critiques.

=================

Here is a short history of the Pendulum of Intuitiveness in psychological journals.

When I was in graduate school in the early-mid 2000s, the easiest way to get rejected from a journal was to try to publish something that felt obvious and familiar. One way that people would try to combat this pressure: Find a result that was counterintuitive. Hopefully, very counterintuitive. Like “wow, can you believe it?!” counterintuitive.

Sometimes, though, that counterintuitive finding didn’t emerge from a deep dive into two theories to discover where they made divergent predictions. Rather, the finding was something flashy—something a lay person wouldn’t have expected. Conducting data analysis felt more like gambling than detective work; ten obvious p < .05s were worth a lot less than one shocking (and perhaps “lucky”) p < .05. These pressures and strategies probably led to the publication of some counterintuitive findings that would be tough to replicate over some intuitive but easily replicable ones.

But within the last few years, terms like “counterintuitive” have become radioactive in the wake of recent methodological advances in our field. In other words, if a result seems surprising to you, now there is reason to suspect that it might be “too good to be true.”

The counterintuitive backlash makes sense. But it’s not a sufficient place to stop: Unless we want to keep swinging with the pendulum, we have to remember to continually question our intuitions at the same time. If we’re not willing to test our intuitions and publish the results—whether those results are themselves intuitive or counterintuitive—we sound more like advocates for “stuff we already know” than scientists asking questions about the world.

So intuition may be great for inspiring study ideas and informing your own personal Bayesian priors about whether a study is likely to work or replicate. But it is not a substitute for actual empirical research. And if that research is appropriately powered, theoretically grounded, and well conducted, the findings have value regardless of whether they happened to confirm or disconfirm your intuitions. After all, one scholar’s intuitive may be another scholar’s impossible.

---------------------------

[1] Please, please replicate us! The materials and preregistration can be found here. And don’t hesitate to email me if you have questions.

Monday, January 22, 2018

A Confederate is not a Condition

I made a mistake. I equated a person with an experimental condition.

In Study 1 of this article, we introduced N = 54 men to both a White and a Black female confederate in two separate face-to-face interactions. These two confederates – we’ll call them “Hannah” and “Kiara” (not their real names) – played their roles superbly and never forgot their lines. The study was a model of experimental control.

But the inferences I drew from these data were incorrect because of a statistical issue I did not appreciate at the time.

[Figure: How would you label this pair of “conditions”?]

What we found was this: The men in our study (all of them White) tended to like the White confederate to the extent that they were politically conservative, but the men liked the Black confederate to the extent that they were liberal. I drew the inference that political orientation was associated with whether the men were attracted to members of their racial ingroup (i.e., the White partner) or outgroup (i.e., the Black partner).

But a logically equivalent description of these results reveals my inferential overreach: The men in our study liked Hannah more to the extent that they were politically conservative, but they liked Kiara more to the extent that they were liberal. The results might have been attributable to the women’s race…or to any of the other myriad differences between these two particular women.[1]

This is why you sample stimuli as well as participants.  Arguably, my sample size was not N = 54 (the number of participants), but N = 2 (the number of stimuli).

===============

The above example may seem pretty straightforward to you, but the same issue frequently turns up in subtler—but equally problematic—forms. Let’s say I hypothesize that attractiveness inspires romantic desire more for men than for women in a face-to-face, heterosexual interaction. This makes intuitive sense…anecdotally, men seem to talk more about how hot women are than vice versa. Perhaps surprisingly, then, this sex difference does not emerge in speed-dating contexts where people meet a slew of opposite-sex partners who naturally vary in attractiveness (see here and direct replication here). But maybe it would emerge with a manipulation of attractiveness: If men and women each met an attractive and an unattractive partner, maybe this within-subjects attractiveness manipulation would inspire romantic desire more for men than for women?

[Figure: From Li et al. (2013). Each bar was generated by 42-51 raters but only 2 targets.]

Here’s a study that used exactly this approach to test the hypothesis that attractiveness will matter more for inspiring romantic desire in men than in women.  It seems to find—and is frequently cited as showing—evidence for the hypothesized sex difference: In the figure on the right, one can clearly see that men differentiated the attractive and unattractive confederates much more strongly than women did. 

But notice that this study has the same serious flaw that I described above with my confederate study. To see why, let’s once again use (fake) names: The men desired Rachel and Sally much more than Amanda and Liz, whereas women desired Brian and Karl just a bit more than James and Dan. The results certainly tell us something about the desirability of these particular confederates. But with such a small N (only 2 confederates per condition), we cannot generalize these findings to say anything meaningful about attractive and unattractive targets in general.

[Figure: What is the N of this design: 93 or 8?]

The problem here is that stimuli (in this case, confederates) are nested within condition, just like participants are nested within condition in a between-subjects design. In order to generalize our results beyond the specific people who happen to be in our sample, we have to treat participant as a random factor in our designs. The same logic applies to stimuli: When they are nested within condition, we need to treat stimuli (e.g., confederates) as random factors because we want to generalize beyond the 2 or 4 or 8 confederates who happened to be part of our study.

What happens if you regularly equate confederate with condition and use small samples of stimuli? Your effect size estimates will tend to be extremely unstable. Consider this study, which used N = 389 participants but only 10 male and 11 female confederates. They found an enormous sex difference in the opposite direction from the study described above: Confederate attractiveness affected women’s romantic desire much more strongly than men’s. If you were including this study in a meta-analysis, it would be more appropriate to assign it an N of 21 rather than 389 to reflect the imprecision of this particular sex-difference estimate.
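To make the 21-versus-389 point concrete, here is a minimal sketch of how much precision each nominal N buys. It assumes the textbook Fisher z approximation, in which a correlation based on N independent observations has a standard error of 1/sqrt(N - 3) and an inverse-variance weight of N - 3 in a meta-analysis. The function name, and the choice to treat the estimate as a plain correlation rather than the sex-difference contrast itself, are illustrative assumptions and not something taken from the study.

```python
import math

def fisher_z_precision(n):
    """Standard error, 95% CI half-width, and inverse-variance weight of a
    Fisher-z-transformed correlation based on n independent observations.
    (Hypothetical helper for illustration only.)"""
    se = 1.0 / math.sqrt(n - 3)
    return {"se": se, "ci_half_width": 1.96 * se, "weight": n - 3}

# Nominal N = number of participants; effective N = number of confederates.
by_participants = fisher_z_precision(389)
by_confederates = fisher_z_precision(21)

for label, p in [("N = 389", by_participants), ("N = 21", by_confederates)]:
    print(f"{label}: SE = {p['se']:.3f}, "
          f"95% CI half-width = {p['ci_half_width']:.3f}, "
          f"weight = {p['weight']}")
```

Under these assumptions, treating the study as N = 21 makes its confidence interval more than four times wider and cuts its meta-analytic weight from 386 to 18.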

So what to do? Power calculations with these designs are complex, but a good start would be to use at least N = 40 or 50 stimuli per condition and treat stimuli as a random factor. Then, any incidental differences between the experimental stimuli would likely wash out, and we could be reasonably confident that any effects of the “manipulation” were truly due to attractiveness. Yes, that’s probably too many stimuli for a study involving live confederates, so you may need to get creative—for example, many speed-dating studies provide this kind of statistical power. [2]
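The argument in the last few paragraphs can be illustrated with a small simulation. This is only a sketch under loudly stated assumptions: the true effect of condition is exactly zero, every stimulus has its own random "appeal" (standard deviation 1), every rating adds independent noise (standard deviation 1), and there is no participant random effect. The function names and all numeric settings are hypothetical choices for illustration, not values from any of the studies discussed here.

```python
import random
import statistics

def condition_difference(n_stimuli, n_participants, rng,
                         stimulus_sd=1.0, noise_sd=1.0):
    """Observed difference between two condition means when the true
    effect of condition is zero and stimuli are nested within condition."""
    condition_means = []
    for _ in range(2):  # two conditions, e.g. "attractive" vs. "unattractive"
        stimulus_means = []
        for _ in range(n_stimuli):
            # Each stimulus (confederate) has its own idiosyncratic appeal.
            stimulus_effect = rng.gauss(0.0, stimulus_sd)
            ratings = [stimulus_effect + rng.gauss(0.0, noise_sd)
                       for _ in range(n_participants)]
            stimulus_means.append(statistics.fmean(ratings))
        condition_means.append(statistics.fmean(stimulus_means))
    return condition_means[0] - condition_means[1]

def spread_of_estimates(n_stimuli, n_participants, n_sims=400, seed=1):
    """Standard deviation of the estimated condition effect across
    many simulated studies (larger = less stable estimates)."""
    rng = random.Random(seed)
    diffs = [condition_difference(n_stimuli, n_participants, rng)
             for _ in range(n_sims)]
    return statistics.stdev(diffs)

few_stimuli = spread_of_estimates(n_stimuli=2, n_participants=20)
few_stimuli_more_raters = spread_of_estimates(n_stimuli=2, n_participants=200)
many_stimuli = spread_of_estimates(n_stimuli=50, n_participants=20)

print(f"2 stimuli per condition, 20 raters each:  {few_stimuli:.2f}")
print(f"2 stimuli per condition, 200 raters each: {few_stimuli_more_raters:.2f}")
print(f"50 stimuli per condition, 20 raters each: {many_stimuli:.2f}")
```

In this sketch, the "effect" of a manipulation that does nothing swings widely from study to study when there are only 2 stimuli per condition, and collecting ten times as many raters barely helps. Moving to 50 stimuli per condition shrinks the spread several times over, which is the sense in which incidental differences between stimuli wash out.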

It’s easy to get tripped up by this issue, especially when you have confederates that you’ve carefully selected to differ in an obvious way. But don’t make the mistake. If a confederate is nested within condition in your design, you likely need to reconsider your design.

For more information about stimulus sampling challenges, see detailed discussions by Wells and Windschitl and Westfall and colleagues, as well as this paper that describes stimulus sampling challenges when studying sex differences in particular.



[1] Study 2 of the same paper replicated this interaction using N = 2,781 White participants and N = 24,124 White and Black targets, which allows us to have more confidence in the inference that this interaction is about race rather than peculiarities of particular stimuli. Nevertheless, I assure you that at the time, I would have tried to publish the two-confederate study on its own had I not had access to this larger Study 2 sample.

[2] Alternatively, you could manipulate the attractiveness of a single confederate (e.g., using makeup and clothing); at least one study has successfully done so (see Figure 1 here), although we have found executing such a manipulation to be challenging in our lab.

Tuesday, January 9, 2018

Two Lessons from a Registered Report

Long ago and far away, in Chicago, in 2006, I submitted one of my first papers as a graduate student. The topic was controversial, and so we were not particularly surprised, when the reviews came back, to see that the reviewers were skeptical of the conclusions we drew from our findings. They wanted more (as JPSP reviewers often do). They thought maybe we had overlooked a moderator or two…in fact, they could think of a whole laundry list of moderators that might produce the effect they thought we should have found in our data. So we ran 1,497 additional tests.

No, seriously. We counted. 1,497 post-hoc analyses to make sure that we hadn’t somehow overlooked the tests that would support Perspective X. We conducted them all and described them in the article (but there was still no systematic evidence for Perspective X).

If your work involves controversy, you’ve probably experienced something like this. It’s been standard operating procedure, at least in some areas of psychology.

Now, fast forward to 2017. My student Leigh Smith and I are about to launch a new study in the same controversial topic area, and it’s likely that we’ll get results that someone doesn’t like, one way or another. But this time, before we start conducting the study, we write up an analysis plan and submit it to Comprehensive Results in Social Psychology (CRSP), which specializes in registered reports. The analysis plan goes out for review, and reviewers—who have the luxury of not knowing whether the data will support Perspective X or Y or Z—thoughtfully recommend a small handful of additional analyses that could shed better light on the research question.

The analysis plan that emerges is one that everyone agrees should offer the best test of the hypotheses; importantly, the tests will be meaningful however they turn out. We run the study and report the tests. We submit the paper.

And then, instead of getting a decision letter back asking for 1,497 additional analyses that someone thought would surely show support for Perspective X…the paper is simply published. The data get to stand as they are, with no poking and prodding to try to make them say something else.

There’s a lot to like about this brave new world.

Our new paper in CRSP addresses whether attractiveness (as depicted in photographs of opposite-sex partners) is more appealing to men than to women. I, like most other evolutionary psychologists, had always assumed that the answer to this question was “yes.”

But you know what? Those prior studies finding that sex difference in photograph contexts? Most of them were badly underpowered by today’s standards. Our CRSP paper used a sample that was powered to detect whether the sex difference was q = .10 (i.e., a small effect) or larger (using a sample of N = ~1,200 participants and ~600 photographs). These photographs came from the Chicago Face Database, and we used the ratings in the database of the attractiveness of each face (based on a sample of independent raters).

The paper has two take-home lessons that are relevant to the broader discussion of best practices:

[Figure: Is attractiveness more appealing to men than to women when people look at photographs? Yes, although the effect is quite small, and there’s little evidence of hidden moderators.]

1. Even though prior studies of this sex difference were underpowered, the sex difference was there in our new study: r(Men) = .41, r(Women) = .28, q = .13, 95% CI (.08, .18). There is no chance that the prior studies were powered to find a sex difference as small as what we found. But it was hiding in there, nevertheless.[1]

Lesson #1: Perhaps weakly powered studies in the published literature can still manage to converge on truth. At least, perhaps this happens in cases where the presence or absence of p < .05 is/was not a hard criterion for publication. Sex differences might be one such example. (Still no substitute for a high powered, direct test, of course.)
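Lesson #1 can be sketched with a toy simulation. The assumptions are hypothetical and chosen only for illustration: the true sex difference is q = .13 in Fisher-z units (the value reported above), each small study has 50 men and 50 women, and each study's estimate is drawn from a normal distribution with the usual standard error for a difference between two independent Fisher-z correlations. The sketch compares the average estimate across all studies with the average across only the studies that reached p < .05.

```python
import math
import random
import statistics

TRUE_Q = 0.13        # true effect in Fisher-z units (value reported in the post)
N_PER_GROUP = 50     # hypothetical small study: 50 men and 50 women
N_STUDIES = 10_000   # hypothetical number of simulated studies

# Approximate standard error of a difference between two independent
# Fisher-z-transformed correlations, each based on N_PER_GROUP people.
se = math.sqrt(2.0 / (N_PER_GROUP - 3))

rng = random.Random(2018)
estimates = [rng.gauss(TRUE_Q, se) for _ in range(N_STUDIES)]

# "Publication filter": keep only studies with two-tailed p < .05.
significant = [q for q in estimates if abs(q / se) > 1.96]

mean_all = statistics.fmean(estimates)
mean_significant = statistics.fmean(significant)
power = len(significant) / N_STUDIES

print(f"Share of studies reaching p < .05: {power:.2f}")
print(f"Mean estimate, all studies:        {mean_all:.2f}")
print(f"Mean estimate, significant only:   {mean_significant:.2f}")
```

Under these assumptions, only about one study in ten reaches significance, yet the average across all studies lands close to the true value, whereas the average of the significant-only subset sits well above it. That is one way a literature of weak studies could still converge on the truth, provided p < .05 was not the gate to publication.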

2. In this literature, scholars have posited many moderators in an attempt to explain why some studies show sex differences and some do not. For example, sex differences in the appeal of attractiveness are supposed to be bigger when people imagine a serious relationship, or when people evaluate potential partners in the low-to-moderate range of attractiveness. Sometimes, sex differences are only supposed to emerge when 2 or 3 or 4 moderators combine, like the Moderator Avengers or something. That wasn’t the case here: These purported moderators did not alter the size of the sex difference in the predicted manner, whether alone or in Avenger-mode combination.

Lesson #2: Perhaps we should be extremely skeptical of moderators that are hypothesized, frequently post hoc, to explain why Study X shows a significant finding but Study Y does not. Moderators within study? I’m on board. Moderators across studies? I’ll believe it when I see it meta-analytically.

For every single research question I dream up going forward, I will consider whether it could be a good candidate for a registered report. When I think about an idealized, all-caps form of SCIENCE that stays untethered from prior perspectives or ideology, that CRSP experience pretty much captures it. [2]

Notes:

[1] This statement may shock some who think of me as some sort of sex-differences naysayer. Rather, my perspective is that this sex difference is larger in photograph contexts than live face-to-face contexts. Indeed, q = .13 is about 2-4 times larger than meta-analytic estimates of the same sex difference in initial attraction contexts or established close relationships (which are q = .05 or smaller). (Does it make me a naysayer to suggest that the sex differences here are extremely small, and that prior single studies are unlikely to have been powered to detect them?)

[2] And did I mention fast? This project went from “vague idea” to “in press” in less than 11 months. My prior best time for an empirical piece was probably twice as long.