Effect Size: Why Psychology Research So Often Fails
Why 61% of psychology studies can be wrong. From the replication crisis to Stapel, Wittchen and Gino – and why the media is making things worse.
You've probably seen it. "Chocolate makes you slim!" – the headline went viral in 2015. Newspapers, online portals, talk shows: everyone euphorically reported on a "scientific study" proving that chocolate consumption helps with weight loss. The effect size? Tiny. Cohen's d was about 0.2 – practically meaningless. Participants lost only about 10 percent more weight than the control group – on a low-carb diet, which means little in absolute terms. But the media? They wrote "15 percent more weight loss" and completely omitted that these were relative improvements with hardly any practical relevance.
Here's the problem. John Bohannon, the science journalist behind the study – he published it under the pseudonym "Johannes Bohannon" – had deliberately designed it as a PR stunt. He wanted to show how easily journalists and news outlets pass junk science on to their readers. And it worked. People wanted to believe that chocolate makes you slim. So they ignored the numbers. Such cases reveal a systemic problem.
How the Media Sells Science
The reporting on the chocolate study reveals systemic failure. Journalists rarely read the original studies. Instead, they rely on press releases – which are optimized for clicks, not accuracy.
Relative risk instead of absolute risk. That's the most popular trick. "50 percent increased cancer risk" sounds dramatic. That this represents an increase from 2 to 3 percent? Nobody cares. The headline is born, the context dies. Clickbait beats accuracy every time.
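The arithmetic behind the trick fits in a few lines. Here is a minimal sketch in Python, using the illustrative jump from 2 to 3 percent mentioned above (the numbers are examples, not data from any specific study):

```python
# Illustrative numbers only: a baseline risk of 2% rising to 3%.
baseline_risk = 0.02   # risk in the unexposed group
exposed_risk = 0.03    # risk in the exposed group

relative_increase = (exposed_risk - baseline_risk) / baseline_risk   # 0.50 -> "50 percent higher risk!"
absolute_increase = exposed_risk - baseline_risk                     # 0.01 -> one percentage point

print(f"Relative increase: {relative_increase:.0%}")   # the headline number
print(f"Absolute increase: {absolute_increase:.1%}")   # the number a reader actually needs
```

Same data, two very different stories. The headline picks the bigger number.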
It's not just about lazy journalists. It's about a system that punishes accuracy. Those who explain the complex context of a study lose readers. Those who write snappy, false headlines gain clicks. The incentives are perverse. Readers make decisions based on half-truths.
The Replication Crisis: The Numbers Speak for Themselves
Imagine you're building a house. The architect says: "The foundation holds with 39 percent probability." Would you move in?
That's exactly what's happening in psychology. The Reproducibility Project (Open Science Collaboration, 2015) attempted to replicate 100 studies from three top journals. The result: Only 39 percent of original findings could be reproduced. In the original studies, 97 percent showed "significant" results. In replications, only 36 percent. Effect sizes in replications were about half as large as in originals.
These aren't exceptions. This is a systemic problem.
The p-Value Fetish: Why 0.05 Is a Problem
What does p < 0.05 actually mean? Not what you think.
A p-value doesn't tell you how likely it is that your hypothesis is correct. It only tells you how likely your data are – assuming the null hypothesis is true. That's a subtle but critical difference.
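In symbols: p = P(data at least this extreme | H0 is true) – which is not the same thing as P(H0 is true | data), the quantity most readers actually want.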
Simmons, Nelson, and Simonsohn (2011) showed in their study "False-Positive Psychology" what happens when researchers use flexibility. With standard "researcher degrees of freedom" – the leeway in when to stop measuring, which variables to analyze, which outliers to remove – they could generate a "significant" effect in 61 percent of cases. Even when no effect existed.
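Simmons and colleagues ran more elaborate simulations, but the mechanism is easy to sketch. The following illustration is not their code – the two correlated outcome measures and the sample size of 20 per group are arbitrary choices – but it shows the core move: even with no true effect, reporting whichever of several tests happens to cross p < 0.05 pushes the false-positive rate well above 5 percent.

```python
# Sketch of a single "researcher degree of freedom": measuring two outcome
# variables and reporting whichever comparison happens to reach p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group = 5_000, 20
flexible_hits = 0

for _ in range(n_experiments):
    # Two correlated outcome measures per group; no true group difference exists.
    group_a = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], n_per_group)
    group_b = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], n_per_group)

    p_values = [
        stats.ttest_ind(group_a[:, 0], group_b[:, 0]).pvalue,                 # outcome 1
        stats.ttest_ind(group_a[:, 1], group_b[:, 1]).pvalue,                 # outcome 2
        stats.ttest_ind(group_a.mean(axis=1), group_b.mean(axis=1)).pvalue,   # their average
    ]
    flexible_hits += min(p_values) < 0.05   # report whichever test "worked"

print(f"False-positive rate with flexible reporting: {flexible_hits / n_experiments:.1%}")
# Well above the nominal 5% - and this is just one degree of freedom.
```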
The American Statistical Association (Wasserstein & Lazar, 2016) published an unprecedented statement in 2016. Six principles for proper handling of p-values. The core message: A p-value doesn't measure effect size or result importance.
The system rewards significance without regard for effect size. This opens the door for manipulation.
Effect Sizes: The Forgotten Currency
Here's where Cohen's d comes in. Jacob Cohen (1988) established conventions for effect sizes: d = 0.2 is a small effect, d = 0.5 is a medium effect, d = 0.8 is a large effect.
A small effect with d = 0.2 is barely visible. Practically irrelevant. And this is exactly what the replication crisis boils down to.
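For readers who want the formula: Cohen's d is simply the difference between two group means, divided by their pooled standard deviation. A minimal sketch in Python, with invented example data:

```python
# Cohen's d: the difference between two group means in units of their pooled
# standard deviation. The example values below are made up for illustration.
import numpy as np

def cohens_d(group_a, group_b):
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled standard deviation, using the unbiased sample variances.
    pooled_sd = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Weight loss in kg after three weeks (invented numbers):
treatment = [2.3, 1.8, 2.9, 2.1, 2.6, 1.9, 2.4, 2.2]
control   = [2.0, 1.7, 2.5, 1.9, 2.3, 1.8, 2.1, 2.0]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```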
The power posing example illustrates this perfectly. The original study by Carney, Cuddy, and Yap (2010) with only 42 participants claimed that a dominant posture increases testosterone and decreases cortisol. The replication by Ranehill et al. (2015) with 200 participants found: no hormonal effects. The effect on self-reported feelings of power was only d ≈ 0.2 – and only in men.
The original effect sizes were systematically inflated. The original showed d ≈ 0.6. The replication d ≈ 0.2. That's the difference between "impressive" and "who cares?"
The difference between statistical and practical significance is crucial. A p-value tells you that a difference exists. Effect size tells you whether that difference is even relevant.
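A quick illustration of that difference (the true effect of d = 0.02 and the sample size of 100,000 per group are invented for the example): with enough participants, even a negligible effect sails past the 0.05 threshold.

```python
# With a large enough sample, even a trivially small effect becomes "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000
control = rng.normal(0.00, 1, n)
treatment = rng.normal(0.02, 1, n)   # a true but practically meaningless difference

_, p = stats.ttest_ind(treatment, control)
d = (treatment.mean() - control.mean()) / np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
print(f"p = {p:.4f}, Cohen's d = {d:.3f}")
# p typically lands far below 0.05 here, while d stays around 0.02 -
# statistically "significant", practically irrelevant.
```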
The Major Scandals: From Stapel to Gino
The numbers from the replication crisis show statistical problems. Add to that deliberate fraud in research.
Diederik Stapel (2011): 58 retracted publications. The biggest fraud case in psychology history. The Dutch social psychologist invented complete datasets over years. He claimed to have conducted studies that never happened. The Levelt Committee (2012) spoke of a "culture of bad science" – a system that enabled and protected deception.
Hans-Ulrich Wittchen (2019-2024): The PPP study at TU Dresden. Budget: 2.5 million euros from the German health system. The "Staffing in Psychiatry and Psychosomatics" study was supposed to form the basis for new care guidelines. Instead: invented clinic data, too few participating clinics, possible misuse of project funds. The new guidelines were implemented – without the study data. In 2024, fraud charges were filed.
Francesca Gino (2023-2025): The irony is hard to overstate. A Harvard professor who researched honesty. Data Colada – the blog trio Simonsohn, Simmons, and Nelson – uncovered manipulations in four papers. A forensic analysis of the Excel files, specifically the calcChain.xml metadata, showed that rows had been manually moved between experimental conditions. Harvard's 1,300-page investigation report confirmed the misconduct. In March 2024, she was suspended. In 2025, her tenure was revoked – the first tenure revocation at Harvard since the 1940s.
All three cases share commonalities: Prominent researchers. Years of ongoing deception. Discovery by outsiders, not peer review. Effect sizes that were too good to be true.
Why the System Rewards Fraud
The problem isn't individual moral failure. The problem is structural – and encompasses all levels: researchers, journals, media, public.
"Publish or Perish" – publish or disappear. Careers depend on publication counts. Journals prefer positive results 96 percent to 44 percent (Lakens, 2021). Negative results disappear in the file drawer. Effect sizes hardly play a role in publication decisions.
This leads to a dangerous cycle: p-hacking – analyzing until p < 0.05 is reached. Selective reporting – only reporting significant results from multiple measurements. And when all else fails: data fabrication.
Most researchers start with legitimate leeway. Career pressure increases. Incremental ethical compromises accumulate. After ten years, you're Stapel. Or Gino.
The media amplify the problem. They take what journals publish and twist it further for clicks. The public is left with the finished product: a world where chocolate makes you slim and power posing changes hormones.
Effect sizes could serve as protection. If journals required that significant results also be practically relevant – many "significant" findings would be recognized as trivial. The incentive to manipulate would decrease.
Open Science: The Revolution
There's hope. The Open Science movement is growing.
Preregistered studies establish the hypothesis before the experiment. Registered Reports flip the peer review process: the journal accepts or rejects the paper based on methodology – before results are known.
The numbers are encouraging. Scheel, Schijen, and Lakens (2021) compared the standard literature with Registered Reports: 96 percent positive results in the standard literature, only 44 percent in Registered Reports. This isn't a deterioration in quality. This is simply honest.
The Many Labs projects show that replication works. Many Labs 2 (Klein et al., 2018) tested 28 effects in 125 samples from 36 countries with 15,305 participants. This works. When done right.
The Center for Open Science established the TOP Guidelines (Transparency and Openness Promotion). Over 1,000 journals have adopted them. Platforms like OSF and AsPredicted make preregistration easy.
The change is slower than it should be. But it's happening.
What You Can Do Now
You don't need to be a statistician to read scientific studies critically. Three questions suffice: How large is the sample? Everything under n = 100 is suspicious. Was the effect size reported? If only the p-value is given, half the story is missing. Is the study preregistered? An indicator of greater trustworthiness.
Check the effect size before trusting a study.
If you conduct research yourself or use studies for decisions: Demand effect sizes. Ignore p-values without context. Ask about practical relevance. And when the next headline proclaims that X causes Y – ask for the numbers. Behind the headline.
Conclusion
Psychology stands at a crossroads. The system of recent decades has failed. 39 percent replication rate. 58 retracted publications by Stapel alone. A p-value system that invites manipulation. And media that make everything worse.
But there's a way out. Effect sizes must become the standard. Preregistration must be the norm. Registered Reports must reach the mainstream. And you – you must read critically.
The next study you read – ask for the effect size. Not just the p-value. That's the first step back to trustworthy science.
Sources
Bohannon, J., Koch, D., Homm, P., & Driehaus, A. (2015). Chocolate with High Cocoa Content as a Weight-Loss Accelerator. International Archives of Medicine, 8(55). (Retracted)
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. https://doi.org/10.1177/0956797611417632
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129-133. https://doi.org/10.1080/00031305.2016.1154108
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Levelt Committee, Noort Committee, & Drenth Committee (2012). Flawed science: The fraudulent research practices of social psychologist Diederik Stapel.
Scheel, A. M., Schijen, M. R. M. J., & Lakens, D. (2021). An excess of positive results: Comparing the standard psychology literature with Registered Reports. Advances in Methods and Practices in Psychological Science, 4(2).
Klein, R. A., et al. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443-490. https://doi.org/10.1177/2515245918810225
Carney, D. R., Cuddy, A. J., & Yap, A. J. (2010). Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21(10), 1363-1368.
Ranehill, E., et al. (2015). Assessing the robustness of power posing: No effect on hormones and risk tolerance in a large sample of men and women. Psychological Science, 26(5), 653-656.
Nosek, B. A., et al. (2015). Promoting an open research culture. Science, 348(6242), 1422-1425. https://doi.org/10.1126/science.aab2374