The Replication Crisis: When Science Couldn't Reproduce Itself
Charlottesville, Virginia, 2015. The Open Science Collaboration publishes results from the Reproducibility Project: Psychology.
They attempted to replicate 100 studies from top psychology journals.
Only 36% replicated.
Not 36% with minor differences. Only 36% showed the same effect at all. 64% of published, peer-reviewed findings couldn't be reproduced.
This wasn't isolated to psychology:
Cancer biology (2012): Amgen scientists tried to replicate 53 "landmark" studies. Only 6 (11%) replicated.
Preclinical research (2011): Bayer couldn't replicate 65% of published findings.
Economics (2016): Only 61% of studies replicated.
The replication crisis was everywhere.
Peer-reviewed papers in top journals—findings that shaped policy, guided treatment, informed theory—couldn't be reproduced.
Not because of fraud (though some of it was fraud). Because of:
- Statistical manipulation (p-hacking)
- Publication bias (negative results unpublished)
- Underpowered studies (too small to detect real effects)
- Flexibility in analysis (researcher degrees of freedom)
- Career incentives (publish or perish)
The system that was supposed to ensure quality—peer review, replication, scientific method—broke down.
Science's hardening had created perverse incentives. Publish quickly. Produce novel results. Get citations. Get grants. Get tenure.
Quality became secondary to quantity.
And when researchers tried to replicate published findings—the foundation of scientific knowledge—half or more of them failed.
This wasn't a crisis of understanding (like quantum mechanics) or a crisis of compatibility (like relativity vs. quantum mechanics).
This was a crisis of trust.
If published findings don't replicate, what is scientific knowledge?
Let's examine how the replication crisis was discovered, what caused it, why the system incentivized bad science, and whether science can fix itself.
THE DISCOVERY: Something Is Wrong
EARLY WARNINGS (2005-2011)
JOHN IOANNIDIS (2005):
┌─────────────────────────────────────────┐
│ Paper: "Why Most Published Research     │
│ Findings Are False"                     │
│                    ↓                    │
│ Argument:                               │
│ • Low statistical power                 │
│ • Publication bias                      │
│ • Financial interests                   │
│ • Flexibility in analysis               │
│                    ↓                    │
│ Conclusion: Majority of findings likely │
│ false                                   │
│                    ↓                    │
│ Controversial but prescient             │
└─────────────────────────────────────────┘
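Ioannidis's argument is, at bottom, conditional probability: the positive predictive value (PPV) of a significant finding. Here is a minimal sketch of that arithmetic in Python; the prior and power values are illustrative assumptions, not figures from his paper.

```python
# Positive predictive value (PPV): the share of "significant"
# results that reflect a true effect.
#   PPV = (power * prior) / (power * prior + alpha * (1 - prior))

def ppv(prior, power, alpha=0.05):
    """P(hypothesis is true | p < alpha)."""
    true_positives = power * prior          # real effects, detected
    false_positives = alpha * (1 - prior)   # null effects, false alarms
    return true_positives / (true_positives + false_positives)

# Well-powered test of a plausible hypothesis (assumed numbers):
print(ppv(prior=0.5, power=0.80))   # ~0.94: most findings true
# Underpowered test of a long-shot hypothesis (assumed numbers):
print(ppv(prior=0.1, power=0.35))   # ~0.44: most findings false
```

Under the second set of assumptions, the majority of "significant" findings are false—which is exactly the paper's titular claim.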
PHARMACEUTICAL REPLICATION FAILURES:
┌─────────────────────────────────────────┐
│ Amgen (2012): 6/53 (11%) replicated     │
│ Bayer (2011): 35% replicated            │
│                    ↓                    │
│ Companies losing millions on failed     │
│ drug development                        │
│                    ↓                    │
│ Based on published academic research    │
│ that didn't replicate                   │
└─────────────────────────────────────────┘
THE SCALE BECOMES CLEAR:
┌─────────────────────────────────────────┐
│ Not isolated incidents                  │
│                    ↓                    │
│ Systematic problem                      │
│                    ↓                    │
│ Across multiple fields                  │
└─────────────────────────────────────────┘
By 2011, it was clear: Science had a reproducibility problem.
THE REPLICATION STUDIES: Testing Published Science
REPRODUCIBILITY PROJECT: PSYCHOLOGY (2015)
THE STUDY:
┌─────────────────────────────────────────┐
│ Selected 100 studies from top journals  │
│                    ↓                    │
│ High-powered replications               │
│                    ↓                    │
│ Pre-registered protocols                │
│                    ↓                    │
│ Contacted original authors              │
└─────────────────────────────────────────┘
RESULTS:
┌─────────────────────────────────────────┐
│ 97% of original studies: p < 0.05       │
│ (statistically significant)             │
│                    ↓                    │
│ 36% of replications: p < 0.05           │
│                    ↓                    │
│ Effect sizes: ~50% of original          │
│                    ↓                    │
│ 64% FAILED TO REPLICATE                 │
└─────────────────────────────────────────┘
PATTERN:
┌─────────────────────────────────────────┐
│ Surprising results less likely to       │
│ replicate                               │
│                    ↓                    │
│ Novel findings = More likely false      │
│                    ↓                    │
│ Exactly what gets published and cited   │
└─────────────────────────────────────────┘
CANCER BIOLOGY:
┌─────────────────────────────────────────┐
│ Reproducibility Project: Cancer Biology │
│ (ongoing)                               │
│                    ↓                    │
│ High-profile papers                     │
│                    ↓                    │
│ Many failing to replicate               │
│                    ↓                    │
│ Drug development based on unreplicable  │
│ research                                │
└─────────────────────────────────────────┘
ECONOMICS:
┌─────────────────────────────────────────┐
│ 18 studies from top journals            │
│                    ↓                    │
│ 11 replicated (61%)                     │
│                    ↓                    │
│ Better than psychology but still        │
│ concerning                              │
└─────────────────────────────────────────┘
SOCIAL SCIENCES GENERALLY:
┌─────────────────────────────────────────┐
│ Replication rates: 40-70% depending on  │
│ field                                   │
│                    ↓                    │
│ Means: 30-60% of published findings are │
│ false                                   │
└─────────────────────────────────────────┘
Roughly half of published science doesn't replicate.
Let that sink in.
THE CAUSES: Why Science Broke
P-HACKING (Statistical Manipulation)
WHAT IT IS:
┌─────────────────────────────────────────┐
│ Running analyses until p < 0.05         │
│                    ↓                    │
│ "Researcher degrees of freedom":        │
│ • Try different exclusion criteria      │
│ • Test multiple outcomes                │
│ • Stop collecting data when significant │
│ • Add covariates until significant      │
│                    ↓                    │
│ Inflates false positive rate            │
└─────────────────────────────────────────┘
EXAMPLE:
┌─────────────────────────────────────────┐
│ Study: Does listening to "When I'm      │
│ Sixty-Four" make you younger?           │
│                    ↓                    │
│ Simmons et al. (2011):                  │
│ • Showed p-hacking can "prove" absurd   │
│ claims                                  │
│ • Demonstrated how easy it is to get    │
│ p < 0.05                                │
│                    ↓                    │
│ Problem: This is routine practice       │
└─────────────────────────────────────────┘
THE DAMAGE:
┌─────────────────────────────────────────┐
│ p < 0.05 supposed to mean:              │
│ • 5% chance of false positive           │
│                    ↓                    │
│ With p-hacking:                         │
│ • 50%+ chance of false positive         │
│                    ↓                    │
│ Most "significant" results are noise    │
└─────────────────────────────────────────┘
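The inflation is easy to demonstrate. The simulation below (a sketch with arbitrary batch sizes and trial counts) exercises just one researcher degree of freedom—optional stopping: both groups are drawn from the same distribution, and the "researcher" tests after every batch, stopping at the first p < 0.05.

```python
# Optional stopping under a true null effect: peeking after every
# batch inflates the false positive rate well past the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeks_until_significant(batch=10, max_n=100):
    """Add `batch` subjects per group at a time, test after each
    batch, stop as soon as p < 0.05. True if ever 'significant'."""
    a, b = np.empty(0), np.empty(0)
    while len(a) < max_n:
        a = np.concatenate([a, rng.normal(0, 1, batch)])  # no real effect
        b = np.concatenate([b, rng.normal(0, 1, batch)])
        if stats.ttest_ind(a, b).pvalue < 0.05:
            return True
    return False

trials = 2000
hits = sum(peeks_until_significant() for _ in range(trials))
print(hits / trials)  # roughly 0.15-0.20, not the nominal 0.05
```

That is a single degree of freedom in isolation; Simmons et al. showed that stacking several together can push the false positive rate past 60%.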
HARKING (Hypothesizing After Results Known):
┌─────────────────────────────────────────┐
│ Collect data first                      │
│                    ↓                    │
│ See what's significant                  │
│                    ↓                    │
│ Present as if hypothesis came first     │
│                    ↓                    │
│ Circular: "Predicted" what you already  │
│ observed                                │
└─────────────────────────────────────────┘
Statistical significance became the goal, not a tool.
PUBLICATION BIAS: The File Drawer Problem
THE BIAS:
┌─────────────────────────────────────────┐
│ Positive results: Published             │
│ Negative results: Not published         │
│                    ↓                    │
│ "File drawer": Failed studies never see │
│ light                                   │
└─────────────────────────────────────────┘
THE CONSEQUENCES:
┌─────────────────────────────────────────┐
│ If 20 labs test hypothesis:             │
│ • 1 gets p < 0.05 by chance (5%)        │
│ • 19 get null results                   │
│                    ↓                    │
│ The 1 publishes                         │
│ The 19 don't                            │
│                    ↓                    │
│ Literature shows 100% support           │
│                    ↓                    │
│ Reality: 95% null                       │
└─────────────────────────────────────────┘
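The arithmetic in the box is worth making explicit: under a true null, each lab has a 5% false positive rate, so the expected number of "hits" across 20 independent labs is exactly one, and the chance of at least one is close to two-thirds. A quick check:

```python
# 20 independent labs test a hypothesis that is truly null (alpha = 0.05).
expected_hits = 20 * 0.05          # 1.0: about one lab hits p < 0.05
p_at_least_one = 1 - 0.95 ** 20    # ~0.64: odds the literature gets a "finding"
print(expected_hits, p_at_least_one)
```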
EXAMPLE: CANDIDATE GENE STUDIES
┌─────────────────────────────────────────┐
│ Hundreds of genes "associated" with     │
│ intelligence, personality, etc.         │
│                    ↓                    │
│ Large-scale studies: Almost none        │
│ replicate                               │
│                    ↓                    │
│ Decades of literature: Mostly false     │
│                    ↓                    │
│ Because negative results weren't        │
│ published                               │
└─────────────────────────────────────────┘
WHY IT HAPPENS:
┌─────────────────────────────────────────┐
│ Journals reject null results            │
│                    ↓                    │
│ "Not interesting"                       │
│                    ↓                    │
│ Researchers don't submit null results   │
│                    ↓                    │
│ Career impact: Need publications        │
└─────────────────────────────────────────┘
The published literature is a biased sample of all research.
UNDERPOWERED STUDIES: Too Small to Detect Real Effects
STATISTICAL POWER:
┌─────────────────────────────────────────┐
│ Power = Probability of detecting real   │
│ effect                                  │
│                    ↓                    │
│ Depends on:                             │
│ • Sample size                           │
│ • Effect size                           │
│ • Significance threshold                │
└─────────────────────────────────────────┘
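Power is computable in closed form under a normal approximation to the two-sided, two-sample t-test. A minimal sketch; the effect size d = 0.4 and the group sizes below are assumed, illustrative values:

```python
# Approximate power of a two-sided, two-sample test
# (normal approximation; ignores the negligible opposite tail).
from scipy.stats import norm

def power(d, n_per_group, alpha=0.05):
    """d: standardized effect size; n_per_group: subjects per arm."""
    z_crit = norm.ppf(1 - alpha / 2)             # 1.96 for alpha = 0.05
    noncentrality = d * (n_per_group / 2) ** 0.5
    return 1 - norm.cdf(z_crit - noncentrality)

print(power(d=0.4, n_per_group=20))    # ~0.24: real effects usually missed
print(power(d=0.4, n_per_group=100))   # ~0.81: adequately powered
```

Twenty subjects per group—a common size in the older literature—detects a modest effect less than a quarter of the time.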
TYPICAL POWER IN PSYCHOLOGY:
┌─────────────────────────────────────────┐
│ Median power: ~35%                      │
│                    ↓                    │
│ Means: 65% chance of MISSING real       │
│ effect                                  │
│                    ↓                    │
│ And: High rate of false positives among │
│ "significant" results                   │
└─────────────────────────────────────────┘
THE PARADOX:
┌─────────────────────────────────────────┐
│ Low power + publication bias =          │
│                    ↓                    │
│ Published "significant" findings mostly │
│ false                                   │
│                    ↓                    │
│ Winner's curse: Published effect sizes  │
│ inflated                                │
└─────────────────────────────────────────┘
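The winner's curse follows directly: simulate many underpowered studies of a real but modest effect, keep only the "significant" ones the way publication bias does, and the surviving effect estimates are inflated. A sketch under assumed values (true effect 0.3, 25 subjects per group):

```python
# Winner's curse: conditioning on p < 0.05 inflates published effects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n = 0.3, 25              # assumed: modest effect, small samples

published = []
for _ in range(5000):
    treatment = rng.normal(true_effect, 1, n)
    control = rng.normal(0.0, 1, n)
    if stats.ttest_ind(treatment, control).pvalue < 0.05:  # "publishable"
        published.append(treatment.mean() - control.mean())

print(len(published) / 5000)   # ~0.18: the design's power
print(np.mean(published))      # ~0.7: more than double the true 0.3
```

Replications sized to detect the published 0.7 will then be far too small for the true 0.3, and the cycle repeats.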
WHY LOW POWER:
┌─────────────────────────────────────────┐
│ • Large samples expensive               │
│ • Pressure to publish quickly           │
│ • Multiple small studies > One large    │
│ study (for CVs)                         │
│                    ↓                    │
│ Incentives favor quantity over quality  │
└─────────────────────────────────────────┘
Studies too small to detect effects reliably.
PERVERSE INCENTIVES: Publish or Perish
THE ACADEMIC SYSTEM:
┌─────────────────────────────────────────┐
│ Career advancement requires:            │
│ • Publications (many)                   │
│ • In top journals                       │
│ • Novel findings                        │
│ • High citations                        │
│                    ↓                    │
│ Quality secondary to quantity           │
└─────────────────────────────────────────┘
THE PRESSURES:
┌─────────────────────────────────────────┐
│ Graduate students: Need papers to       │
│ graduate                                │
│                    ↓                    │
│ Postdocs: Need papers for job market    │
│                    ↓                    │
│ Assistant professors: Need papers for   │
│ tenure                                  │
│                    ↓                    │
│ Full professors: Need papers for grants │
│                    ↓                    │
│ Everyone: Publish or perish             │
└─────────────────────────────────────────┘
WHAT THIS INCENTIVIZES:
┌─────────────────────────────────────────┐
│ ✓ Fast publication (don't check         │
│ carefully)                              │
│ ✓ Novel results (surprising findings)   │
│ ✓ Positive results (negative don't      │
│ publish)                                │
│ ✓ Multiple papers from one dataset      │
│ (salami slicing)                        │
│ ✗ Replication (not novel)               │
│ ✗ Null results (not interesting)        │
│ ✗ Careful work (too slow)               │
│                    ↓                    │
│ System rewards bad practices            │
└─────────────────────────────────────────┘
A SPECIFIC EXAMPLE:
┌─────────────────────────────────────────┐
│ Diederik Stapel (psychology):           │
│ • 50+ fraudulent papers                 │
│ • Career built on fabricated data       │
│ • Caught because effects too good       │
│                    ↓                    │
│ System rewarded prolific fraud          │
│ (until exposed)                         │
└─────────────────────────────────────────┘
The system broke because incentives favor speed over accuracy.
PEER REVIEW FAILED
WHY PEER REVIEW DIDN'T CATCH THIS:
┌─────────────────────────────────────────┐
│ Reviewers don't:                        │
│ • See raw data                          │
│ • Replicate studies                     │
│ • Check statistical analyses thoroughly │
│ • Have time (unpaid work)               │
│                    ↓                    │
│ Peer review catches obvious errors      │
│                    ↓                    │
│ Doesn't catch:                          │
│ • P-hacking                             │
│ • HARKing                               │
│ • Publication bias                      │
│ • Underpowered studies                  │
│ • Subtle fraud                          │
└─────────────────────────────────────────┘
EXAMPLES OF FAILURES:
┌─────────────────────────────────────────┐
│ Stapel: 50+ fraudulent papers published │
│                    ↓                    │
│ Schön: dozens of fraudulent physics     │
│ papers                                  │
│                    ↓                    │
│ Wakefield: Vaccine-autism paper (fraud) │
│                    ↓                    │
│ All passed peer review                  │
└─────────────────────────────────────────┘
Peer review is a minimal filter, not rigorous verification.
(See Core #32 for full analysis)
THE CONSEQUENCES: What's at Stake
WASTED RESOURCES:
┌─────────────────────────────────────────┐
│ Pharmaceutical companies:               │
│ • Millions spent on drug development    │
│ • Based on unreplicable research        │
│ • Drugs fail in trials                  │
│                    ↓                    │
│ Taxpayer money:                         │
│ • Funding false findings                │
│ • Research building on false foundation │
└─────────────────────────────────────────┘
PATIENT HARM:
┌─────────────────────────────────────────┐
│ Medical treatments based on:            │
│ • Unreplicable studies                  │
│ • May not work                          │
│ • May be harmful                        │
│                    ↓                    │
│ Real people affected                    │
└─────────────────────────────────────────┘
LOST TRUST:
┌─────────────────────────────────────────┐
│ Public skepticism of science            │
│                    ↓                    │
│ "Studies show..." → "But will it        │
│ replicate?"                             │
│                    ↓                    │
│ Undermines scientific authority         │
└─────────────────────────────────────────┘
CAREER CASUALTIES:
┌─────────────────────────────────────────┐
│ Researchers who did careful work:       │
│ • Disadvantaged                         │
│                    ↓                    │
│ System rewarded bad science             │
│                    ↓                    │
│ Good scientists lost to broken          │
│ incentives                              │
└─────────────────────────────────────────┘
The crisis has real costs.
CAN SCIENCE FIX ITSELF?
REFORMS (Ongoing):
PREREGISTRATION:
┌─────────────────────────────────────────┐
│ Register hypothesis + analysis plan     │
│ BEFORE collecting data                  │
│                    ↓                    │
│ Prevents: HARKing, p-hacking            │
│                    ↓                    │
│ Adoption: Growing but still minority    │
└─────────────────────────────────────────┘
REGISTERED REPORTS:
┌─────────────────────────────────────────┐
│ Submit study design for peer review     │
│                    ↓                    │
│ If approved: Guaranteed publication     │
│ regardless of results                   │
│                    ↓                    │
│ Eliminates: Publication bias            │
│                    ↓                    │
│ Some journals adopting                  │
└─────────────────────────────────────────┘
OPEN DATA/CODE:
┌─────────────────────────────────────────┐
│ Make data and code publicly available   │
│                    ↓                    │
│ Allows: Verification, reanalysis        │
│                    ↓                    │
│ Catches errors, fraud                   │
│                    ↓                    │
│ Increasing adoption                     │
└─────────────────────────────────────────┘
REPLICATION STUDIES:
┌─────────────────────────────────────────┐
│ Some journals now publish replications  │
│                    ↓                    │
│ Grants for replication work             │
│                    ↓                    │
│ But: Still undervalued                  │
└─────────────────────────────────────────┘
LARGER SAMPLES:
┌─────────────────────────────────────────┐
│ Push for adequately powered studies     │
│                    ↓                    │
│ Collaborative large-scale studies       │
│                    ↓                    │
│ Expensive but necessary                 │
└─────────────────────────────────────────┘
Reforms are happening. Slowly.
CONCLUSION: Science's Self-Inflicted Crisis
The replication crisis wasn't caused by external attacks on science.
Scientists did this to themselves.
THE PROBLEM:
┌─────────────────────────────────────────┐
│ Incentive structure broke               │
│                    ↓                    │
│ Publish or perish                       │
│                    ↓                    │
│ Novel findings rewarded                 │
│                    ↓                    │
│ Replication undervalued                 │
│                    ↓                    │
│ Peer review insufficient                │
│                    ↓                    │
│ Result: Half of findings false          │
└─────────────────────────────────────────┘
THE LESSON:
┌─────────────────────────────────────────┐
│ Science's hardening created rigidity    │
│                    ↓                    │
│ Professionalization → Career pressures  │
│                    ↓                    │
│ Metrics → Gaming metrics                │
│                    ↓                    │
│ System optimized for publication        │
│                    ↓                    │
│ Not for truth                           │
└─────────────────────────────────────────┘
Can science fix itself?
Maybe. Reforms are happening:
- ✓ Preregistration
- ✓ Registered reports
- ✓ Open data
- ✓ Replication efforts
But the fundamental problem remains: career incentives.
Until replication is rewarded as much as novelty, until careful work is valued as much as fast publication, until truth matters more than metrics—the crisis will persist.
The replication crisis revealed:
Science's methods work—when practiced rigorously.
But the system incentivizes cutting corners.
And half the time, the corners cut lead to false conclusions.
Published. Peer-reviewed. Cited. Wrong.
That's the crisis.
[Cross-references: For peer review failures, see "Peer Review: The Flawed Mechanism That Still Works" (Core #32). For publish-or-perish pressures, see "When Science Became a Job: Professionalization" (Core #31) and "Publish or Perish: How Career Incentives Broke Science" (Core #43). For p-hacking details, see "P-Hacking and Statistical Fraud: Gaming the System" (Core #41). For reform efforts, see "Preregistration and Reforms: Can Science Fix Itself?" (Core #45). For specific fraud cases, see "Peer Review's Failure: Bias, Fraud, and Breakdown" (Core #42).]