The Reproducibility Crisis: When Science Couldn't Replicate Itself
Amgen, 2012. Pharmaceutical company scientists attempt to replicate 53 "landmark" studies in cancer biology—papers published in top journals, cited hundreds of times, foundational to their field.
Their goal: Validate these findings before investing millions in drug development based on them.
They could reproduce only 6.
6 out of 53.
That's an 89% failure rate.
Not minor details. Not edge cases. Core findings. Published. Peer-reviewed. Highly cited.
2011, a year earlier: Bayer (another pharmaceutical giant) reports trying to replicate 67 published studies.
They replicate fewer than 25%.
2015, Open Science Collaboration attempts to replicate 100 psychology studies from top journals.
Only 36% replicate successfully.
This wasn't sloppiness. This was systemic.
Welcome to the reproducibility crisis—the moment science discovered it couldn't trust its own published results. When the hardening cracked. When falsification required not just testing hypotheses against nature, but testing other scientists' claims against reality.
The foundation of science: Results should be reproducible. If you follow the same methods, you should get the same findings. If you can't reproduce it, it's not real knowledge—it's noise.
The reality circa 2010s: Huge chunks of published science couldn't be reproduced. Not because nature changed. Because the methods were flawed, the statistics were gamed, the incentives were broken.
Let's examine how science's institutional structures created the reproducibility crisis, why it took so long to notice, which fields are hit hardest, and whether science can fix itself before its credibility collapses completely.
HOW DID WE NOT NOTICE THIS EARLIER?
THE TRUST ASSUMPTION (Pre-2000s)
OLD MODEL:

    Scientist publishes result
        ↓
    Peer review filters obvious errors
        ↓
    Journal publishes
        ↓
    Community accepts (mostly)
        ↓
    Nobody systematically tries to replicate
    (Why would you? It's published!)

WHY REPLICATION WAS RARE:

    • No incentive: can't publish "We confirmed X" (journals want novel findings)
    • No funding: grants go to new research, not validation
    • No credit: replication doesn't advance your career
    • Culture: questioning published work reads as hostile, uncollegial

    Result: published = assumed true

WHEN IT BREAKS:

    Amgen and Bayer need reproducible findings
    (can't develop drugs on false results)
        ↓
    They try to replicate systematically
        ↓
    Most attempts fail
        ↓
    The crisis becomes visible
Nobody was checking.
Not systematically. Not rigorously. Published meant true.
Until pharma companies started checking—and found most of it was wrong.
THE MECHANISMS: How False Results Get Published
PATH TO PUBLICATION (The Broken System)
STEP 1: P-HACKING

    Researcher collects data
        ↓
    Tests the hypothesis
        ↓
    Result not significant (p > 0.05)
        ↓
    Options:
    a) Report the null result (unpublishable)
    b) Try different analyses until something is significant
        ↓
    Try different subgroups, different variables, different statistical tests
        ↓
    Eventually find p < 0.05
        ↓
    Publish that one (hide the 20 failed attempts)

STEP 2: HARKing (Hypothesizing After the Results are Known)

    Collect data exploratively (no specific hypothesis)
        ↓
    Find an interesting pattern
        ↓
    Write the paper as if you predicted it: "We hypothesized that..."
        ↓
    Looks confirmatory (it isn't)

STEP 3: PUBLICATION BIAS

    Journals want "significant" findings
        ↓
    Null results (no effect) get rejected
        ↓
    Only positive findings are published
        ↓
    The literature becomes a biased sample
        ↓
    Giving the illusion that effects are real and strong

STEP 4: THE FILE-DRAWER PROBLEM

    10 labs test the same hypothesis
        ↓
    1 finds a significant effect (by chance)
        ↓
    That one publishes
        ↓
    The other 9 go in the file drawer (never published)
        ↓
    The literature shows "strong evidence" (really a 1-in-10 hit rate)

STEP 5: LOW STATISTICAL POWER

    Small sample sizes (n = 20)
        ↓
    Low power to detect real effects
        ↓
    So when a result does come out significant, it is
    disproportionately likely to be a false positive
        ↓
    Published "significant" results are often false
Every step selects for false positives.
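The p-hacking step can be made concrete with a simulation. This is a minimal sketch, not drawn from any study in the text: the 20 subgroups and n = 30 per group are assumed numbers, and a normal-approximation z-test stands in for a proper t-test.

```python
# Illustrative simulation: pure-noise data, but the analyst tests
# 20 subgroups and keeps the best p-value. All numbers are assumptions.
import math
import random

random.seed(1)

def z_test_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))

def p_hacked_study(n_subgroups=20, n=30):
    """One 'study': no real effect exists, but 20 slices of the data
    are tested and only the best p-value is kept."""
    best = 1.0
    for _ in range(n_subgroups):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]  # same distribution: the null is true
        best = min(best, z_test_p(a, b))
    return best

studies = [p_hacked_study() for _ in range(1000)]
fp = sum(p < 0.05 for p in studies) / len(studies)
print(f"studies reporting a 'significant' effect: {fp:.0%}")
# Roughly 1 - 0.95**20, about 64% of studies, find "something",
# despite zero real effects anywhere in the data.
```

Each individual test is honest; the selection among tests is what manufactures the finding.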
CASE STUDY 1: Psychology's Replication Failure
OPEN SCIENCE COLLABORATION (2015)
THE ATTEMPT:

    Select 100 studies from top psychology journals (2008 issues)
        ↓
    Replicate each with:
    • larger sample sizes
    • transparent methods
    • pre-registered hypotheses
        ↓
    Test: do we get the same results?

RESULTS:

    Original studies: 97% statistically significant
        ↓
    Replications: only 36% significant
        ↓
    Effect sizes: on average about half the original size
        ↓
    Conclusion: most published psychology effects are overstated or false

SPECIFIC EXAMPLES:

    "Power poses" (Amy Cuddy): high-impact finding (standing powerfully
    changes hormones and behavior). Multiple replications: failed.

    "Facial feedback hypothesis": smiling makes you happier.
    Large multi-lab replication: failed.

    "Ego depletion": willpower depletes like a muscle.
    Replications: mixed/weak.

WHY PSYCHOLOGY WAS HIT HARD:

    • Small effects (hard to detect)
    • Noisy measurements (human behavior)
    • Researcher degrees of freedom (many analysis choices)
    • Publication bias (only positive results published)
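The "effect sizes 50% smaller" result has a mechanical explanation, often called the winner's curse: when power is low and only significant results get published, the published estimates must overshoot the truth. A minimal simulation sketch; the true effect of 0.2 SD and n = 20 per group are illustrative assumptions, and a known-variance z-test stands in for a t-test.

```python
# Illustrative simulation of the "winner's curse": a small true effect,
# underpowered studies, and a significance filter before publication.
# TRUE_EFFECT and N are assumptions, not values from the studies above.
import math
import random

random.seed(2)

TRUE_EFFECT = 0.2   # true effect in standard-deviation units
N = 20              # per-group sample size

def one_study(n):
    """Run one study; return (estimated effect, reached p < 0.05?)."""
    a = [random.gauss(TRUE_EFFECT, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = math.sqrt(2 / n)                        # known-variance z-test for simplicity
    p = math.erfc(abs(diff / se) / math.sqrt(2))
    return diff, p < 0.05

results = [one_study(N) for _ in range(20000)]
published = [d for d, significant in results if significant]

print(f"power: {len(published) / len(results):.0%}")   # around 10%
print(f"true effect: {TRUE_EFFECT}")
print(f"mean published estimate: {sum(published) / len(published):.2f}")
# The significance filter roughly triples the apparent effect here, so a
# faithful, well-powered replication looks like the effect "shrank".
```

No misconduct is needed anywhere in this pipeline; the filter alone produces the shrinkage pattern the Open Science Collaboration observed.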
Famous findings. TED talks. Textbooks.
Couldn't replicate.
CASE STUDY 2: Biomedical Research—The Amgen Shock
BEGLEY & ELLIS (2012)
THE CONTEXT:

    Amgen is developing cancer drugs
        ↓
    It needs reproducible biological mechanisms to target
        ↓
    Strategy: build on published "landmark" studies

THE REPLICATION ATTEMPT:

    Select 53 "landmark" papers:
    • top journals (Nature, Cell, Science)
    • high citation counts
    • fundamental findings
        ↓
    Attempt exact replication
        ↓
    Even contact the original authors for clarification

RESULTS:

    Successfully reproduced: 6 of 53 (11%)
        ↓
    That is an 89% failure rate
        ↓
    Not "slightly different": the attempts completely failed
    to reproduce the core findings

BAYER (around the same time):

    67 published studies tested
        ↓
    Fewer than 25% reproduced
        ↓
    A failure rate above 75%

THE COST:

    Pharmaceutical companies waste millions developing drugs
    based on false biology
        ↓
    Patients: real treatments delayed
        ↓
    Science: credibility damaged
Cancer biology. The highest-stakes field.
Couldn't reproduce its own foundational findings.
WHY IT HAPPENED: The Incentive Problem
BROKEN INCENTIVE STRUCTURE
WHAT GETS REWARDED:

    • Novel findings (not confirmations)
    • Positive results (not null findings)
    • High-impact publications (Nature, Science)
    • Large effect sizes (dramatic claims)
    • Frequent publications (quantity)

CAREER CONSEQUENCES:

    PUBLISHING NULL RESULTS:
    • hard to publish
    • seen as "failure"
    • damages career prospects

    REPLICATING OTHERS:
    • rarely publishable
    • no grants
    • "wastes" career time

    BEING CAREFUL AND RIGOROUS:
    • fewer publications
    • slower output
    • career disadvantage against competitors

THE PRESSURE:

    "Publish or perish"
        ↓
    Tenure requires publications
        ↓
    Grants require preliminary data
        ↓
    Jobs require high-impact papers
        ↓
    Incentive: produce "significant" results by any means necessary

RESULT:

    Scientists optimize for:
    • publications (not truth)
    • impact factor (not validity)
    • novelty (not robustness)
        ↓
    The system selects for false positives
The system rewards producing dramatic, novel claims.
Not producing true, validated knowledge.
Scientists aren't fraudulent—they're responding rationally to perverse incentives.
WHICH FIELDS ARE WORST HIT?
REPRODUCIBILITY BY FIELD
CRISIS LEVEL BY DISCIPLINE:

    SEVERE CRISIS:
    • Psychology (~36% replication rate)
    • Biomedical research (~11-25%)
    • Cancer biology (~11%)
    • Preclinical research (very low)

    MODERATE CRISIS:
    • Economics (mixed results)
    • Neuroscience (improving)
    • Ecology (some problems)

    LESS AFFECTED:
    • Physics (higher replication rates)
    • Chemistry (better reproducibility)
    • Astronomy (shared data)

WHY THE VARIATION?

    PHYSICS AND CHEMISTRY DO BETTER BECAUSE OF:
    • simpler systems
    • cleaner measurements
    • stronger theory
    • a culture of replication
    • shared instruments and standards

    PSYCHOLOGY AND BIOLOGY DO WORSE BECAUSE OF:
    • complex systems
    • noisy measurements
    • weak theory
    • many researcher degrees of freedom
    • small samples

THE PATTERN:

    The harder the system is to study
        ↓
    the more flexibility researchers have in analysis
        ↓
    the more ways there are to p-hack
        ↓
    the worse the reproducibility crisis
"Soft" sciences hit hardest.
Not because scientists are worse—because the subject matter is harder, the methods less constrained, the room for self-deception larger.
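That room for self-deception can be quantified. With k independent analysis choices, each carrying a 5% false-positive risk, the chance of at least one spurious "finding" is 1 - (1 - 0.05)**k:

```python
# Family-wise false-positive rate: k independent analysis choices,
# each tested at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 20, 60):
    print(f"{k:>2} choices -> {1 - (1 - alpha) ** k:.0%} chance of a spurious 'finding'")
# 1 -> 5%, 5 -> 23%, 20 -> 64%, 60 -> 95%
```

A physicist measuring one pre-specified quantity sits at the top of that table; a behavioral researcher choosing among subgroups, covariates, and outcome measures sits near the bottom.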
THE FRAUD CASES: When It's Worse Than Sloppiness
FAMOUS FRAUD CASES (2000s-2010s)
DIEDERIK STAPEL (Social Psychology):

    Prolific, celebrated researcher
        ↓
    2011: exposed for fabricating data
        ↓
    Made up entire datasets from scratch
        ↓
    58 papers retracted
        ↓
    Nobody noticed for years (the data looked too good to check)

JAN HENDRIK SCHÖN (Physics):

    Published roughly 90 papers in about 3 years
        ↓
    Claimed "breakthroughs" in molecular electronics
        ↓
    2002: exposed as fabricated
        ↓
    28 papers retracted
        ↓
    Used the same graph for different experiments
    (the copy-paste reuse revealed the fraud)

YOSHITAKA FUJII (Anesthesiology):

    183 papers retracted
        ↓
    Fabricated patient data in clinical trials
        ↓
    Caught via statistical impossibilities in his reported results
        ↓
    The largest retraction count in medical history

HWANG WOO-SUK (Stem Cell Biology):

    Claimed a human cloning breakthrough
        ↓
    Published in Science (2004-2005)
        ↓
    2005-06: exposed; the data were fabricated
        ↓
    Set back legitimate stem cell research
These aren't the reproducibility crisis.
These are outright fraud.
But they reveal: If you fabricate data cleverly, peer review won't catch it. Only replication attempts do.
And nobody was replicating.
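How do "statistical impossibilities" like Fujii's get caught without a replication? One real, strikingly simple consistency check is the GRIM test (Brown & Heathers): if the underlying data are integers (counts, Likert ratings), a reported mean must equal some integer total divided by the sample size. A minimal sketch; the reported statistics below are made up for illustration.

```python
def grim_consistent(mean, n, decimals=2):
    """GRIM test: can a reported mean arise from n integer-valued responses?
    Any true mean must be (some integer total) / n; check whether one of
    the plausible totals rounds to the reported mean."""
    total = round(mean * n)
    return any(
        round(t / n, decimals) == round(mean, decimals)
        for t in (total - 1, total, total + 1)   # guard against edge rounding
    )

# Hypothetical reported statistics, invented for this example:
print(grim_consistent(3.48, 25))  # True: 87 / 25 = 3.48 is achievable
print(grim_consistent(3.51, 25))  # False: no integer total / 25 rounds to 3.51
```

Checks like this require nothing but the published summary statistics, which is why fabricated papers can sit unchallenged for years and then fall in batches once someone bothers to look.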
THE SOLUTIONS: Can Science Fix Itself?
REFORM EFFORTS (2010s-Present)
1. PRE-REGISTRATION:

    Register the hypothesis, methods, and analysis plan BEFORE collecting data
        ↓
    Prevents HARKing and p-hacking
        ↓
    Forces transparency
        ↓
    Status: growing adoption, not yet universal

2. OPEN DATA:

    Publish the raw data alongside the paper
        ↓
    Allows reanalysis and error detection
        ↓
    Journals increasingly require it
        ↓
    Status: improving, but incomplete

3. REPLICATION STUDIES:

    Journals publishing replication attempts
        ↓
    Funding for replication research
        ↓
    Career credit for replicating
        ↓
    Status: growing, but still limited

4. REGISTERED REPORTS:

    Journals accept or reject based on the methods, BEFORE the results are known
        ↓
    Removes publication bias
        ↓
    Null results get published
        ↓
    Status: adopted by some journals, growing

5. STATISTICAL REFORM:

    Stricter significance thresholds (p < 0.005 instead of p < 0.05)
        ↓
    Bayesian methods
        ↓
    Effect-size reporting
        ↓
    Status: debated, slowly being adopted

6. REPLICATION PLATFORMS:

    Coordinated multi-lab replication efforts:
    • the Many Labs projects
    • the Psychological Science Accelerator
        ↓
    Status: successful, but limited in scale
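The case for the stricter thresholds in the reform list above can be made concrete with positive predictive value: the share of "significant" findings that reflect a real effect, which depends on the threshold, statistical power, and the prior odds that tested hypotheses are true. A sketch with assumed numbers (10% of tested hypotheses true, 50% power), in the spirit of Ioannidis's "why most published research findings are false" argument:

```python
def ppv(prior, power, alpha):
    """Positive predictive value: the share of 'significant'
    findings that reflect a true effect."""
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)

# Assumed numbers: 10% of tested hypotheses are true, studies have 50% power.
for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: PPV = {ppv(0.10, 0.50, alpha):.0%}")
# alpha = 0.05  -> about half of "discoveries" are true
# alpha = 0.005 -> over 90% are true
```

Under these assumptions, the standard threshold means roughly a coin flip on whether a published "discovery" is real; tightening it changes the literature's base rate dramatically.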
Progress is happening.
But it's slow, patchy, and resisted.
THE RESISTANCE: Why Reform Is Hard
OBSTACLES TO REFORM
INDIVIDUAL LEVEL:

    • Career incentives unchanged (high-impact publications still required)
    • Transparency threatens competitive advantage
    • Admitting past practices were problematic threatens reputations

INSTITUTIONAL LEVEL:

    • Universities hire and promote based on publication metrics
    • Journals profit from sensational findings (they raise the impact factor)
    • Funding agencies reward productivity over rigor

CULTURAL LEVEL:

    • The "replication police" are seen as hostile
    • Questioning published work reads as attacking colleagues
    • Null results are treated as boring failures

THE TRAP:

    Researchers who adopt rigorous practices publish less
        ↓
    Publishing less damages careers
        ↓
    Rigorous researchers lose out to less rigorous competitors
        ↓
    The system selects against its own reform
Individual scientists can't fix it alone.
The incentive structure needs to change.
And institutions are slow to change.
WHAT'S AT STAKE: Science's Credibility
THE CONSEQUENCES
FOR SCIENCE:

    • Public trust declining
    • Ammunition for science deniers
    • Wasted research dollars
    • Delayed real discoveries
    • Textbooks teaching false findings

FOR MEDICINE:

    Drug development based on false biology
        ↓
    Clinical trials destined to fail
        ↓
    Delayed treatments for patients
        ↓
    Billions wasted

FOR SOCIETY:

    Policy based on irreproducible science
        ↓
    Education teaching false findings
        ↓
    Erosion of scientific authority
        ↓
    "If they can't replicate it, why trust science?"
The reproducibility crisis isn't just internal housekeeping.
It's an existential threat to science's social contract.
We fund science because it produces reliable knowledge.
If it doesn't produce reliable knowledge, why fund it?
CONCLUSION: Hardening Requires Checking
The reproducibility crisis reveals something fundamental about science:
Falsification isn't automatic.
Publishing a finding doesn't make it true. Peer review doesn't guarantee validity. Citations don't equal correctness.
Knowledge only hardens when it's tested repeatedly, skeptically, by independent researchers.
For 50+ years, science operated on trust: "Published = true."
That trust was misplaced.
The reforms:
- Pre-registration (commit to methods before results)
- Open data (transparency)
- Replication studies (systematic checking)
- Registered reports (remove publication bias)
- Statistical reform (higher standards)
These are working—slowly.
But the deeper problem remains: Science's incentive structure rewards novelty and productivity over truth and rigor.
Until that changes, the crisis continues.
The paradox:
Science has never been more sophisticated (CRISPR, AI, quantum computers, gravitational waves).
But we can't reproduce basic findings in psychology and cancer biology.
High-tech. Low-rigor.
The hardening of science required not just clever experiments, but institutional mechanisms to separate truth from error.
Those mechanisms broke.
And science is still figuring out how to fix them.
Because if science can't replicate itself, it's just expensive storytelling.
[Cross-references: For peer review as gatekeeper, see "When Journals Became Gatekeepers: Controlling Scientific Truth" (Core #42). For funding shaping research, see "When Funding Shaped Questions: Science as Investment" (Core #43). For statistical methods and p-values, see Mathematics Companion #134-136. For pharmaceutical research incentives, see "When Science Became Useful: Industrial Research" (Core #34). For fraud detection and scientific integrity, see "Flawed Mechanisms That Still Work: Error Correction in Science" (Core #32). For psychology's specific issues, see Biology Companion #110-111. For open science movements, see "What Comes After Falsification? New Epistemologies" (Core #48).]