Modern science produces an enormous volume of results, but volume is not the same thing as truth. The uncomfortable part is that many published claims are fragile, not because researchers are incompetent, but because the system rewards novelty, speed, and clean stories. When you combine small samples, flexible analysis choices, and selective reporting, you can generate results that look statistically convincing and still fail to hold up.
John Ioannidis made this argument famous by formalizing why many published findings are likely to be false positives when power is low, bias exists, and many hypotheses are tested. His paper is worth reading because it explains the failure as a predictable outcome of incentives and statistics, not as a moral flaw. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124 
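To see how the arithmetic plays out, here is a minimal sketch of the positive predictive value calculation behind that argument. The prior, power, and threshold numbers are illustrative assumptions, not estimates from any particular field.

```python
# Back-of-the-envelope positive predictive value (PPV), in the spirit of the argument.
# All numbers below are illustrative assumptions, not estimates from any field.

def ppv(prior, power, alpha):
    """Probability that a 'significant' result reflects a true effect."""
    true_positives = prior * power          # true hypotheses that reach significance
    false_positives = (1 - prior) * alpha   # null hypotheses that cross the threshold anyway
    return true_positives / (true_positives + false_positives)

# Suppose 10% of tested hypotheses are true, studies have 30% power,
# and the discovery threshold is p < 0.05.
print(ppv(prior=0.10, power=0.30, alpha=0.05))  # ~0.40
```

Under those assumed numbers, even before adding bias or flexible analysis, most "discoveries" are false.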
One of the simplest sources of fragility is analytical flexibility. If you try multiple ways to clean data, multiple outcomes, multiple covariate sets, multiple subgroup definitions, and you only report what worked, you can unintentionally convert noise into a publishable story. Preregistration exists to make that flexibility visible by separating what you planned from what you discovered after looking at the data. Brian Nosek and colleagues describe preregistration as a way to preserve the distinction between prediction and postdiction, which improves calibration and credibility. https://www.pnas.org/doi/10.1073/pnas.1708274114 
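Here is a small simulation of what that flexibility does. The data are pure noise, and the analysis variants are illustrative stand-ins for the kinds of choices a reasonable researcher might try; keeping whichever one "works" pushes the false positive rate well past the nominal 5%.

```python
# Simulated analytical flexibility: the data contain no true effect, but trying
# several plausible specifications and keeping the best one inflates false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def specs(x, y):
    """A few analysis variants a researcher might plausibly try (illustrative)."""
    yield stats.ttest_ind(x, y).pvalue                                  # full sample
    yield stats.ttest_ind(x[np.abs(x) < 2], y[np.abs(y) < 2]).pvalue    # trim "outliers"
    yield stats.ttest_ind(x[:25], y[:25]).pvalue                        # post hoc "subgroup"
    yield stats.mannwhitneyu(x, y).pvalue                               # switch the test

n_sims, hits = 2000, 0
for _ in range(n_sims):
    x, y = rng.normal(size=50), rng.normal(size=50)   # no true effect anywhere
    if min(specs(x, y)) < 0.05:
        hits += 1

print(f"'significant' in {hits / n_sims:.0%} of null datasets")  # well above 5%
```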
Another source of fragility is multiple testing. In genomics, imaging, and many modern fields, you are often testing thousands to millions of hypotheses at once. If you treat p less than 0.05 as a discovery threshold without correction, you will generate a large number of false positives by construction. Benjamini and Hochberg’s false discovery rate procedure became foundational because it offers a practical way to control false discoveries while keeping more power than family-wise error approaches in many settings. https://academic.oup.com/jrsssb/article/57/1/289/7035855 
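For intuition, here is a minimal from-scratch sketch of the Benjamini-Hochberg step-up procedure. In real work you would reach for a vetted implementation such as statsmodels' multipletests with method "fdr_bh"; this version just makes the mechanics visible.

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure for FDR control.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of discoveries, controlling the FDR at level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)                        # ranks: smallest p-value first
    thresholds = q * np.arange(1, m + 1) / m     # (k / m) * q for rank k
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()           # largest rank meeting its bound
        reject[order[: k + 1]] = True            # reject everything up to that rank
    return reject

# Toy example: a few small p-values among larger ones.
pvals = [0.0001, 0.0004, 0.003, 0.04, 0.2, 0.5, 0.7, 0.9]
print(benjamini_hochberg(pvals, q=0.05))  # first three are declared discoveries
```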
This is why the scientific community periodically revisits what we mean by statistical evidence. A prominent proposal argued that the default p-value threshold for claims of new discoveries should be more stringent, moving from 0.05 to 0.005, specifically to reduce false positives in fields where marginal findings dominate. You do not have to agree with the exact cutoff to see the point: weak evidence combined with flexible workflows produces a lot of claims that do not replicate. https://pubmed.ncbi.nlm.nih.gov/30980045/ 
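The same back-of-the-envelope arithmetic as above shows what the stricter threshold buys you. The prior, power, and test counts here are assumptions chosen to illustrate the trade-off, and the sketch holds power fixed, which real studies would not.

```python
# Illustrative arithmetic for the 0.005 proposal. The numbers are assumptions.
# Power is held fixed for simplicity; in practice a stricter alpha also costs
# power unless sample sizes grow.
prior, power, n_tests = 0.10, 0.50, 1000

for alpha in (0.05, 0.005):
    true_pos = n_tests * prior * power
    false_pos = n_tests * (1 - prior) * alpha
    print(f"alpha={alpha}: expect ~{false_pos:.1f} false vs ~{true_pos:.0f} true positives")
```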
The practical takeaway is not that science is broken. The practical takeaway is that reliability is an engineering problem. If you want results that survive contact with reality, you need routines that reduce researcher degrees of freedom, document what you did, and penalize yourself for storytelling. Preregister what you can. Use multiple testing control when you are in high dimensional hypothesis space. Report effect sizes and uncertainty, not just significance. Treat exploratory findings as exploratory, and demand replication or external validation before you talk as if something is established.
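As a concrete example of the reporting habit, here is a small sketch that reports a mean difference with a bootstrap confidence interval and a standardized effect size alongside the p-value. The data are simulated; the point is the shape of the report, not the numbers.

```python
# Reporting an effect size with uncertainty instead of a bare p-value (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(loc=0.4, scale=1.0, size=80)
control = rng.normal(loc=0.0, scale=1.0, size=80)

diff = treated.mean() - control.mean()
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

# Bootstrap 95% confidence interval for the mean difference.
boot = [rng.choice(treated, treated.size).mean() - rng.choice(control, control.size).mean()
        for _ in range(5000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

p = stats.ttest_ind(treated, control).pvalue
print(f"mean difference {diff:.2f} (95% CI {lo:.2f} to {hi:.2f}), d = {cohens_d:.2f}, p = {p:.3f}")
```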
There is a quiet benefit to adopting these habits. They reduce the cognitive load of research. When your workflow forces you to label what was planned, what was exploratory, and what is robust to reasonable variations, you stop arguing with yourself and you stop arguing with reviewers about vibes. You can point to a clean chain of reasoning and a documented path from question to result. That is what credible science looks like in a world where anyone can generate impressive-looking analyses in an afternoon.