The ASA's Statement on p-Values: Context, Process, and Purpose
In February 2014, George Cobb, Professor Emeritus of Mathematics and Statistics at Mount Holyoke College, posed these questions to an ASA discussion forum:
Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that's still what the scientific community and journal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that's what they were taught in college or grad school.
Cobb's concern was a long-worrisome circularity in the sociology of science based on the use of bright lines such as p < 0.05: “We teach it because it's what we do; we do it because it's what we teach.” This concern was brought to the attention of the ASA Board.
The ASA Board was also stimulated by highly visible discussions over the last few years. For example, ScienceNews (Siegfried) wrote: “It's science's dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation.” A November 2013 article in the Phys.org Science News Wire cited “numerous deep flaws” in null hypothesis significance testing. A ScienceNews article (Siegfried) on February 7, 2014, said “statistical techniques for testing hypotheses…have more flaws than Facebook's privacy policies.” A week later, statistician and “Simply Statistics” blogger Jeff Leek responded. “The problem is not that people use P-values poorly,” Leek wrote, “it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis” (Leek). That same week, statistician and science writer Regina Nuzzo published an article in Nature entitled “Scientific Method: Statistical Errors” (Nuzzo). That article is now one of the most highly viewed Nature articles, as reported by altmetric.com (http://www.altmetric.com/details/2115792#score).
[…]
3. Principles
1. P-values can indicate how incompatible the data are with a specified statistical model.
A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.
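The notion of a p-value as a measure of incompatibility can be sketched with a small simulation. The two groups of measurements below are invented, illustrative numbers, not data from any study in the statement. Under the null model that the group labels are exchangeable (no difference between groups), the permutation p-value is simply the fraction of relabelings that produce a mean difference at least as extreme as the one observed:

```python
import random

random.seed(1)

# Hypothetical measurements from two groups (illustrative numbers only).
group_a = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.2, 5.0]
group_b = [4.6, 4.9, 4.4, 4.7, 4.5, 4.8, 4.3, 4.6]

observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

# Null model: group labels are exchangeable. The p-value is the share of
# label shufflings whose mean difference is at least as extreme as observed.
pooled = group_a + group_b
n_a = len(group_a)
n_reps = 10_000
count = 0
for _ in range(n_reps):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
    if diff >= observed:
        count += 1

p_value = count / n_reps
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value here says only that data this discrepant would rarely arise under the exchangeability model and its assumptions, nothing more.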
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.
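One way to see the gap between a p-value and the probability that the null is true is a simulation over many hypothetical experiments. The 90% true-null rate and the 0.5 standardized effect below are assumptions chosen purely for illustration. Among results reaching p < 0.05, far more than 5% can come from experiments where the null was in fact true:

```python
import math
import random

random.seed(2)

def p_two_sided(z):
    # Two-sided standard normal tail probability via the error function.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))

n = 25                  # observations per experiment
n_experiments = 20_000
true_null_rate = 0.9    # assumed: 90% of studied effects are truly zero
effect = 0.5            # assumed standardized effect when one exists

sig_total = 0
sig_and_null = 0
for _ in range(n_experiments):
    null_is_true = random.random() < true_null_rate
    mu = 0.0 if null_is_true else effect
    sample_mean = random.gauss(mu, 1.0 / math.sqrt(n))
    z = sample_mean * math.sqrt(n)   # one-sample z-statistic, known unit sd
    if p_two_sided(z) < 0.05:
        sig_total += 1
        sig_and_null += null_is_true

fdp = sig_and_null / sig_total
print(f"share of 'p < 0.05' results where the null was actually true: {fdp:.2f}")
```

Under these assumed rates the share is well above 0.05, which is exactly why "p < 0.05" cannot be read as "the null has less than a 5% chance of being true."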
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making. A conclusion does not immediately become “true” on one side of the divide and “false” on the other. Researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis. Pragmatic considerations often require binary, “yes-no” decisions, but this does not mean that p-values alone can ensure that a decision is correct or incorrect. The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.
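The arbitrariness of a bright line can be illustrated with two hypothetical studies whose test statistics differ only trivially; the z-values below are invented for illustration. The evidence in the two studies is essentially identical, yet a mechanical 0.05 rule declares them opposites:

```python
import math

def p_two_sided(z):
    # Two-sided standard normal tail probability via the error function.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))

# Two hypothetical studies with nearly identical test statistics.
p1 = p_two_sided(1.97)   # lands just under 0.05
p2 = p_two_sided(1.95)   # lands just over 0.05
print(f"study 1: p = {p1:.4f} -> 'significant' under a 0.05 rule")
print(f"study 2: p = {p2:.4f} -> 'not significant' under the same rule")
```

Nothing about the underlying phenomena distinguishes the two studies; only the bright-line rule does.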
4. Proper inference requires full reporting and transparency.
P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. Cherry-picking promising findings, also known by such terms as data dredging, significance chasing, significance questing, selective inference, and “p-hacking,” leads to a spurious excess of statistically significant results in the published literature and should be vigorously avoided. One need not formally carry out multiple statistical tests for this problem to arise: Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed. Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting.
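The cost of selective reporting can be sketched with a simulation in which every null hypothesis is true; the choice of 20 analyses per study is an illustrative assumption. When only the smallest of many p-values is reported, a “significant” result becomes highly likely even though there is nothing to find:

```python
import math
import random

random.seed(3)

def p_two_sided(z):
    # Two-sided standard normal tail probability via the error function.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))

# Every one of the 20 "analyses" is pure noise: all null hypotheses are true.
n_analyses = 20
n_sims = 5_000
hits = 0
for _ in range(n_sims):
    p_values = [p_two_sided(random.gauss(0.0, 1.0)) for _ in range(n_analyses)]
    if min(p_values) < 0.05:   # report only the 'best' result
        hits += 1

rate = hits / n_sims
print(f"chance of at least one p < 0.05 among {n_analyses} null tests: {rate:.2f}")
```

The simulated rate is close to the theoretical 1 − 0.95²⁰ ≈ 0.64, which is why a reader must know how many analyses were run and how the reported one was selected.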
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect. Any effect, no matter how tiny, can produce a small p-value if the sample size or measurement precision is high enough, and large effects may produce unimpressive p-values if the sample size is small or measurements are imprecise. Similarly, identical estimated effects will have different p-values if the precision of the estimates differs.
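The separation between p-values and effect size can be illustrated with a simple one-sample z-test; the effect sizes and sample sizes below are invented for illustration. A tiny effect with an enormous sample yields a minuscule p-value, a large effect with a small sample yields an unimpressive one, and the same estimated effect gives different p-values at different sample sizes:

```python
import math

def p_two_sided(z):
    # Two-sided standard normal tail probability via the error function.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))

def p_for(effect, n, sd=1.0):
    # One-sample z-test: p-value for a given estimated effect and sample size.
    return p_two_sided(effect / (sd / math.sqrt(n)))

print(f"tiny effect 0.01, n = 1,000,000: p = {p_for(0.01, 1_000_000):.2e}")
print(f"large effect 0.8, n = 5:         p = {p_for(0.8, 5):.4f}")
print(f"same effect 0.3 at n = 20 vs n = 500: "
      f"p = {p_for(0.3, 20):.3f} vs p = {p_for(0.3, 500):.2e}")
```

The p-value is a function of both the effect estimate and its precision, so it cannot stand in for either one alone.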
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Researchers should recognize that a p-value without context or other evidence provides limited information. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a p-value when other approaches are appropriate and feasible.
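That a large p-value is not evidence for the null can be sketched with an invented small study; the effect estimate and standard error below are hypothetical. Pairing the p-value with a confidence interval shows the data are consistent with no effect but equally consistent with substantial ones:

```python
import math

# Hypothetical small study: estimated effect 0.4 with standard error 0.4.
estimate, se = 0.4, 0.4

z = estimate / se
p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))
ci = (estimate - 1.96 * se, estimate + 1.96 * se)

print(f"p = {p:.2f}; 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# The large p-value does not support 'no effect': the interval contains
# zero but also effects as large as ~1.2, so the data are simply
# uninformative about which hypothesis is correct.
```

Reporting the interval alongside the p-value is one of the complementary approaches the statement points toward.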
