#11
06-16-2015, 10:47 AM
campbell (Mary Pat Campbell, SOA AAA)

Yes, fishing for nonexistent patterns is pretty much a problem with stats.
#12
06-16-2015, 10:49 AM
campbell (Mary Pat Campbell, SOA AAA)

related:

https://www.sciencebasedmedicine.org...cance-testing/

Quote:
This is perhaps the first real crack in the wall for the almost-universal use of the null hypothesis significance testing procedure (NHSTP). The journal, Basic and Applied Social Psychology (BASP), has banned the use of NHSTP and related statistical procedures from their journal. They had previously stated that use of these statistical methods was no longer required but could optionally be included. Now they have proceeded to a full ban.

The type of analysis being banned is often called a frequentist analysis, and we have been highly critical in the pages of SBM of overreliance on such methods. This is the iconic p-value where <0.05 is generally considered to be statistically significant.

The process of hypothesis testing and rigorous statistical methods for doing so were worked out in the 1920s. Ronald Fisher developed the statistical methods, while Jerzy Neyman and Egon Pearson developed the process of hypothesis testing. They certainly deserve a great deal of credit for their role in crafting modern scientific procedures and making them far more quantitative and rigorous.

However, the p-value was never meant to be the sole measure of whether or not a particular hypothesis is true. Rather it was meant only as a measure of whether or not the data should be taken seriously. Further, the p-value is widely misunderstood. The precise definition is:

The p-value is the probability of obtaining an effect equal to or more extreme than the one observed, presuming the null hypothesis of no effect is true.

In other words, it is the probability of the data given the null hypothesis. However, it is often misunderstood to be the probability of the hypothesis given the data. The editors understand that the journey from data to hypothesis is a statistical inference, and one that in practice has turned out to be more misleading than informative. It encourages lazy thinking – if you reach the magical p-value then your hypothesis is true. They write:

In the NHSTP, the problem is in traversing the distance from the probability of the finding, given the null hypothesis, to the probability of the null hypothesis, given the finding. Regarding confidence intervals, the problem is that, for example, a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval. Rather, it means merely that if an infinite number of samples were taken and confidence intervals computed, 95% of the confidence intervals would capture the population parameter. Analogous to how the NHSTP fails to provide the probability of the null hypothesis, which is needed to provide a strong case for rejecting it, confidence intervals do not provide a strong case for concluding that the population parameter of interest is likely to be within the stated interval.
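
The coverage interpretation described above is easy to check by simulation. The sketch below is not from the quoted article; it assumes an arbitrary true mean, SD, and sample size, draws repeated samples, and counts how often the usual 95% t-interval contains the fixed true mean.

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sd, n, reps = 10.0, 2.0, 25, 10_000

tcrit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
covered = 0
for _ in range(reps):
    x = rng.normal(true_mean, sd, n)
    half_width = tcrit * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half_width <= true_mean <= x.mean() + half_width)

# About 95% of the intervals capture the fixed true mean; any single
# interval either contains it or does not.
print(covered / reps)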

Another problem with the p-value is that it is not highly replicable. This is demonstrated nicely by Geoff Cumming as illustrated with a video. He shows, using computer simulation, that if one study achieves a p-value of 0.05, this does not predict that an exact replication will also yield the same p-value. Using the p-value as the final arbiter of whether or not to accept or reject the null hypothesis is therefore highly unreliable.

Cumming calls this the “dance of the p-value,” because, as you can see in his video, when you repeat a virtual experiment with a phenomenon of known size, the p-values that result from the data collection dance all over the place.
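
The "dance" is easy to reproduce without the video. A minimal sketch (not from the quoted article), assuming a fixed true effect of 0.5 SD and 32 observations per group, both arbitrary choices: the same two-group experiment is repeated on fresh data, and the resulting p-values typically scatter from well below 0.01 to well above 0.05.

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pvals = []
for _ in range(25):
    a = rng.normal(0.0, 1.0, 32)   # control group
    b = rng.normal(0.5, 1.0, 32)   # treatment group, true effect = 0.5 SD
    pvals.append(stats.ttest_ind(a, b).pvalue)

print([round(p, 3) for p in sorted(pvals)])   # same experiment, wildly different p-values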

Regina Nuzzo, writing in Nature in 2014, echoes these concerns. She points out that if an experiment results in a p-value of 0.01, the probability of an exact replication also achieving a p-value of 0.01 (this all assumes perfect methodology and no cheating) is 50%, not 99% as many might falsely assume.
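
Nuzzo's 50% figure follows from a simple normal approximation. A back-of-the-envelope sketch (not from the quoted article), assuming the true effect exactly equals the estimate from a study that landed exactly on a two-sided p = 0.01 and that the replication uses the same design:

Code:
from scipy.stats import norm

z_obs = norm.ppf(1 - 0.01 / 2)        # z-score corresponding to two-sided p = 0.01
# Under these assumptions the replication's z-statistic is roughly Normal(z_obs, 1).
p_upper = norm.sf(z_obs - z_obs)      # P(replication z >= z_obs), i.e. P(N(0,1) >= 0)
p_lower = norm.cdf(-z_obs - z_obs)    # P(replication z <= -z_obs), negligible
print(round(p_upper + p_lower, 3))    # roughly 0.50, not 0.99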

The real-world problem is worse than these pure statistics would suggest, because of a phenomenon known as p-hacking. In 2011 Simmons et al. published a paper in Psychological Science in which they demonstrate that exploiting common researcher degrees of freedom can easily push the data (even innocently) over the threshold p-value of 0.05. They point out that published p-values cluster suspiciously around this 0.05 level, suggesting that some degree of p-hacking is going on.

This is also often described as torturing the data until it confesses. In a 2009 systematic review, 33.7% of scientists surveyed admitted to engaging in questionable research practices – such as those that result in p-hacking. The temptation is simply too great, and the rationalizations too easy – I’ll just keep collecting data until the p-value happens to drift below 0.05, and then stop. One might argue that overreliance on the p-value as a gold standard of what is publishable encourages p-hacking.
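
The "keep collecting data until it crosses 0.05" strategy is easy to simulate. A minimal sketch (not from the quoted article), assuming no true effect at all, a one-sample t-test against zero, and a peek after every batch of 10 observations up to 200, all illustrative choices: the fraction of runs declared "significant" comes out well above the nominal 5%.

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
reps, batch, max_n = 5_000, 10, 200

false_positives = 0
for _ in range(reps):
    x = np.empty(0)
    while x.size < max_n:
        x = np.concatenate([x, rng.normal(0.0, 1.0, batch)])   # null is true: the mean really is 0
        if stats.ttest_1samp(x, 0.0).pvalue < 0.05:            # peek, and stop at "significance"
            false_positives += 1
            break

print(false_positives / reps)   # far above the nominal 0.05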

#13
12-17-2016, 06:42 PM
campbell (Mary Pat Campbell, SOA AAA)

http://amstat.tandfonline.com/doi/ab...5.2016.1154108

Quote:
The ASA's Statement on p-Values: Context, Process, and Purpose


In February 2014, George Cobb, Professor Emeritus of Mathematics and Statistics at Mount Holyoke College, posed these questions to an ASA discussion forum:

Q: Why do so many colleges and grad schools teach p = 0.05?

A: Because that's still what the scientific community and journal editors use.

Q: Why do so many people still use p = 0.05?

A: Because that's what they were taught in college or grad school.

Cobb's concern was a long-worrisome circularity in the sociology of science based on the use of bright lines such as p < 0.05: “We teach it because it's what we do; we do it because it's what we teach.” This concern was brought to the attention of the ASA Board.

The ASA Board was also stimulated by highly visible discussions over the last few years. For example, ScienceNews (Siegfried) wrote: “It's science's dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation.” A November 2013 article in Phys.org Science News Wire cited “numerous deep flaws” in null hypothesis significance testing. A ScienceNews article (Siegfried) on February 7, 2014, said “statistical techniques for testing hypotheses…have more flaws than Facebook's privacy policies.” A week later, statistician and “Simply Statistics” blogger Jeff Leek responded. “The problem is not that people use P-values poorly,” Leek wrote, “it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis” (Leek). That same week, statistician and science writer Regina Nuzzo published an article in Nature entitled “Scientific Method: Statistical Errors” (Nuzzo). That article is now one of the most highly viewed Nature articles, as reported by altmetric.com (http://www.altmetric.com/details/2115792#score).

.....
3. Principles

P-values can indicate how incompatible the data are with a specified statistical model.

A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.
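
As a concrete, entirely made-up example of this principle (not part of the quoted statement), the sketch below specifies a model (two groups of normally distributed measurements with a common spread) and the null hypothesis of no mean difference, and reports the p-value as a summary of how surprising the observed difference would be under that combination.

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
control = rng.normal(0.0, 1.0, 40)   # model: normal measurements, same spread in both groups
treated = rng.normal(0.3, 1.0, 40)   # simulated true difference of 0.3 SD

t_stat, p_value = stats.ttest_ind(treated, control)
# The p-value measures only the incompatibility of these data with the
# "no difference" null under the assumed model; it says nothing about
# how probable the null itself is.
print(t_stat, p_value)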


P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.
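
A base-rate simulation makes the distinction concrete. This sketch is not part of the quoted statement and uses arbitrary numbers: 30 observations per group, a 0.5 SD effect when one exists, and real effects in only 10% of the hypotheses tested. Among the results with p < 0.05, the share for which the null is actually true comes out far larger than 5%, which is exactly the gap between P(data | null) and P(null | data).

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_experiments, prior_real = 30, 20_000, 0.10

n_sig, null_true_and_sig = 0, 0
for _ in range(n_experiments):
    effect_is_real = rng.random() < prior_real
    delta = 0.5 if effect_is_real else 0.0
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(delta, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        n_sig += 1
        null_true_and_sig += (not effect_is_real)

# Estimated P(null is true | p < 0.05): under these assumptions it lands
# near 0.5, nowhere near the 0.05 a naive reading would suggest.
print(null_true_and_sig / n_sig)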


Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making. A conclusion does not immediately become “true” on one side of the divide and “false” on the other. Researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis. Pragmatic considerations often require binary, “yes-no” decisions, but this does not mean that p-values alone can ensure that a decision is correct or incorrect. The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.


Proper inference requires full reporting and transparency.
P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. Cherry-picking promising findings, also known by such terms as data dredging, significance chasing, significance questing, selective inference, and “p-hacking,” leads to a spurious excess of statistically significant results in the published literature and should be vigorously avoided. One need not formally carry out multiple statistical tests for this problem to arise: Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed. Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting.
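
A small simulation (not part of the quoted statement, illustrative numbers only) shows why selective reporting ruins interpretability: 20 independent two-group comparisons are run on pure noise, on average about one of them falls below 0.05 by chance, and reporting only that one, without disclosing the other 19 analyses, turns noise into an apparent finding.

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_tests, n = 20, 30

pvals = []
for _ in range(n_tests):
    a = rng.normal(0.0, 1.0, n)   # both groups drawn from the identical distribution
    b = rng.normal(0.0, 1.0, n)
    pvals.append(stats.ttest_ind(a, b).pvalue)

print(min(pvals))                    # the cherry-pickable "result"
print(sum(p < 0.05 for p in pvals))  # how many cross the threshold by chance alone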


A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect. Any effect, no matter how tiny, can produce a small p-value if the sample size or measurement precision is high enough, and large effects may produce unimpressive p-values if the sample size is small or measurements are imprecise. Similarly, identical estimated effects will have different p-values if the precision of the estimates differs.
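
The sample-size point can be seen directly in a simulation (not part of the quoted statement; the 0.02 SD true difference and the sample sizes are illustrative). As the sample grows, the estimated effect stays practically negligible while the p-value eventually becomes very small.

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
tiny_effect = 0.02   # a practically meaningless true difference

for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(tiny_effect, 1.0, n)
    p = stats.ttest_ind(a, b).pvalue
    # The estimated effect stays near 0.02 throughout; the p-value collapses
    # once n is large enough.
    print(n, round(b.mean() - a.mean(), 4), p)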


By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Researchers should recognize that a p-value without context or other evidence provides limited information. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a p-value when other approaches are appropriate and feasible.