Today, at the Cross Validated Journal Club (why weren't you there?), @Mbq asked:
Do you think that we (modern data scientists) know what significance means? And how does it relate to our confidence in our results?
@Michelle answered the way some of us (including me) usually do:
I find the concept of significance (based on p-values) less and less useful as my career progresses. For example, I can use datasets so large that everything comes out statistically significant.
This is probably a stupid question, but isn't the problem the hypothesis being tested? If you test the null hypothesis "A is equal to B", then you already know the answer is "no". Larger datasets will only bring you closer to that inevitably true conclusion. I believe it was Deming who gave an example with the hypothesis "the number of hairs on the right side of a lamb is equal to the number of hairs on its left side". Of course it isn't.
A better hypothesis would be "A does not differ from B by more than some given amount." Or, in the lamb example, "the number of hairs on the two sides of a lamb does not differ by more than X%."
Does that make sense?
source
Answers:
As far as significance testing goes (or anything else that does essentially the same thing as significance testing), I have long thought that the best approach in most situations is likely to be estimating a standardized effect size, with a 95% confidence interval about that effect size. There's nothing really new there (mathematically you can shuffle back and forth between them): if the p-value for a 'nil' null is < .05, then 0 will lie outside of a 95% CI, and vice versa. The advantage of this, in my opinion, is psychological; that is, it makes salient information that exists but that people can't see when only p-values are reported. For example, it is easy to see that an effect is wildly 'significant' but ridiculously small, or 'non-significant' but only because the error bars are huge whereas the estimated effect is more or less what you expected. These can be paired with raw values and their CIs.
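For concreteness, a minimal R sketch of that kind of report (not part of the original answer; the data are simulated and the standard error for Cohen's d is the usual large-sample approximation) might look like this:

    ## Simulated data, purely for illustration
    set.seed(1)
    x <- rnorm(50, mean = 0.2, sd = 1)   # group 1
    y <- rnorm(50, mean = 0.0, sd = 1)   # group 2

    n1 <- length(x); n2 <- length(y)
    sp <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))  # pooled SD
    d  <- (mean(x) - mean(y)) / sp                                       # Cohen's d

    ## Approximate 95% CI for d (large-sample standard error)
    se_d <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
    ci_d <- d + c(-1, 1) * qnorm(0.975) * se_d
    round(c(d = d, lower = ci_d[1], upper = ci_d[2]), 2)

    ## Raw difference in means with its 95% CI (and, incidentally, a p-value)
    t.test(x, y)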
On the other hand, I think a bigger question is 'is the thing that significance testing does what we really want?' I think the real problem is that for most people analyzing data (i.e., practitioners, not statisticians), significance testing can become the entirety of data analysis. It seems to me that the most important thing is to have a principled way to think about what is going on with our data, and null hypothesis significance testing is, at best, a very small part of that. Let me give an imaginary example of an analyst, Bob (I acknowledge that this is a caricature, but unfortunately, I fear it is somewhat plausible): Bob collects his data, finds that they don't look the way he expected (they aren't normal), treats that purely as a technical obstacle to getting a valid p-value, reports whether the result is 'significant', and considers the analysis done.
I hope this doesn't come off as nasty. I don't mean to mock anyone, but I think something like this does happen occasionally. Should this scenario occur, we can all agree that it is poor data analysis. However, the problem isn't that the test statistic or the p-value is wrong; we can posit that the data were handled properly in that respect. I would argue that the problem is that Bob is engaged in what Cleveland called "rote data analysis". He appears to believe that the only point is to get the right p-value and thinks very little about his data outside of pursuing that goal. He could even have switched over to my suggestion above and reported a standardized effect size with a 95% confidence interval, and it wouldn't have changed what I see as the larger problem (this is what I meant by doing "essentially the same thing" by a different means). In this specific case, the fact that the data didn't look the way he expected (i.e., weren't normal) is real information; it's interesting, and very possibly important, but that information is essentially just thrown away. Bob doesn't recognize this, because of the focus on significance testing. To my mind, that is the real problem with significance testing.
Let me address a few other perspectives that have been mentioned, and I want to be very clear that I am not criticizing anyone.
For me, this is the core issue: What we really want is a principled way to think about what happened. What that means in any given situation is not cut and dried. How to impart that to students in a methods class is neither clear nor easy. Significance testing has a lot of inertia and tradition behind it. In a stats class, it's clear what needs to be taught and how. For students and practitioners it becomes possible to develop a conceptual schema for understanding the material, and a checklist / flowchart (I've seen some!) for conducting analysis. Significance testing can naturally evolve into rote data analysis without anyone being dumb or lazy or bad. That is the problem.
source
Why do we insist on any form of hypothesis test in statistics?
In the wonderful book Statistics as Principled Argument, Robert Abelson argues that statistical analysis is part of a principled argument about the subject in question. He says that, rather than evaluating results as hypotheses to be rejected or not rejected (or even accepted!?!), we should evaluate them based on what he calls the MAGIC criteria:
Magnitude - How big is it?
Articulation - Is it full of exceptions? Is it clear?
Generality - How generally does it apply?
Interestingness - Do we care about the result?
Credibility - Can we believe it?
My review of the book on my blog
source
Your last question not only makes sense: nowadays sensible industrial statisticians do not test for a significant difference but for significant equivalence, that is, they test a null hypothesis of the form H0: |μ1 − μ2| > ϵ, where ϵ is set by the user and is indeed related to the notion of "effect size". The most common equivalence test is the so-called TOST (two one-sided tests) procedure.
Nevertheless, the TOST strategy aims to show that two means μ1 and μ2 are significantly ϵ-close, for example when μ1 is the mean for one measurement method and μ2 the mean for another, and in many situations it is more sensible to assess the equivalence between the observations rather than between the means. To do so we could perform hypothesis tests on quantities such as Pr(|X1 − X2| > ϵ), and such hypothesis testing relates to tolerance intervals.
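As a minimal sketch of the TOST idea in base R (simulated data; the equivalence margin ϵ here is an arbitrary choice for illustration), equivalence within ±ϵ is claimed only if both one-sided tests reject:

    set.seed(2)
    x1 <- rnorm(40, mean = 10.0, sd = 1)   # measurements from method 1 (simulated)
    x2 <- rnorm(40, mean = 10.1, sd = 1)   # measurements from method 2 (simulated)
    eps <- 0.5                             # equivalence margin, set by the user

    ## Two one-sided tests: H0: diff <= -eps  and  H0: diff >= +eps
    lower <- t.test(x1, x2, mu = -eps, alternative = "greater")
    upper <- t.test(x1, x2, mu = +eps, alternative = "less")

    ## Claim equivalence (at the 5% level) only if both p-values are below 0.05
    c(p_lower = lower$p.value, p_upper = upper$p.value)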
source
Traditional hypothesis tests tell you whether there is statistically significant evidence for the existence of an effect, whereas what we often want to know is whether there is evidence of a practically significant effect.
It is certainly possible to form Bayesian "hypothesis tests" with a minimum effect size (IIRC there is an example of this in David MacKay's book "Information Theory, Inference and Learning Algorithms"; I'll look it up when I have a moment).
Normality testing is another good example: we usually know that the data are not really normally distributed; we are just testing to see whether there is evidence that this isn't a reasonable approximation. Or testing for the bias of a coin: we know it is unlikely to be exactly unbiased, as it is asymmetric.
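As a toy sketch of such a Bayesian check with a minimum effect size (an illustration, not MacKay's actual example; the flip counts and the margin are made up), the coin case could look like this in R:

    heads <- 5400; tails <- 4600   # hypothetical flip counts
    margin <- 0.02                 # smallest bias we would actually care about

    ## Beta(1, 1) prior on the bias theta gives a Beta posterior
    a <- 1 + heads; b <- 1 + tails

    ## Posterior probability that the bias is practically meaningful:
    ## Pr(|theta - 0.5| > margin | data)
    pbeta(0.5 - margin, a, b) + (1 - pbeta(0.5 + margin, a, b))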
source
A lot of this comes down to what question you are actually asking, how you design your study, and even what you mean by equal.
I ran across an interesting little insert in the British Medical Journal once that talked about what people interpreted certain phrases to mean. It turns out that "always" can mean that something happens as low as 91% of the time (BMJ, Volume 333, 26 August 2006, page 445). So maybe equal and equivalent (or within X% for some value of X) could be thought to mean the same thing. And let's ask the computer a simple equality, using R:
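The original snippet is not shown here, but a comparison of the two amounts discussed below would be something like:

    (1e+5 + 1e-50) == (1e+5 - 1e-50)
    ## [1] TRUE  (1e-50 is far below double-precision resolution at this magnitude)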
Now a pure mathematician using infinite precision might say that those two values are not equal, but R says they are, and for most practical cases they would be (if you offered to give me $(1e+5 + 1e-50), but the amount ended up being $(1e+5 − 1e-50), I would not refuse the money because it differed from what was promised).
Further, if our alternative hypothesis is Ha: μ > μ0, we often write the null as H0: μ = μ0 even though technically the real null is H0: μ ≤ μ0; we work with the equality as the null because if we can show that μ is bigger than μ0, then we also know it is bigger than all the values less than μ0. And isn't a two-tailed test really just two one-tailed tests? After all, would you really say that μ ≠ μ0 but refuse to say which side of μ0 the true μ is on? This is partly why there is a trend toward using confidence intervals in place of p-values when possible: if my confidence interval for μ includes μ0, then while I may not be willing to believe that μ is exactly equal to μ0, I cannot say for certain which side of μ0 it lies on, which means they might as well be equal for practical purposes.
A lot of this comes down to asking the right question and designing the right study for that question. If you end up with enough data to show that a practically meaningless difference is statistically significant, then you have wasted resources getting that much data. It would have been better to decide what a meaningful difference would be and to design the study to give you enough power to detect that difference, but nothing smaller.
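For example, a back-of-the-envelope sizing with base R's power.t.test (the numbers are assumptions for illustration) answers exactly that design question:

    ## How many observations per group to detect, with 90% power, the smallest
    ## difference we consider practically meaningful (delta), and nothing smaller?
    power.t.test(delta = 0.5,        # smallest meaningful difference (assumed)
                 sd = 1,             # assumed standard deviation
                 sig.level = 0.05,
                 power = 0.90,
                 type = "two.sample")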
And if we really want to split hairs: how do we define which parts of the lamb are on the right and which are on the left? If we define it by a line that by definition has an equal number of hairs on each side, then the answer to the above question becomes "Of course it is".
source
From an organisational perspective, be it a government with policy options or a company looking to roll out a new process or product, a simple cost-benefit analysis can help too. I have argued in the past that (ignoring political reasons), given the known cost of a new initiative, one should ask what the break-even point is for the number of people who must be affected positively by that initiative. For example, if the new initiative is to get more unemployed people into work, and the initiative costs $100,000, does it achieve a reduction in unemployment transfers of at least $100,000? If not, then the effect of the initiative is not practically significant.
For health outcomes, the value of a statistical life takes on importance. This is because health benefits are accrued over a lifetime (and therefore the benefits are adjusted downwards in value based on a discount rate). So then, instead of statistical significance, one gets arguments over how to estimate the value of a statistical life, and what discount rate should apply.
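A toy version of that break-even check (the per-person saving is a made-up figure, used only to show the arithmetic):

    cost <- 100000                 # cost of the initiative, as in the example above
    saving_per_person <- 8000      # hypothetical annual reduction in transfers per person helped
    ceiling(cost / saving_per_person)   # number of people needed to break even (13 here)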
source