Effect size as the hypothesis for significance testing


Today, at the Cross Validated Journal Club (why weren't you there?), @mbq asked:

Do you think we (modern data scientists) know what significance means? And how does it relate to our confidence in our results?

@Michelle answered, as some (including me) usually do:

I find the concept of significance (based on p-values) less and less useful as my career progresses. For example, I can use data sets so extremely large that everything comes out statistically significant (p < .01).

This is probably a stupid question, but isn't the problem the hypothesis being tested? If you test the null hypothesis "A equals B", then you already know the answer is "no"; larger data sets will only bring you closer to that inevitably true conclusion. I believe it was Deming who gave an example with the hypothesis "the number of hairs on the right side of a lamb equals the number of hairs on its left side". Of course it doesn't.
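
To make the large-sample point concrete, here is a minimal R sketch (the numbers are entirely made up): a practically negligible difference in means still produces a tiny p-value once the sample is big enough.

# Illustrative only: a trivially small true difference becomes "significant"
# once the sample is large enough.
set.seed(1)
n <- 2e6                                # two million observations per group
a <- rnorm(n, mean = 100.0, sd = 15)
b <- rnorm(n, mean = 100.1, sd = 15)    # difference of 0.1, i.e. 1/150 of an SD
t.test(a, b)$p.value                    # far below .01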

A better hypothesis would be "A does not differ from B by more than some given amount." Or, in the lamb example, "the number of hairs on the two sides of a lamb does not differ by more than X%."

Does that make sense?

Carlos Accioly
1) Testing mean equivalence (assuming that is what you want) can in some cases be reduced to a significance test on the mean difference. With a standard error for that difference estimate, you can carry out all sorts of tests of the form "not different from B by more than ...". 2) Regarding sample size: yes, for large samples the importance of significance diminishes, but it remains crucial for smaller samples, where you cannot simply generate additional values.
Ondrej
Re "Of course it isn't." At a guess, a lamb has on the order of 105 hairs on each side. If there are an even number of such hairs and they are distributed randomly with equal chances on both sides and the sides are clearly delineated, then the chance that both numbers are exactly equal is 0.178%. In a large flock of several hundred, you should expect to see such a perfectly balanced lamb born at least once each decade (assuming an even number of hairs occurs about 50% of the time). Or: just about every old sheep farmer has had such a lamb!
whuber
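A quick numerical check of that 0.178% figure (a sketch assuming about $10^5$ hairs per side, assigned independently and fairly to each side, as in the comment):

m <- 1e5                              # assumed hairs per side
dbinom(m, size = 2 * m, prob = 0.5)   # exact chance both sides match: about 0.00178
1 / sqrt(pi * m)                      # normal approximation gives about the same
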
@whuber It is determined by the purpose of the analysis. A better analogy would be what is the minimum effect size that would justify further investment in a drug following a trial. Just the existence of a statistically significant effect is not enough, as developing a drug is expensive and there may be side-effects that need to be considered. It isn't a statistical question, but a practical one.
Dikran Marsupial
@whuber I suspect that in most applications where there is no practical information for deciding on a minimum effect size of interest, the standard hypothesis test is fine, for example testing for normality. As a Bayesian I would agree with viewing it as an optimisation problem rather than a hypothesis testing problem. Part of the problem with hypothesis tests results from the statistics cookbook approach, where tests are performed as a tradition without properly considering the purpose of the exercise, or the true meaning of the result (all IMHO of course).
Dikran Marsupial
@DikranMarsupial isn't the key there that the students are being taught tests by rote, as identified by gung below, rather than the importance of good study design? Would more of an emphasis on study design help solve some of the problem - not necessarily with big data sets?
Michelle

Answers:


As far as significance testing goes (or anything else that does essentially the same thing as significance testing), I have long thought that the best approach in most situations is likely to be estimating a standardized effect size, with a 95% confidence interval about that effect size. There's nothing really new there--mathematically you can shuffle back and forth between them--if the p-value for a 'nil' null is <.05, then 0 will lie outside of a 95% CI, and vice versa. The advantage of this, in my opinion, is psychological; that is, it makes salient information that exists but that people can't see when only p-values are reported. For example, it is easy to see that an effect is wildly 'significant' but ridiculously small, or 'non-significant' but only because the error bars are huge whereas the estimated effect is more or less what you expected. These can be paired with raw values and their CI's.

If raw values and their CI's are reported, do we still need a standardized effect such as $d = 1.6 \pm .5$? I tend to think that there can still be value in reporting both, and functions can be written to compute these so that it's very little extra work, but I recognize that opinions will vary. At any rate, I offer point estimates with confidence intervals replacing p-values as the first part of my response.
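To make this concrete, here is one way to compute a standardized effect size with an approximate 95% CI in R (the data are made up, and the large-sample standard error for $d$ is just one common choice, not something from the original answer):

# Minimal sketch: Cohen's d with an approximate 95% CI based on the usual
# large-sample standard error for d. All numbers are illustrative.
set.seed(42)
x <- rnorm(40, mean = 14, sd = 6)   # illustrative group 1
y <- rnorm(40, mean = 10, sd = 6)   # illustrative group 2
n1 <- length(x); n2 <- length(y)
sp <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))   # pooled SD
d  <- (mean(x) - mean(y)) / sp                                        # Cohen's d
se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))             # approx SE of d
c(d = d, lower = d - 1.96 * se, upper = d + 1.96 * se)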

On the other hand, I think a bigger question is 'is the thing that significance testing does what we really want?' I think the real problem is that for most people analyzing data (i.e., practitioners not statisticians), significance testing can become the entirety of data analysis. It seems to me that the most important thing is to have a principled way to think about what is going on with our data, and null hypothesis significance testing is, at best, a very small part of that. Let me give an imaginary example (I acknowledge that this is a caricature, but unfortunately, I fear it is somewhat plausible):

Bob conducts a study, gathering data on something-or-other. He expects the data will be normally distributed, clustering tightly around some value, and intends to conduct a one-sample t-test to see if his data are 'significantly different' from some pre-specified value. After collecting his sample, he checks to see if his data are normally distributed, and finds that they are not. Instead, they do not have a pronounced lump in the center but are relatively high over a given interval and then trail off with a long left tail. Bob worries about what he should do to ensure that his test is valid. He ends up doing something (e.g., a transformation, a non-parametric test, etc.), and then reports a test statistic and a p-value.

I hope this doesn't come off as nasty. I don't mean to mock anyone, but I think something like this does happen occasionally. Should this scenario occur, we can all agree that it is poor data analysis. However, the problem isn't that the test statistic or the p-value is wrong; we can posit that the data were handled properly in that respect. I would argue that the problem is that Bob is engaged in what Cleveland called "rote data analysis". He appears to believe that the only point is to get the right p-value, and thinks very little about his data outside of pursuing that goal. He could even have switched over to my suggestion above and reported a standardized effect size with a 95% confidence interval, and it wouldn't have changed what I see as the larger problem (this is what I meant by doing "essentially the same thing" by a different means). In this specific case, the fact that the data didn't look the way he expected (i.e., weren't normal) is real information; it's interesting, and very possibly important, but that information is essentially just thrown away. Bob doesn't recognize this, because of the focus on significance testing. To my mind, that is the real problem with significance testing.

Let me address a few other perspectives that have been mentioned, and I want to be very clear that I am not criticizing anyone.

  1. It is often mentioned that many people don't really understand p-values (e.g., thinking they're the probability the null is true), etc. It is sometimes argued that, if only people would use the Bayesian approach, these problems would go away. I believe that people can approach Bayesian data analysis in a manner that is just as incurious and mechanical. However, I think that misunderstanding the meaning of p-values would be less harmful if no one thought getting a p-value was the goal.
  2. The existence of 'big data' is generally unrelated to this issue. Big data only make it obvious that organizing data analysis around 'significance' is not a helpful approach.
  3. I do not believe the problem is with the hypothesis being tested. If people only wanted to see if the estimated value is outside of an interval, rather than if it's equal to a point value, many of the same issues could arise. (Again, I want to be clear I know you are not 'Bob'.)
  4. For the record, I want to mention that my own suggestion from the first paragraph does not address the issue, as I tried to point out.

For me, this is the core issue: What we really want is a principled way to think about what happened. What that means in any given situation is not cut and dried. How to impart that to students in a methods class is neither clear nor easy. Significance testing has a lot of inertia and tradition behind it. In a stats class, it's clear what needs to be taught and how. For students and practitioners it becomes possible to develop a conceptual schema for understanding the material, and a checklist / flowchart (I've seen some!) for conducting analysis. Significance testing can naturally evolve into rote data analysis without anyone being dumb or lazy or bad. That is the problem.

gung - Reinstate Monica
I like confidence intervals :) One question: did you mean to imply that post hoc calculation of effect size is okay?
Michelle
@Michelle, I'm not entirely sure what you mean by "post hoc", but probably. E.g., you gather some data, $\bar{x}_1 = 10$, $\bar{x}_2 = 14$ & $SD = 6$, then compute $d = .67$. Now, that's biased, and the simplest situation, but you get the idea.
gung - Reinstate Monica
Yes I think we are agreeing here.
Michelle
+1 The story of Bob reminds me of this: pss.sagepub.com/content/early/2011/10/17/0956797611417632
Carlos Accioly
+1 I prefer credible intervals myself. Regarding point 1 I would argue that Bayesian alternatives are less likely to result in rote data analysis, as the definition of a probability is not so counter-intuitive, which makes it much easier to formulate the question you actually want to ask in a statistical manner. The real problem is that performing the tests requires integrals, which are too difficult for such methods to be widely adopted. Hopefully software will develop to the point where the user can concentrate on formulating the question and leave the rest to the computer.
Dikran Marsupial

Why do we insist on any form of hypothesis test in statistics?

In the wonderful book Statistics as Principled Argument, Robert Abelson argues that statistical analysis is part of a principled argument about the subject in question. He says that, rather than evaluating results as hypotheses to be rejected or not rejected (or even accepted!?!), we should evaluate them based on what he calls the MAGIC criteria:

Magnitude: how big is it?
Articulation: is it clear? Is it full of exceptions?
Generality: how generally does it apply?
Interestingness: do we care about the result?
Credibility: can we believe it?

My review of the book on my blog

Peter Flom - Reinstate Monica
la source
The problem is fomented by some professors. My PhD is in psychometrics, which is in the psychology department. I heard professors from other parts of the department say things like "just report the p-value, that's what matters". My work is consulting, mostly with graduate students and researchers in social, behavioral, educational and medical fields. The amount of misinformation that is given by doctoral committees is astonishing.
Peter Flom - Reinstate Monica
+1 for "Why...", that's a big part of what I was trying to get at in my answer.
gung - Reinstate Monica
Another part of what I was trying to get at in my answer is that I think this happens naturally. Btw, no fair getting two upvotes ;-), you could combine these.
gung - Reinstate Monica

Your last question not only makes sense: nowadays sensible industrial statisticians do not test for a significant difference but for significant equivalence, that is, they test a null hypothesis of the form $H_0\colon \{|\mu_1 - \mu_2| > \epsilon\}$, where $\epsilon$ is set by the user and is indeed related to the notion of "effect size". The most common equivalence test is the so-called TOST (two one-sided tests). Nevertheless, the TOST strategy aims to prove that two means $\mu_1$ and $\mu_2$ are significantly $\epsilon$-close, for example when $\mu_1$ is the mean value for one measurement method and $\mu_2$ for another. In many situations it is more sensible to assess the equivalence between the observations rather than the means; to do so we could perform hypothesis tests on quantities such as $\Pr(|X_1 - X_2| > \epsilon)$, and such hypothesis tests relate to tolerance intervals.
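
To make the TOST idea concrete, here is a minimal R sketch using two one-sided t-tests (the data and the margin $\epsilon$ are illustrative assumptions; dedicated equivalence-testing packages exist as well):

# Sketch of TOST (two one-sided tests) for mean equivalence, |mu1 - mu2| < eps.
set.seed(123)
x   <- rnorm(50, mean = 10.0, sd = 2)
y   <- rnorm(50, mean = 10.1, sd = 2)
eps <- 1                                   # equivalence margin chosen by the user
p_lower <- t.test(x, y, mu = -eps, alternative = "greater")$p.value  # H0: diff <= -eps
p_upper <- t.test(x, y, mu =  eps, alternative = "less")$p.value     # H0: diff >= +eps
max(p_lower, p_upper)   # TOST p-value: equivalence is declared if this is below alpha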

Stéphane Laurent
(+1) And, welcome to 1000 reputation. Cheers.
cardinal

Traditional hypothesis tests tell you whether there is statistically significant evidence for the existence of an effect, whereas what we often want to know is whether there is evidence of a practically significant effect.

It is certainly possible to form Bayesian "hypothesis tests" with a minimum effect size (IIRC there is an example of this in David MacKay's book "Information Theory, Inference and Learning Algorithms"; I'll look it up when I have a moment).

Normality testing is another good example: we usually know that the data are not really normally distributed; we are just testing to see whether there is evidence that this isn't a reasonable approximation. Or testing for the bias of a coin: we know it is unlikely to be completely unbiased, as it is asymmetric.
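
As a rough illustration of the "minimum effect size" idea (this is not MacKay's example; it is a simple simulation sketch under a flat prior and a normal approximation to the posterior, with made-up data):

# Sketch: posterior probability that a mean difference exceeds the smallest
# effect size of interest, using a normal approximation to the posterior.
set.seed(7)
x <- rnorm(60, mean = 5.0, sd = 2)
y <- rnorm(60, mean = 5.6, sd = 2)
eps   <- 0.5                                        # smallest difference we care about
diff  <- mean(x) - mean(y)
se    <- sqrt(var(x) / length(x) + var(y) / length(y))
draws <- rnorm(1e5, mean = diff, sd = se)           # approximate posterior draws
mean(abs(draws) > eps)                              # Pr(|effect| > eps | data), roughly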

Dikran Marsupial

A lot of this comes down to what question you are actually asking, how you design your study, and even what you mean by equal.

I once ran across an interesting little insert in the British Medical Journal that talked about what people interpreted certain phrases to mean. It turns out that "always" can mean that something happens as low as 91% of the time (BMJ, Volume 333, 26 August 2006, page 445). So maybe equal and equivalent (or within X% for some value of X) could be thought to mean the same thing. And let's ask the computer a simple equality, using R:

> (1e+5 + 1e-50) == (1e+5 - 1e-50)
[1] TRUE

Now a pure mathematician using infinite precision might say that those 2 values are not equal, but R says they are, and for most practical cases they would be (if you offered to give me $(1e+5 + 1e-50) but the amount ended up being $(1e+5 - 1e-50), I would not refuse the money because it differed from what was promised).

Further, if our alternative hypothesis is $H_a\colon \mu > \mu_0$, we often write the null as $H_0\colon \mu = \mu_0$ even though technically the real null is $H_0\colon \mu \le \mu_0$; we work with the equality as the null since, if we can show that $\mu$ is bigger than $\mu_0$, then we also know that it is bigger than all the values less than $\mu_0$. And isn't a two-tailed test really just 2 one-tailed tests? After all, would you really say that $\mu \ne \mu_0$ but refuse to say which side of $\mu_0$ the value $\mu$ is on? This is partly why there is a trend towards using confidence intervals in place of p-values when possible: if my confidence interval for $\mu$ includes $\mu_0$, then while I may not be willing to believe that $\mu$ is exactly equal to $\mu_0$, I cannot say for certain which side of $\mu_0$ it lies on, which means they might as well be equal for practical purposes.
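
A small R illustration of that duality (made-up data): the two-sided p-value against $\mu_0$ exceeds .05 exactly when $\mu_0$ falls inside the 95% confidence interval.

# Sketch of the CI / p-value duality for a one-sample t-test.
set.seed(11)
z   <- rnorm(30, mean = 10.2, sd = 1)
mu0 <- 10
tt  <- t.test(z, mu = mu0)
tt$conf.int     # does the interval contain mu0?
tt$p.value      # exceeds .05 exactly when mu0 is inside the interval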

A lot of this comes down to asking the right question and designing the right study for that question. If you end up with enough data to show that a practically meaningless difference is statistically significant, then you have wasted resources getting that much data. It would have been better to decide what a meaningful difference would be and designed the study to give you enough power to detect that difference but not smaller.
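
For instance (illustrative numbers only), base R's power.t.test will give the sample size needed to detect the smallest difference judged meaningful, so the study is not over-powered for trivial effects:

# Sketch: size the study for the smallest practically meaningful difference
# (delta), rather than collecting so much data that trivial differences
# become "significant". The delta, sd, and power values are made up.
power.t.test(delta = 0.5, sd = 2, sig.level = 0.05, power = 0.8,
             type = "two.sample", alternative = "two.sided")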

And if we really want to split hairs, how do we define what parts of the lamb are on the right and which are on the left? If we define it by a line that by definition has equal number of hairs on each side then the answer to the above question becomes "Of Course it is".

Greg Snow
I suspect the answer you get from R is simply the result of a floating point arithmetic problem, not a conscious decision to disregard irrelevant differences. Consider the classic example (.1 + .2) == .3. A "pure mathematician" would tell you the two sides are equal, at any level of precision, yet R returns FALSE.
Gala
@GaëlLaurans, my point is that due to rounding (whether done consciously by a human or by a computer) the concepts of "exactly equal" and "within X%" for a sufficiently small X are practically the same.
Greg Snow

From an organisational perspective, be it a government with policy options or a company looking to roll out a new process/product, a simple cost-benefit analysis can help too. I have argued in the past that (ignoring political reasons), given the known cost of a new initiative, we should ask what the break-even point is for the number of people who must be positively affected by that initiative. For example, if the new initiative is to get more unemployed people into work, and the initiative costs $100,000, does it achieve a reduction in unemployment transfers of at least $100,000? If not, the effect of the initiative is not practically significant.

For health outcomes, the value of a statistical life takes on importance. This is because health benefits are accrued over a lifetime (and therefore the benefits are adjusted downwards in value based on a discount rate). So then instead of statistical significance, one gets arguments over how to estimate the value of a statistical life, and what discount rate should apply.

Michelle