If X
How to intuitively explain $(X^T X)^{-1}$
regression
variance
least-squares
Daniel Yefimov
Source
Answers:
Consider a simple regression with no constant term, where the single regressor is centered on its sample mean. Then $X'X$ is ($n$ times) its sample variance, and $(X'X)^{-1}$ its reciprocal. So the higher the variance (that is, the variability) of the regressor, the lower the variance of the coefficient estimator: the more variability we have in the explanatory variable, the more accurately we can estimate the unknown coefficient.
Why? Because the more a regressor varies, the more information it contains. When there are many regressors, this generalizes to the inverse of their variance-covariance matrix, which also takes into account the co-variability of the regressors. In the extreme case where $X'X$ is diagonal, the precision of each estimated coefficient depends only on the variance (variability) of the associated regressor, given the variance of the error term.
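As a rough numerical illustration of the single-regressor case, the sketch below compares two centered regressors with the same mean but different spread. The data values, the error variance `sigma2 = 1.0`, and the helper name `slope_variance` are assumptions made up for this example.

```python
from statistics import fmean, pvariance

def slope_variance(x, sigma2):
    """Variance of the OLS coefficient for a no-intercept regression on centered x.

    Returns (sigma2 / X'X, X'X), where X'X is a 1x1 matrix, i.e. a scalar.
    """
    mean = fmean(x)
    centered = [xi - mean for xi in x]
    xtx = sum(c * c for c in centered)
    return sigma2 / xtx, xtx

narrow = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]    # low variability (illustrative data)
wide = [0.0, 5.0, 10.0, 5.0, 0.0, 10.0]    # high variability, same mean

sigma2 = 1.0  # assumed error variance
var_narrow, xtx_narrow = slope_variance(narrow, sigma2)
var_wide, xtx_wide = slope_variance(wide, sigma2)

# X'X equals n times the (divide-by-n) sample variance of the regressor.
print(xtx_narrow, len(narrow) * pvariance(narrow))
# The more variable regressor gives the smaller coefficient variance.
print(var_narrow, var_wide)
```

The wide regressor has five times the spread of the narrow one, so its $X'X$ is 25 times larger and the coefficient variance is 25 times smaller.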
Source
A simple way to view $\sigma^2 (X^T X)^{-1}$ is as the matrix (multivariate) analogue of $\frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}$, which is the variance of the slope coefficient in simple OLS regression. One can even obtain $\frac{\sigma^2}{\sum_{i=1}^n X_i^2}$ for this variance by omitting the intercept from the model, that is, by fitting a regression through the origin.
From either of these formulas, one can see that greater variability in the predictor variable will generally lead to a more precise estimate of its coefficient. This idea is often exploited in the design of experiments, where, by choosing values for the (non-random) predictors, one tries to make the determinant of $(X^T X)$ as large as possible, the determinant being a measure of variability.
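To make the matrix analogy concrete, here is a minimal sketch for a model with an intercept and one predictor. It checks that the slope entry of $\sigma^2 (X^T X)^{-1}$ matches the scalar formula, and then compares the determinant of $X^T X$ for two candidate designs on the interval $[0, 1]$. The data, `sigma2`, and all helper names are illustrative assumptions.

```python
def xtx_with_intercept(x):
    """Return the 2x2 matrix X^T X for design columns [1, x]."""
    n = len(x)
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    return [[n, sx], [sx, sxx]]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a 2x2 matrix via the closed-form adjugate formula."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

sigma2 = 2.0                     # assumed error variance
x = [1.0, 2.0, 4.0, 7.0]         # illustrative predictor values

# Matrix form: sigma^2 (X^T X)^{-1}; entry [1][1] is the slope variance.
cov = [[sigma2 * v for v in row] for row in inv2(xtx_with_intercept(x))]

# Scalar form: sigma^2 / sum (x_i - xbar)^2.
xbar = sum(x) / len(x)
scalar = sigma2 / sum((xi - xbar) ** 2 for xi in x)
print(cov[1][1], scalar)

# Design comparison: four points spread evenly vs. pushed to the endpoints.
even = [0.0, 1 / 3, 2 / 3, 1.0]
extremes = [0.0, 0.0, 1.0, 1.0]
print(det2(xtx_with_intercept(even)), det2(xtx_with_intercept(extremes)))
```

For a straight-line model, placing the points at the ends of the allowed range gives the larger determinant, which is the design-of-experiments idea described above.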
Source
Does the linear transformation of a Gaussian random variable help? Use the rule that if $x \sim N(\mu, \Sigma)$, then $Ax + b \sim N(A\mu + b, A \Sigma A^T)$.
Assume that $Y = X\beta + \epsilon$ is the underlying model and $\epsilon \sim N(0, \sigma^2 I)$.
$$\begin{aligned} \therefore\; Y &\sim N(X\beta, \sigma^2 I) \\ X^T Y &\sim N(X^T X \beta, \sigma^2 X^T X) \\ (X^T X)^{-1} X^T Y &\sim N\left[\beta, \sigma^2 (X^T X)^{-1}\right] \end{aligned}$$
So $(X^T X)^{-1} X^T$ is just a complicated scaling matrix that transforms the distribution of $Y$.
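The last step can be checked numerically. The sketch below assumes $\operatorname{Cov}(Y) = \sigma^2 I$ and uses the standard rule $\operatorname{Cov}(AY) = A \operatorname{Cov}(Y) A^T$ with $A = (X^T X)^{-1} X^T$. The design matrix, `sigma2`, and the small matrix helpers are made up for illustration.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def inv2(m):
    """Inverse of a 2x2 matrix via the closed-form adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

sigma2 = 1.5                                            # assumed error variance
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]    # columns: intercept, x
Xt = transpose(X)

xtx_inv = inv2(matmul(Xt, X))
A = matmul(xtx_inv, Xt)          # the matrix that maps Y to beta_hat

# Cov(beta_hat) = A (sigma^2 I) A^T = sigma^2 A A^T
cov_beta = [[sigma2 * v for v in row] for row in matmul(A, transpose(A))]
target = [[sigma2 * v for v in row] for row in xtx_inv]

print(cov_beta)
print(target)
```

The two printed matrices agree up to floating-point error, because $A A^T = (X^T X)^{-1} X^T X (X^T X)^{-1} = (X^T X)^{-1}$.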
Hope that was helpful.
Source
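I'll take a different approach towards developing the intuition that underlies the formula $\operatorname{Var}\hat\beta = \sigma^2 (X'X)^{-1}$. When developing intuition for the multiple regression model, it's helpful to consider the bivariate linear regression model, viz., $$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n.$$
To help develop the intuition, we will assume that the simplest Gauss-Markov assumptions are satisfied: $x_i$ nonstochastic, $\sum_{i=1}^n (x_i - \bar{x})^2 > 0$ for all $n$, and $\varepsilon_i \sim \text{iid}(0, \sigma^2)$ for all $i = 1, \ldots, n$. As you already know very well, these conditions guarantee that $$\operatorname{Var}\hat\beta = \frac{1}{n}\sigma^2 (\operatorname{Var} x)^{-1},$$ where $\operatorname{Var} x = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$ is the sample variance of $x$. In words, the variance of $\hat\beta$ is inversely proportional to the sample size $n$, directly proportional to $\sigma^2$, and inversely proportional to the variance of $x$.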
Why should doubling the sample size, ceteris paribus, cause the variance of $\hat\beta$ to be cut in half? This result is intimately linked to the iid assumption applied to $\varepsilon$: since the individual errors are assumed to be iid, each observation should be treated ex ante as being equally informative. Doubling the number of observations then doubles the amount of information about the parameters that describe the (assumed linear) relationship between $x$ and $y$. Having twice as much information cuts the uncertainty about the parameters in half. Similarly, it should be straightforward to develop one's intuition as to why doubling $\sigma^2$ also doubles the variance of $\hat\beta$.
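Both scaling claims can be read off the simple-regression formula $\operatorname{Var}\hat\beta = \sigma^2 / \sum_i (x_i - \bar{x})^2$. The sketch below is only an illustration: the $x$ values are invented, and "doubling the sample size, ceteris paribus" is modeled by repeating the same $x$ values so that the sample variance of $x$ is unchanged.

```python
def slope_variance(x, sigma2):
    """Var(beta_hat) = sigma^2 / sum (x_i - xbar)^2 for simple OLS with intercept."""
    xbar = sum(x) / len(x)
    return sigma2 / sum((xi - xbar) ** 2 for xi in x)

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # illustrative regressor values

base = slope_variance(x, sigma2=1.0)
doubled_n = slope_variance(x + x, sigma2=1.0)    # same spread, twice the observations
doubled_noise = slope_variance(x, sigma2=2.0)    # same design, twice the error variance

print(base, doubled_n, doubled_noise)
```

Repeating the design doubles $\sum_i (x_i - \bar{x})^2$ and halves the variance, while doubling $\sigma^2$ doubles it.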
Let's turn, then, to your main question, which is about developing intuition for the claim that the variance of $\hat\beta$ is inversely proportional to the variance of $x$. To formalize notions, let us consider two separate bivariate linear regression models, called Model (1) and Model (2) from now on. We will assume that both models satisfy the assumptions of the simplest form of the Gauss-Markov theorem and that the models share the exact same values of $\alpha$, $\beta$, $n$, and $\sigma^2$. Under these assumptions, it is easy to show that $E\hat\beta^{(1)} = E\hat\beta^{(2)} = \beta$; in words, both estimators are unbiased. Crucially, we will also assume that whereas $\bar{x}^{(1)} = \bar{x}^{(2)} = \bar{x}$, $\operatorname{Var} x^{(1)} \neq \operatorname{Var} x^{(2)}$. Without loss of generality, let us assume that $\operatorname{Var} x^{(1)} > \operatorname{Var} x^{(2)}$. Which estimator of $\beta$ will have the smaller variance? Put differently, will $\hat\beta^{(1)}$ or $\hat\beta^{(2)}$ be closer, on average, to $\beta$?
From the earlier discussion, we have $\operatorname{Var}\hat\beta^{(k)} = \frac{1}{n}\sigma^2 / \operatorname{Var} x^{(k)}$ for $k = 1, 2$. Because $\operatorname{Var} x^{(1)} > \operatorname{Var} x^{(2)}$ by assumption, it follows that $\operatorname{Var}\hat\beta^{(1)} < \operatorname{Var}\hat\beta^{(2)}$. What, then, is the intuition behind this result?
Because by assumption $\operatorname{Var} x^{(1)} > \operatorname{Var} x^{(2)}$, on average each $x_i^{(1)}$ will be farther away from $\bar{x}$ than is the case, on average, for $x_i^{(2)}$. Let us denote the expected average absolute difference between $x_i$ and $\bar{x}$ by $d_x$. The assumption that $\operatorname{Var} x^{(1)} > \operatorname{Var} x^{(2)}$ implies that $d_x^{(1)} > d_x^{(2)}$. The bivariate linear regression model, expressed in deviations from means, states that $d_y = \beta d_x^{(1)}$ for Model (1) and $d_y = \beta d_x^{(2)}$ for Model (2). If $\beta \neq 0$, this means that the deterministic component of Model (1), $\beta d_x^{(1)}$, has a greater influence on $d_y$ than does the deterministic component of Model (2), $\beta d_x^{(2)}$. Recall that both models are assumed to satisfy the Gauss-Markov assumptions, that the error variances are the same in both models, and that $\beta^{(1)} = \beta^{(2)} = \beta$. Since Model (1) imparts more information about the contribution of the deterministic component of $y$ than does Model (2), it follows that the precision with which the deterministic contribution can be estimated is greater for Model (1) than for Model (2). The counterpart of greater precision is a lower variance of the point estimate of $\beta$.
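A small Monte Carlo experiment can make the Model (1) versus Model (2) comparison tangible. This is a sketch under assumptions that are not in the answer itself: Gaussian errors (the answer only requires iid errors with mean 0 and variance $\sigma^2$), the particular values of $\alpha$, $\beta$, $\sigma$, the two designs, and the helper names are all invented for illustration.

```python
import random
from statistics import fmean, pvariance

def ols_slope(x, y):
    """OLS slope estimate for a simple regression with intercept."""
    xbar, ybar = fmean(x), fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

def simulate_slopes(x, alpha, beta, sigma, reps, rng):
    """Redraw the errors `reps` times for a fixed design x; collect slope estimates."""
    slopes = []
    for _ in range(reps):
        y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes

rng = random.Random(0)           # fixed seed so the run is reproducible
alpha, beta, sigma = 1.0, 2.0, 1.0

x1 = [-4.0, -3.0, -2.0, -1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]   # Model (1): wide spread
x2 = [xi / 4 for xi in x1]                                    # Model (2): same mean, narrow spread

var1 = pvariance(simulate_slopes(x1, alpha, beta, sigma, 2000, rng))
var2 = pvariance(simulate_slopes(x2, alpha, beta, sigma, 2000, rng))

# Theoretical values: sigma^2 / sum (x_i - xbar)^2
theory1 = sigma ** 2 / sum(xi ** 2 for xi in x1)
theory2 = sigma ** 2 / sum(xi ** 2 for xi in x2)

print(var1, theory1)
print(var2, theory2)
```

Both sets of estimates center on the same $\beta$, but the estimates from the wide design scatter far less around it, in line with the theoretical variances.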
It is reasonably straightforward to generalize the intuition obtained from studying the simple regression model to the general multiple linear regression model. The main complication is that instead of comparing scalar variances, it is necessary to compare the "size" of variance-covariance matrices. Having a good working knowledge of determinants, traces and eigenvalues of real symmetric matrices comes in very handy at this point :-)
Source
Say we have $n$ observations (the sample size) and $p$ parameters.
The covariance matrix $\operatorname{Var}(\hat\beta)$ of the estimated parameters $\hat\beta_1, \hat\beta_2$, etc. is a representation of the accuracy of the estimated parameters.
If, in an ideal world, the data could be perfectly described by the model, then the noise variance would be $\sigma^2 = 0$. Now, the diagonal entries of $\operatorname{Var}(\hat\beta)$ correspond to $\operatorname{Var}(\hat\beta_1), \operatorname{Var}(\hat\beta_2)$, etc. The derived formula for the variance agrees with the intuition that if the noise is lower, the estimates will be more accurate.
In addition, as the number of measurements gets larger, the variance of the estimated parameters will decrease. The reason is that $X^T$ has $n$ columns and $X$ has $n$ rows, so each entry of $X^T X$ is a sum of $n$ pairwise products. As $n$ grows, the entries of $X^T X$ therefore tend to grow in absolute value, and the entries of the inverse $(X^T X)^{-1}$ tend to shrink.
Hence, even if there is a lot of noise, we can still reach good estimates $\hat\beta_i$ of the parameters if we increase the sample size $n$.
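The effect of the sample size on $(X^T X)^{-1}$ can be seen in a toy design with an intercept and one predictor. In this sketch the predictor simply cycles through the values 0 to 4, which is an arbitrary choice made so that the spread of $x$ stays fixed while $n$ grows; the function name is hypothetical.

```python
def inverse_diagonal(n):
    """Diagonal of (X^T X)^{-1} for design columns [1, x], with x cycling 0..4."""
    x = [float(i % 5) for i in range(n)]
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx          # determinant of the 2x2 matrix X^T X
    return sxx / det, n / det        # entries for the intercept and the slope

for n in (10, 100, 1000):
    print(n, inverse_diagonal(n))
```

Each tenfold increase in $n$ makes the entries of $X^T X$ ten times larger and the diagonal of the inverse ten times smaller, so the parameter variances $\sigma^2 (X^T X)^{-1}$ shrink at the rate $1/n$ for any fixed noise level.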
I hope this helps.
Reference: Cosentino, Carlo, and Declan Bates. Feedback Control in Systems Biology. CRC Press, 2011. Section 7.3, "Least squares".
Source
This builds on @Alecos Papadopoulos' answer.
Recall that the result of a least-squares regression doesn't depend on the units of measurement of your variables. Suppose your X-variable is a length measurement, given in inches. Then rescaling X, say by multiplying by 2.54 to change the unit to centimeters, doesn't materially affect things. If you refit the model, the new regression estimate will be the old estimate divided by 2.54.
For a centered regressor, the $X'X$ matrix is ($n$ times) the sample variance of $X$, and hence reflects the scale of measurement of $X$. If you change the scale, you have to reflect this in your estimate of $\beta$, and this is done by multiplying by the inverse of $X'X$.
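The unit-change argument can be sketched for a single centered regressor with no constant term, where $\hat\beta = (X'X)^{-1} X'y$ reduces to a ratio of two sums. The measurements below are invented for illustration, as is the helper name.

```python
def fit_centered(x, y):
    """No-intercept OLS on a centered regressor: returns (beta_hat, X'X)."""
    xtx = sum(xi * xi for xi in x)
    xty = sum(xi * yi for xi, yi in zip(x, y))
    return xty / xtx, xtx

x_inches = [-2.0, -1.0, 0.0, 1.0, 2.0]    # centered lengths in inches (made up)
y = [-3.9, -2.1, 0.2, 1.8, 4.1]           # made-up responses

x_cm = [2.54 * xi for xi in x_inches]     # same lengths expressed in centimeters

slope_in, xtx_in = fit_centered(x_inches, y)
slope_cm, xtx_cm = fit_centered(x_cm, y)

print(slope_in, slope_cm, slope_in / 2.54)
print(xtx_in, xtx_cm, 2.54 ** 2 * xtx_in)
```

Rescaling $X$ by 2.54 multiplies $X'X$ by $2.54^2$ and $X'y$ by 2.54, so the estimate is divided by 2.54 and its variance $\sigma^2 (X'X)^{-1}$ by $2.54^2$.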
Source