Proof of the LOOCV formula


According to An Introduction to Statistical Learning by James et al., the leave-one-out cross-validation (LOOCV) estimate is defined by

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\text{MSE}_i,$$
where $\text{MSE}_i = (y_i - \hat{y}_i)^2$.

Without proof, equation (5.2) states that for a least-squares linear or polynomial regression (I don't know whether this applies only to regression on a single variable),

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2,$$
where "y^i is the ith fitted value from the original least squares fit (no idea what this means, by the way, does it mean from using all of the points in the data set?) and hi is the leverage" which is defined by
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}.$$

How does one prove this?

My attempt: one could start by noticing that

$$\hat{y}_i = \beta_0 + \sum_{i=1}^{k}\beta_k X_k + \text{some polynomial terms of degree 2},$$
but apart from this (and if I recall correctly, that formula for $h_i$ is only true for simple linear regression...), I'm not sure how to proceed from here.
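Before any proof, it may help to see the claim numerically. The following is a small sketch of my own (not from ISLR), assuming NumPy is available; it fits a degree-2 polynomial by least squares and compares brute-force LOOCV against the shortcut in (5.2). All variable names are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=n)

# Design matrix for a degree-2 polynomial least-squares fit: [1, x, x^2].
X = np.column_stack([np.ones(n), x, x**2])

# Full-sample fit, residuals, and leverages h_i (diagonal of the hat matrix).
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Shortcut (5.2): CV_(n) = (1/n) * sum_i ((y_i - yhat_i) / (1 - h_i))^2
cv_shortcut = np.mean((resid / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving out observation i.
cv_brute = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    cv_brute += (y[i] - X[i] @ b_i) ** 2 / n

print(cv_shortcut, cv_brute)  # the two values agree up to floating-point error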
Clarinetist
Either your equations use $i$ for more than one thing, or I'm highly confused. Either way, additional clarity would be good.
Glen_b -Reinstate Monica
@Glen_b I just learned about LOOCV yesterday, so I might not understand some things correctly. From what I understand, you have a set of data points, say $X = \{(x_i, y_i) : i \in \mathbb{Z}^{+}\}$. With LOOCV, you have for each fixed (positive integer) $k$ some validation set $V_k = \{(x_k, y_k)\}$ and a test set $T_k = X \setminus V_k$ used to generate a fitted model for each $k$. So say, for example, we fit our model using simple linear regression with three data points, $X = \{(0,1), (1,2), (2,3)\}$. We would have (to be continued)
Clarinetist
@Glen_b $V_1 = \{(0,1)\}$ and $T_1 = \{(1,2),(2,3)\}$. Using the points in $T_1$, we can find that, using a simple linear regression, we get the model $\hat{y}_i = X + 1$. Then we compute the MSE using $V_1$ as the validation set and get $y_1 = 1$ (just using the point given) and $\hat{y}_1^{(1)} = 0 + 1 = 1$, giving $\text{MSE}_1 = 0$. Okay, maybe using the superscript wasn't the best idea - I will change this in the original post.
Clarinetist
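For what it's worth, the three-point example in the comments above can be replicated numerically; a quick sketch of my own, assuming NumPy (np.polyfit is just one convenient way to fit the simple linear regression):

```python
import numpy as np

# The three data points from the comment: X = {(0,1), (1,2), (2,3)}.
pts = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])

# Leave out the first point; fit a simple linear regression on the other two.
x_train, y_train = pts[1:, 0], pts[1:, 1]
slope, intercept = np.polyfit(x_train, y_train, deg=1)  # gives y = x + 1

# Validate on the held-out point (0, 1).
x_val, y_val = pts[0]
mse_1 = (y_val - (slope * x_val + intercept)) ** 2
print(slope, intercept, mse_1)  # approximately 1.0, 1.0, 0.0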
Here are some lecture notes on the derivation: pages.iu.edu/~dajmcdon/teaching/2014spring/s682/lectures/…
Xavier Bourret Sicotte

Answers:


I'll show the result for any multiple linear regression, whether the regressors are polynomials of $X_t$ or not. In fact, the argument shows a little more than what you asked, because it shows that each LOOCV residual is identical to the corresponding leverage-weighted residual from the full regression, not just that you can obtain the LOOCV error as in (5.2) (there could be other ways in which the averages agree even if the individual terms in the average are not the same).

Let me take the liberty to use slightly adapted notation.

We first show that

$$\hat{\beta} - \hat{\beta}_{(t)} = \left(\frac{\hat{u}_t}{1 - h_t}\right)(X'X)^{-1}X_t', \qquad (A)$$
where $\hat{\beta}$ is the estimate using all data and $\hat{\beta}_{(t)}$ the estimate when leaving out $X_{(t)}$, observation $t$. Let $X_t$ be defined as a row vector such that $\hat{y}_t = X_t\hat{\beta}$. The $\hat{u}_t$ are the residuals.

The proof uses the following matrix algebraic result.

Let $A$ be a nonsingular matrix, $b$ a vector and $\lambda$ a scalar. If

$$\lambda \neq -\frac{1}{b'A^{-1}b},$$

then

$$(A + \lambda bb')^{-1} = A^{-1} - \left(\frac{\lambda}{1 + \lambda b'A^{-1}b}\right)A^{-1}bb'A^{-1}. \qquad (B)$$

The proof of (B) follows immediately from verifying

$$\left\{A^{-1} - \left(\frac{\lambda}{1 + \lambda b'A^{-1}b}\right)A^{-1}bb'A^{-1}\right\}(A + \lambda bb') = I.$$
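As a quick sanity check of (B), one can verify it numerically for the case used below, $A = X'X$, $b = X_t'$ and $\lambda = -1$; this is a sketch of my own, assuming NumPy and arbitrary simulated data. The condition in (B) holds here because $b'A^{-1}b$ is a leverage and hence strictly less than one.

```python
import numpy as np

rng = np.random.default_rng(1)
T, k = 20, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])

A = X.T @ X        # plays the role of A in (B)
b = X[[0]].T       # b = X_t' (a column vector), here for t = 0
lam = -1.0         # leaving out one observation corresponds to lambda = -1

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + lam * b @ b.T)
rhs = A_inv - (lam / (1 + lam * (b.T @ A_inv @ b).item())) * (A_inv @ b @ b.T @ A_inv)
print(np.allclose(lhs, rhs))  # True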

The following result is helpful to prove (A)

$$(X_{(t)}'X_{(t)})^{-1}X_t' = \left(\frac{1}{1 - h_t}\right)(X'X)^{-1}X_t'. \qquad (C)$$

Proof of (C): By (B) we have, using $\sum_{t=1}^{T}X_t'X_t = X'X$,

$$(X_{(t)}'X_{(t)})^{-1} = (X'X - X_t'X_t)^{-1} = (X'X)^{-1} + \frac{(X'X)^{-1}X_t'X_t(X'X)^{-1}}{1 - X_t(X'X)^{-1}X_t'}.$$
So we find
$$(X_{(t)}'X_{(t)})^{-1}X_t' = (X'X)^{-1}X_t' + (X'X)^{-1}X_t'\left(\frac{X_t(X'X)^{-1}X_t'}{1 - X_t(X'X)^{-1}X_t'}\right) = \left(\frac{1}{1 - h_t}\right)(X'X)^{-1}X_t'.$$
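The same kind of numerical sanity check works for (C); again a sketch of my own, assuming NumPy and simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
T, k = 20, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])

t = 0
Xt = X[[t]]                           # the row vector X_t, shape (1, k)
X_wo_t = np.delete(X, t, axis=0)      # X_(t): X with row t removed

h_t = (Xt @ np.linalg.inv(X.T @ X) @ Xt.T).item()
lhs = np.linalg.inv(X_wo_t.T @ X_wo_t) @ Xt.T
rhs = (1.0 / (1.0 - h_t)) * np.linalg.inv(X.T @ X) @ Xt.T
print(np.allclose(lhs, rhs))  # True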

The proof of (A) now follows from (C): As

$$X'X\hat{\beta} = X'y,$$
we have
$$(X_{(t)}'X_{(t)} + X_t'X_t)\hat{\beta} = X_{(t)}'y_{(t)} + X_t'y_t,$$
or, using the normal equations of the leave-one-out fit, $X_{(t)}'y_{(t)} = X_{(t)}'X_{(t)}\hat{\beta}_{(t)}$, writing $y_t = X_t\hat{\beta} + \hat{u}_t$, and premultiplying by $(X_{(t)}'X_{(t)})^{-1}$,
$$\left\{I_k + (X_{(t)}'X_{(t)})^{-1}X_t'X_t\right\}\hat{\beta} = \hat{\beta}_{(t)} + (X_{(t)}'X_{(t)})^{-1}X_t'(X_t\hat{\beta} + \hat{u}_t).$$
So,
$$\hat{\beta} = \hat{\beta}_{(t)} + (X_{(t)}'X_{(t)})^{-1}X_t'\hat{u}_t = \hat{\beta}_{(t)} + \frac{(X'X)^{-1}X_t'\hat{u}_t}{1 - h_t},$$
where the last equality follows from (C).

Now, note $h_t = X_t(X'X)^{-1}X_t'$. Multiply through in (A) by $X_t$, add $y_t$ on both sides and rearrange to get, with $\hat{u}_{(t)}$ the residual resulting from using $\hat{\beta}_{(t)}$ (i.e., $y_t - X_t\hat{\beta}_{(t)}$),

$$\hat{u}_{(t)} = \hat{u}_t + \left(\frac{\hat{u}_t}{1 - h_t}\right)h_t,$$
or
$$\hat{u}_{(t)} = \frac{\hat{u}_t(1 - h_t) + \hat{u}_t h_t}{1 - h_t} = \frac{\hat{u}_t}{1 - h_t}.$$
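To tie the derivation back to the question, here is a numerical sketch (my own, assuming NumPy and simulated data) checking both (A) and the final identity $\hat{u}_{(t)} = \hat{u}_t/(1 - h_t)$ observation by observation:

```python
import numpy as np

rng = np.random.default_rng(3)
T, k = 25, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat                       # full-sample residuals u_t
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # leverages h_t = X_t (X'X)^{-1} X_t'

for t in range(T):
    X_wo_t, y_wo_t = np.delete(X, t, axis=0), np.delete(y, t)
    beta_t = np.linalg.solve(X_wo_t.T @ X_wo_t, X_wo_t.T @ y_wo_t)  # beta_hat_(t)
    # (A): beta_hat - beta_hat_(t) = (u_t / (1 - h_t)) * (X'X)^{-1} X_t'
    assert np.allclose(beta_hat - beta_t, (u_hat[t] / (1 - h[t])) * (XtX_inv @ X[t]))
    # Final identity: the LOOCV residual equals the leverage-weighted full residual.
    assert np.isclose(y[t] - X[t] @ beta_t, u_hat[t] / (1 - h[t]))
print("all per-observation checks passed")
```

Averaging the squares of $\hat{u}_t/(1 - h_t)$ over $t$ then gives exactly (5.2).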
Christoph Hanck
The definition of $X_{(t)}$ is missing in your answer. I assume this is the matrix $X$ with row $X_t$ removed.
mpiktas
Also, mentioning the fact that $X'X = \sum_{t=1}^{T}X_t'X_t$ would be helpful.
mpiktas
@mpiktas, yes, thanks for the pointers. I edited to take the first comment into account. Where exactly would the second help? Or just leave it in your comment?
Christoph Hanck
When you start the proof of (C) you write $(X_{(t)}'X_{(t)})^{-1} = (X'X - X_t'X_t)^{-1}$. That is a nice trick, but I doubt that a casual reader is aware of it.
mpiktas
Two years later... I appreciate this answer even more, now that I've gone through a graduate-level linear models sequence. I'm re-learning this material with this new perspective. Do you have any suggested references (textbooks?) that go through derivations like the one in this answer in detail?
Clarinetist