How to prove that the radial basis function is a kernel?

35

How to prove that the radial basis function $k(x,y) = \exp\left(-\frac{\|x-y\|^2}{2\sigma^2}\right)$ is a kernel? As far as I know, in order to prove this we must prove either of the following:

  1. For every set of vectors $x_1, x_2, \dots, x_n$ the matrix $K(x_1, x_2, \dots, x_n) = \big(k(x_i, x_j)\big)_{n \times n}$ is positive semi-definite.

  2. A mapping $\Phi$ can be exhibited such that $k(x,y) = \langle \Phi(x), \Phi(y) \rangle$.
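
Not a proof, but condition 1 can at least be sanity-checked numerically: draw arbitrary points, build the Gram matrix, and inspect its smallest eigenvalue. A minimal sketch in Python (NumPy assumed; the points, $\sigma$, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 arbitrary points in R^3
sigma = 1.5                    # arbitrary bandwidth

# Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Condition 1 says K must be positive semi-definite: its smallest
# eigenvalue should be >= 0, up to floating-point error.
print(np.linalg.eigvalsh(K).min())
```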

Any help?

Leo
1
Just to link it more obviously: the feature map is also discussed in this question, particularly Marc Claesen's answer based on Taylor series and mine which discusses both the RKHS and the general version of the L2 embedding given by Douglas below.
Dougal

Answers:

26

Zen used method 1. Here is method 2: map $x$ to a spherically symmetric Gaussian distribution centered at $x$ in the Hilbert space $L^2$. The standard deviation and a constant factor have to be tweaked for this to work exactly. For example, in one dimension,

$$\int_{-\infty}^{\infty} \frac{\exp\left[-(x-z)^2/(2\sigma^2)\right]}{\sqrt{2\pi}\,\sigma} \, \frac{\exp\left[-(y-z)^2/(2\sigma^2)\right]}{\sqrt{2\pi}\,\sigma} \, dz = \frac{\exp\left[-(x-y)^2/(4\sigma^2)\right]}{2\sqrt{\pi}\,\sigma}.$$

So, use a standard deviation of $\sigma/\sqrt{2}$ and rescale the Gaussian to get $k(x,y) = \langle \Phi(x), \Phi(y) \rangle$. This last rescaling is needed because the $L^2$ norm of a normal density is not $1$ in general.
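
Not part of the argument, but the displayed integral identity is easy to sanity-check numerically; a minimal Python sketch (NumPy and SciPy assumed; the values of $x$, $y$, and $\sigma$ are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

x, y, sigma = 0.3, -1.2, 0.8   # arbitrary values

def gauss(z, c):
    # N(c, sigma^2) density evaluated at z
    return np.exp(-(z - c) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Left-hand side: L^2 inner product of the two embedded Gaussians
lhs, _ = quad(lambda z: gauss(z, x) * gauss(z, y), -np.inf, np.inf)

# Right-hand side of the displayed identity
rhs = np.exp(-(x - y) ** 2 / (4 * sigma ** 2)) / (2 * np.sqrt(np.pi) * sigma)

print(lhs, rhs)   # agree to quadrature precision
```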

Douglas Zare
2
@Zen, Douglas Zare: thank you for your great answers. How am I supposed to select the official answer now?
Leo
23

I will use method 1. Check Douglas Zare's answer for a proof using method 2.

I will prove the case where $x, y$ are real numbers, so $k(x,y) = \exp\big(-(x-y)^2/(2\sigma^2)\big)$. The general case follows mutatis mutandis from the same argument, using $Z \sim N(0, I)$ and the multivariate characteristic function $\mathrm{E}\big[e^{i \langle t, Z \rangle}\big] = e^{-\|t\|^2/2}$, and is worth doing.

Without loss of generality, suppose that $\sigma^2 = 1$.

Write $k(x,y) = h(x - y)$, where

$$h(t) = \exp\left(-\frac{t^2}{2}\right) = \mathrm{E}\left[e^{itZ}\right]$$

is the characteristic function of a random variable $Z$ with $N(0,1)$ distribution.

For real numbers $x_1, \dots, x_n$ and $a_1, \dots, a_n$, we have

$$\sum_{j,k=1}^{n} a_j a_k \, h(x_j - x_k) = \sum_{j,k=1}^{n} a_j a_k \, \mathrm{E}\left[e^{i(x_j - x_k)Z}\right] = \mathrm{E}\left[\sum_{j,k=1}^{n} a_j e^{ix_jZ} \, a_k e^{-ix_kZ}\right] = \mathrm{E}\left[\,\left|\sum_{j=1}^{n} a_j e^{ix_jZ}\right|^2\,\right] \geq 0,$$
which entails that k is a positive semidefinite function, aka a kernel.

To understand this result in greater generality, check out Bochner's Theorem: http://en.wikipedia.org/wiki/Positive-definite_function
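
Not part of the proof, but the chain of equalities above can be sanity-checked by Monte Carlo; a small Python sketch (NumPy assumed; the points, coefficients, sample size, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = rng.normal(size=n)   # arbitrary real numbers x_1, ..., x_n
a = rng.normal(size=n)   # arbitrary coefficients a_1, ..., a_n

# Direct quadratic form sum_{j,k} a_j a_k h(x_j - x_k), with h(t) = exp(-t^2/2)
H = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2)
direct = a @ H @ a

# Monte Carlo estimate of E| sum_j a_j e^{i x_j Z} |^2 with Z ~ N(0, 1)
Z = rng.normal(size=200_000)
S = (a[:, None] * np.exp(1j * np.outer(x, Z))).sum(axis=0)
mc = np.mean(np.abs(S) ** 2)

print(direct, mc)   # close to each other, and both nonnegative
```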

Zen
2
This is a good start, in the right direction, with two caveats: (a) h(t) is not equal to the expectation shown (check the sign in the exponent) and (b) this appears to restrict attention to the case where x and y are scalars and not vectors. I've upvoted in the meantime, because the exposition is nice and clean and I'm sure you'll quickly plug these small gaps. :-)
cardinal
1
Tks! I'm in a hurry here. :-)
Zen
1
Excuse me, I really don't see how you manage the mutatis mutandis here. If you expand the norm before passing to the $h$ form, then you get products and you can't swap products and sums. And I simply don't see how to expand the norm after passing to the $h$ form to obtain a nice expression. Can you lead me a bit there? :)
Alburkerk
23

I'll add a third method, just for variety: building up the kernel from a sequence of general steps known to create pd kernels. Let X denote the domain of the kernels below and φ the feature maps.

  • Scalings: If κ is a pd kernel, so is γκ for any constant γ>0.

    Proof: if $\varphi$ is the feature map for $\kappa$, then $\sqrt{\gamma}\,\varphi$ is a valid feature map for $\gamma\kappa$.

  • Sums: If κ1 and κ2 are pd kernels, so is κ1+κ2.

    Proof: Concatenate the feature maps $\varphi_1$ and $\varphi_2$, to get $x \mapsto \begin{bmatrix} \varphi_1(x) \\ \varphi_2(x) \end{bmatrix}$.

  • Limits: If $\kappa_1, \kappa_2, \dots$ are pd kernels, and $\kappa(x,y) := \lim_{n \to \infty} \kappa_n(x,y)$ exists for all $x, y$, then $\kappa$ is pd.

    Proof: For each $m, n \geq 1$ and every $\{(x_i, c_i)\}_{i=1}^m \subseteq X \times \mathbb{R}$, we have that $\sum_{i,j=1}^m c_i \, \kappa_n(x_i, x_j) \, c_j \geq 0$. Taking the limit as $n \to \infty$ gives the same property for $\kappa$.

  • Products: If κ1 and κ2 are pd kernels, so is g(x,y)=κ1(x,y)κ2(x,y).

    Proof: It follows immediately from the Schur product theorem, but Schölkopf and Smola (2002) give the following nice, elementary proof. Let

    $$(V_1, \dots, V_m) \sim N\big(0, [\kappa_1(x_i, x_j)]_{ij}\big), \qquad (W_1, \dots, W_m) \sim N\big(0, [\kappa_2(x_i, x_j)]_{ij}\big)$$
    be independent. Thus
    $$\operatorname{Cov}(V_i W_i, V_j W_j) = \operatorname{Cov}(V_i, V_j) \operatorname{Cov}(W_i, W_j) = \kappa_1(x_i, x_j)\, \kappa_2(x_i, x_j).$$
    Covariance matrices must be psd, so considering the covariance matrix of $(V_1 W_1, \dots, V_m W_m)$ proves it. (A quick numerical check of these closure properties is sketched just after this list.)
  • Powers: If $\kappa$ is a pd kernel, so is $\kappa_n(x,y) := \kappa(x,y)^n$ for any positive integer $n$.

    Proof: immediate from the "products" property.

  • Exponents: If $\kappa$ is a pd kernel, so is $e^\kappa(x,y) := \exp(\kappa(x,y))$.

    Proof: We have $e^\kappa(x,y) = \lim_{N \to \infty} \sum_{n=0}^{N} \frac{1}{n!} \kappa(x,y)^n$; use the "powers", "scalings", "sums", and "limits" properties.

  • Functions: If $\kappa$ is a pd kernel and $f : X \to \mathbb{R}$, then $g(x,y) := f(x)\,\kappa(x,y)\,f(y)$ is as well.

    Proof: Use the feature map $x \mapsto f(x)\,\varphi(x)$.
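
As referenced in the "products" item, here is a quick numerical check (not a proof) of several of these closure properties at once: sums, elementwise products, powers, and exponents of Gram matrices built from pd kernels should all have nonnegative spectra. A minimal Python sketch with NumPy; the data, dimensions, and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
X = 0.5 * rng.normal(size=(30, 4))   # arbitrary points, scaled to keep exp() moderate

k1 = X @ X.T                                         # linear kernel: pd
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
k2 = np.exp(-sq)                                     # Gaussian kernel: pd

# Sums, Schur (elementwise) products, powers, and exponents of pd Gram
# matrices should all have nonnegative eigenvalues, up to rounding.
for K in (k1 + k2, k1 * k2, k1 ** 3, np.exp(k1)):
    print(np.linalg.eigvalsh(K).min())
```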

Now, note that

$$k(x,y) = \exp\left(-\frac{1}{2\sigma^2}\|x-y\|^2\right) = \exp\left(-\frac{1}{2\sigma^2}\|x\|^2\right) \exp\left(\frac{1}{\sigma^2} x^\mathsf{T} y\right) \exp\left(-\frac{1}{2\sigma^2}\|y\|^2\right).$$
Start with the linear kernel $\kappa(x,y) = x^\mathsf{T} y$, apply "scalings" with $\gamma = \frac{1}{\sigma^2}$, apply "exponents", and apply "functions" with $f(x) = \exp\left(-\frac{1}{2\sigma^2}\|x\|^2\right)$.
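
As a sanity check (not part of the argument), this chain of steps can be carried out numerically and compared against the Gaussian kernel computed directly; a minimal Python sketch with arbitrary data and bandwidth:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))   # arbitrary points
sigma = 1.3                    # arbitrary bandwidth

lin = X @ X.T                              # linear kernel x^T y
scaled = lin / sigma ** 2                  # "scalings" with gamma = 1/sigma^2
expd = np.exp(scaled)                      # "exponents"
f = np.exp(-np.sum(X ** 2, axis=1) / (2 * sigma ** 2))
built = f[:, None] * expd * f[None, :]     # "functions" with f(x) = exp(-||x||^2/(2 sigma^2))

# Direct Gaussian kernel Gram matrix for comparison
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
direct = np.exp(-sq / (2 * sigma ** 2))

print(np.max(np.abs(built - direct)))      # ~ 0 up to floating-point error
```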
Dougal