LASSO and ridge from the Bayesian perspective: what about the tuning parameter?

Penalised regression estimators such as LASSO and ridge are said to correspond to Bayesian estimators with certain priors.

Yes, that is correct. Whenever we have an optimisation problem involving maximisation of the log-likelihood function minus a penalty function on the parameters, this is mathematically equivalent to posterior maximisation where the negative of the penalty function is taken to be the logarithm of a prior kernel.$^\dagger$ To see this, suppose we have a penalty function $w$ using a tuning parameter $\lambda$. The objective function in these cases can be written as:

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta|\lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta|\mathbf{x}, \lambda) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

where we use the prior $\pi(\theta|\lambda) \propto \exp ( -w(\theta|\lambda))$. Observe here that the tuning parameter in the optimisation is treated as a fixed hyperparameter in the prior distribution. If you undertake classical optimisation with a fixed tuning parameter, this is equivalent to undertaking a Bayesian optimisation with a fixed hyperparameter. For LASSO and ridge regression the penalty functions are $w(\theta|\lambda) = \lambda \sum_k |\theta_k|$ and $w(\theta|\lambda) = \lambda \sum_k \theta_k^2$ respectively, and the corresponding prior equivalents are:

$\begin{equation} \begin{aligned} \text{LASSO Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Laplace} \Big( 0, \frac{1}{\lambda} \Big) = \prod_{k=1}^m \frac{\lambda}{2} \cdot \exp ( -\lambda |\theta_k| ), \\[6pt] \text{Ridge Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Normal} \Big( 0, \frac{1}{2\lambda} \Big) = \prod_{k=1}^m \sqrt{\lambda/\pi} \cdot \exp ( -\lambda \theta_k^2 ). \\[6pt] \end{aligned} \end{equation}$

The former method penalises the regression coefficients according to their absolute magnitude, which is equivalent to imposing a Laplace prior centred at zero. The latter method penalises the regression coefficients according to their squared magnitude, which is equivalent to imposing a normal prior centred at zero.
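As a quick numerical check of the ridge correspondence, here is a minimal sketch. It assumes a Gaussian linear model with known noise variance (`sigma2`) and simulated data; the helper names are hypothetical and not part of the original answer. It compares the maximiser of the penalised log-likelihood with the posterior mode under the $\text{Normal}(0, 1/(2\lambda))$ prior.

```python
import numpy as np

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, m = 50, 3
sigma2 = 1.0  # assumed known noise variance
lam = 0.7     # tuning parameter, equivalently the prior hyperparameter
X = rng.normal(size=(n, m))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

def penalised_estimate(X, y, lam, sigma2):
    """Maximiser of l(theta) - lam * sum(theta**2), where
    l(theta) = -||y - X theta||^2 / (2 * sigma2) + const."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * sigma2 * lam * np.eye(m), X.T @ y)

def posterior_mode(X, y, lam, sigma2):
    """Mode (equal to the mean) of the Gaussian posterior under the prior
    theta_k ~ Normal(0, 1 / (2 * lam)), using the conjugate formula."""
    m = X.shape[1]
    prior_precision = 2.0 * lam * np.eye(m)
    posterior_precision = X.T @ X / sigma2 + prior_precision
    return np.linalg.solve(posterior_precision, X.T @ y / sigma2)

def log_posterior_gradient(theta, X, y, lam, sigma2):
    """Gradient in theta of log-likelihood plus log prior density."""
    return X.T @ (y - X @ theta) / sigma2 - 2.0 * lam * theta

theta_pen = penalised_estimate(X, y, lam, sigma2)
theta_map = posterior_mode(X, y, lam, sigma2)
print(theta_pen)
print(theta_map)
print(log_posterior_gradient(theta_pen, X, y, lam, sigma2))
```

The two estimates should agree up to floating-point error, and the gradient of the log-posterior should vanish at the penalised estimate. Note that with this scaling of the log-likelihood the familiar ridge constant is $2\sigma^2\lambda$ rather than $\lambda$ itself.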

Now, a frequentist would optimise the tuning parameter by cross-validation. Is there a Bayesian equivalent of doing so, and is it used at all?

So long as the frequentist method can be posed as an optimisation problem (rather than, say, involving a hypothesis test or something like that), there will be a Bayesian analogy using an equivalent prior. Just as the frequentist may treat the tuning parameter $\lambda$ as unknown and estimate it from the data, the Bayesian may similarly treat the hyperparameter $\lambda$ as unknown. In a full Bayesian analysis this would involve giving the hyperparameter its own prior $\pi(\lambda) \propto \exp(-h(\lambda))$ and finding the posterior maximum under this prior, which would be analogous to maximising the following objective function:

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - h(\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \cdot \exp ( -h(\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda)}{\iint L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda) \, d\theta \, d\lambda} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta, \lambda|\mathbf{x}) + \text{const}. \\[6pt] \end{aligned} \end{equation}$

This method is indeed used in Bayesian analysis in cases where the analyst is not comfortable choosing a specific hyperparameter for their prior, and seeks to make the prior more diffuse by treating the hyperparameter as unknown and giving it a distribution. (Note that this is just an implicit way of giving a more diffuse prior to the parameter of interest $\theta$.)
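As a worked illustration (a sketch under stated assumptions, not part of the original derivation), take the ridge prior above together with a hypothetical exponential hyperprior $\pi(\lambda) = \beta e^{-\beta \lambda}$, where the rate $\beta > 0$ is introduced only for this example. Because $\lambda$ now varies, the normalising factor $\sqrt{\lambda/\pi}$ of the ridge prior is no longer a constant and has to be carried along, which amounts to taking $w(\theta|\lambda) = \lambda \sum_k \theta_k^2 - \tfrac{m}{2} \ln \lambda$ and $h(\lambda) = \beta \lambda$:

$\begin{equation} \begin{aligned} \ln \pi(\theta, \lambda|\mathbf{x}) &= \ell_\mathbf{x}(\theta) - \lambda \sum_{k=1}^m \theta_k^2 + \frac{m}{2} \ln \lambda - \beta \lambda + \text{const}, \\[6pt] 0 = \frac{\partial}{\partial \lambda} \ln \pi(\theta, \lambda|\mathbf{x}) &= - \sum_{k=1}^m \theta_k^2 + \frac{m}{2 \lambda} - \beta \quad \implies \quad \hat{\lambda}(\theta) = \frac{m}{2 \big( \beta + \sum_{k=1}^m \theta_k^2 \big)}. \\[6pt] \end{aligned} \end{equation}$

Under these assumptions the maximising hyperparameter for a given $\theta$ is available in closed form, and it shrinks as the coefficients grow, so the joint maximisation penalises large coefficients less heavily than a fixed large $\lambda$ would.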

(Comment from statslearner2 below) I'm looking for numerically equivalent MAP estimates. For instance, for a fixed-penalty ridge there is a Gaussian prior that will give me the MAP estimate exactly equal to the ridge estimate. Now, for k-fold CV ridge, what is the hyper-prior that would give me the MAP estimate that is similar to the CV-ridge estimate?

Before proceeding to look at $K$-fold cross-validation, it is first worth noting that, mathematically, the maximum a posteriori (MAP) method is simply an optimisation of a function of the parameter $\theta$ and the data $\mathbf{x}$. If you are willing to allow improper priors then the scope encapsulates any optimisation problem involving a function of these variables. Thus, any frequentist method that can be framed as a single optimisation problem of this kind has a MAP analogy, and any frequentist method that cannot be framed as a single optimisation of this kind does not have a MAP analogy.

In the above form of model, involving a penalty function with a tuning parameter, $K$-fold cross-validation is commonly used to estimate the tuning parameter $\lambda$. For this method you partition the data vector $\mathbf{x}$ into $K$ sub-vectors $\mathbf{x}_1,...,\mathbf{x}_K$. For each sub-vector $k=1,...,K$ you fit the model with the "training" data $\mathbf{x}_{-k}$ and then measure the fit of the model with the "testing" data $\mathbf{x}_k$. In each fit you get an estimator for the model parameters, which then gives you predictions of the testing data, which can then be compared to the actual testing data to give a measure of "loss":

$\begin{matrix} \text{Estimator} & & \hat{\theta}(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Predictions} & & \hat{\mathbf{x}}_k(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Testing loss} & & \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda). \\[6pt] \end{matrix}$

The loss measures for each of the $K$ "folds" can then be aggregated to get an overall loss measure for the cross-validation:

$\mathscr{L}(\mathbf{x}, \lambda) = \sum_k \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda)$

One then estimates the tuning parameter by minimising the overall loss measure:

$\hat{\lambda} \equiv \hat{\lambda}(\mathbf{x}) \equiv \underset{\lambda}{\text{arg min }} \mathscr{L}(\mathbf{x}, \lambda).$
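The cross-validation steps above can be sketched as follows for ridge regression. This is an illustrative sketch only: the simulated data, the squared-error testing loss, the contiguous (unshuffled) folds, the grid of candidate values, and the helper names `ridge_fit` and `cv_loss` are all assumptions made for the example. The penalty is written as $\lambda \lVert\theta\rVert^2$ added to the residual sum of squares, so `lam` absorbs the factor $2\sigma^2$ from the likelihood scaling.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimiser over theta of ||y - X theta||^2 + lam * ||theta||^2."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

def cv_loss(X, y, lam, K=5):
    """Overall loss: squared prediction error summed over the K folds."""
    folds = np.array_split(np.arange(len(y)), K)
    total = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False                            # hold out fold k
        theta_hat = ridge_fit(X[train], y[train], lam)     # estimator
        predictions = X[test_idx] @ theta_hat              # predictions
        total += np.sum((y[test_idx] - predictions) ** 2)  # testing loss
    return total

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=60)

# Minimise the overall loss over a grid of candidate tuning parameters.
lambdas = np.logspace(-3, 3, 25)
losses = np.array([cv_loss(X, y, lam) for lam in lambdas])
lam_hat = lambdas[np.argmin(losses)]
print(lam_hat)
```

A grid search is used here for simplicity; any one-dimensional minimiser over $\lambda$ would play the same role.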

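Once the tuning parameter has been estimated, the parameter $\theta$ is estimated by maximising the penalised objective conditional on $\hat{\lambda}$, so the overall procedure involves two separate optimisation problems. To form a MAP analogy, these need to be combined into a single objective function in $\theta$ and $\lambda$. One way to do this is to add a weighted version of the cross-validation loss to the penalised objective:

$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda), \\[6pt] \end{aligned} \end{equation}$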

where $\delta > 0$ is a weighting value on the tuning-loss. As $\delta \rightarrow \infty$ the weight on optimisation of the tuning-loss becomes infinite and so the optimisation problem yields the estimated tuning parameter from $K$-fold cross-validation (in the limit). The remaining part of the objective function is the standard objective function conditional on this estimated value of the tuning parameter. Now, unfortunately, taking $\delta = \infty$ screws up the optimisation problem, but if we take $\delta$ to be a very large (but still finite) value, we can approximate the combination of the two optimisation problems up to arbitrary accuracy.
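The effect of the weight $\delta$ can be illustrated numerically. The sketch below is hedged: it assumes ridge regression with unit noise variance, simulated data, a finite grid of candidate $\lambda$ values, and hypothetical helper names. For each $\lambda$ the parameter $\theta$ is set to its maximising value (the loss term does not depend on $\theta$), and the combined objective is then maximised over the grid.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Maximiser over theta of -0.5*||y - X theta||^2 - 0.5*lam*||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_loss(X, y, lam, K=5):
    """Squared-error testing loss summed over K contiguous folds."""
    total = 0.0
    for test_idx in np.array_split(np.arange(len(y)), K):
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        theta_hat = ridge_fit(X[train], y[train], lam)
        total += np.sum((y[test_idx] - X[test_idx] @ theta_hat) ** 2)
    return total

def combined_objective(X, y, lam, delta):
    """l(theta) - w(theta|lam) - delta * CV loss, evaluated at the theta
    that maximises it for this lam (unit noise variance assumed)."""
    theta = ridge_fit(X, y, lam)
    log_lik = -0.5 * np.sum((y - X @ theta) ** 2)
    penalty = 0.5 * lam * np.sum(theta ** 2)
    return log_lik - penalty - delta * cv_loss(X, y, lam)

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=60)

lambdas = np.logspace(-3, 3, 25)
lam_cv = lambdas[np.argmin([cv_loss(X, y, lam) for lam in lambdas])]

def joint_lambda(delta):
    """Grid value of lam that maximises the combined objective."""
    values = [combined_objective(X, y, lam, delta) for lam in lambdas]
    return lambdas[np.argmax(values)]

for delta in (0.0, 1.0, 1e10):
    print(delta, joint_lambda(delta), lam_cv)
```

With $\delta = 0$ the combined objective simply favours the smallest penalty on the grid, whereas for a very large $\delta$ the selected value should coincide with the cross-validation choice `lam_cv`, in line with the limiting argument above.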

From the above analysis we can see that it is possible to form a MAP analogy to the model-fitting and $K$-fold cross-validation process. This is not an exact analogy, but it is a close analogy, up to arbitrary accuracy. It is also important to note that the MAP analogy no longer shares the same likelihood function as the original problem, since the loss function depends on the data and is thus absorbed as part of the likelihood rather than the prior. In fact, the full analogy is as follows:

$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda)}{\iint L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda) \, d\theta \, d\lambda} \Bigg) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

where $L_\mathbf{x}^*(\theta, \lambda) \propto \exp( \ell_\mathbf{x}(\theta) - \delta \mathscr{L}(\mathbf{x}, \lambda))$ and $\pi (\theta, \lambda) \propto \exp( -w(\theta|\lambda))$, with a fixed (and very large) hyper-parameter $\delta$.

$^\dagger$ This gives an improper prior in cases where the penalty does not correspond to the logarithm of a sigma-finite density.

Reinstate Monica

Ok +1 already, but for the bounty I'm looking for these more precise answers.

statslearner2

1. I do not get how (since frequentists generally use classical hypothesis tests, etc., which have no Bayesian equivalent) connects to the rest of what I or you are saying; parameter tuning has nothing to do with hypothesis tests, or does it? 2. Do I understand you correctly that there is no Bayesian equivalent to frequentist regularized estimation when the tuning parameter is selected by cross validation? What about empirical Bayes that amoeba mentions in the comments to the OP?

Richard Hardy

3. Since regularization with cross validation seems to be quite effective for, say, prediction, doesn't point 2. suggest that the Bayesian approach is somehow inferior?

Richard Hardy

@Ben, thanks for your explicit answer and the subsequent clarifications. You have once again done a wonderful job! Regarding 3., yes, it was quite a jump; it certainly is not a strict logical conclusion. But looking at your points w.r.t. 2. (that a Bayesian method can approximate the frequentist penalized optimization with cross validation), I no longer think that Bayesian must be "inferior". The last quibble on my side is, could you perhaps explain how the last, complicated formula could arise in practice in the Bayesian paradigm? Is it something people would normally use or not?

Richard Hardy

@Ben (ctd) My problem is that I know little about Bayes. Once it gets technical, I may easily lose the perspective. So I wonder whether this complicated analogy (the last formula) is something that is just a technical possibility or rather something that people routinely use. In other words, I am interested in whether the idea behind cross validation (here in the context of penalized estimation) is resounding in the Bayesian world, whether its advantages are utilized there. Perhaps this could be a separate question, but a short description will suffice for this particular case.

Richard Hardy
