Is minimizing squared error equivalent to minimizing absolute error? Why is squared error more popular than the latter?

39

When we fit a linear regression y = ax + b to a set of data points (x1, y1), (x2, y2), ..., (xn, yn), the classical approach minimizes the squared error. I have long been puzzled by a question: will minimizing the squared error give the same result as minimizing the absolute error? If not, why is minimizing the squared error preferable? Is there any reason other than "the objective function is differentiable"?

Squared error is also widely used to evaluate model performance, while absolute error is less popular. Why is squared error more commonly used than absolute error? If taking derivatives is not involved, calculating absolute error is as easy as calculating squared error, so why is squared error so prevalent? Is there any unique advantage that can explain its prevalence?

Thanks.

Tony
There is always an optimization problem behind it, and you want to be able to compute gradients to find the minimum/maximum.
Vladislavs Dovgalecs
11
x^2 < |x| for x in (-1, 1) and x^2 > |x| for |x| > 1. Thus, squared error penalizes large errors more than absolute error does and is more forgiving of small errors than absolute error is. This accords well with what many think is an appropriate way of doing things.
Dilip Sarwate

Answers:

47

Minimizing squared error (MSE) is definitely not the same as minimizing the absolute deviations (MAD) of the errors. MSE corresponds to the mean response of y conditioned on x, while MAD corresponds to the median response of y conditioned on x.
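To see this concretely, here is a minimal sketch of my own (not from the answer), fitting a constant to a skewed sample: the squared-error minimizer lands on the sample mean, while the absolute-error minimizer lands on the sample median.

```python
# Minimal sketch (assumed setup, not from the original answer): for a constant
# estimate c of a skewed sample, the c minimizing the sum of squared errors is
# the sample mean, while the c minimizing the sum of absolute errors is the
# sample median.
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1001)   # skewed data, so mean != median

grid = np.linspace(y.min(), y.max(), 4001)
sse = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)   # squared-error objective
sae = np.abs(y[:, None] - grid[None, :]).sum(axis=0)    # absolute-error objective

print("argmin squared error :", grid[sse.argmin()], " vs sample mean  :", y.mean())
print("argmin absolute error:", grid[sae.argmin()], " vs sample median:", np.median(y))
```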

Historically, Laplace originally considered the maximum observed error as a measure of the correctness of a model. He soon moved to considering MAD instead. Because he was unable to solve either problem exactly, he soon turned to the differentiable MSE. He and Gauss (seemingly concurrently) derived the normal equations, a closed-form solution for this problem. Nowadays, solving the MAD problem is relatively easy by means of linear programming. As is well known, however, a linear program does not have a closed-form solution.

From an optimization perspective, both correspond to convex functions. However, MSE is differentiable, which allows for gradient-based methods that are much more efficient than their derivative-free counterparts. MAD is not differentiable at x = 0.
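As a concrete illustration of these two routes, here is a sketch of my own using numpy/scipy (not part of the answer): the OLS coefficients come straight from the normal equations, while the least-absolute-deviations fit is posed as a linear program.

```python
# Sketch: OLS via the closed-form normal equations, LAD ("MAD") via linear
# programming, on data with one gross outlier.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n = 60
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0, 1, n)
y[-1] += 40                                  # one gross outlier

X = np.column_stack([x, np.ones(n)])         # design matrix for y = a*x + b

# OLS: solve the normal equations (X'X) beta = X'y in closed form.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# LAD: minimize sum_i t_i  s.t.  -t_i <= y_i - X_i beta <= t_i,
# written as a linear program over the variables (beta, t).
p = X.shape[1]
c = np.concatenate([np.zeros(p), np.ones(n)])
A_ub = np.block([[ X, -np.eye(n)],
                 [-X, -np.eye(n)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * p + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_lad = res.x[:p]

print("OLS slope/intercept:", beta_ols)      # pulled toward the outlier
print("LAD slope/intercept:", beta_lad)      # should stay closer to (2, 1)
```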

A further theoretical reason is that, in a Bayesian setting with uniform priors on the model parameters, minimizing MSE is equivalent to maximum a posteriori (and maximum likelihood) estimation under normally distributed errors, which has been taken as support for the correctness of the method. Theorists like the normal distribution because they believe it is an empirical fact, while experimentalists like it because they believe it is a theoretical result.

A final reason why MSE may have gained the wide acceptance it has is that it is based on the Euclidean distance (in fact, it is the solution of a projection problem in Euclidean space), which is extremely intuitive given our geometric reality.

Asterion
1
(+1) for the reference to Laplace!
Xi'an
2
"Theorists like the normal distribution because they believed it is an empirical fact, while experimentals like it because they believe it a theoretical result." -- I love it. But aren't there also direct physics applications for the Gaussian distribution? And there's also the stuff about maximum entropy distributions
shadowtalker
8
@ssdecontrol I think the epigram is due to Henri Poincaré a little over a hundred years ago. Tout le monde y croit cependant, me disait un jour M. Lippmann, car les expérimentateurs s'imaginent que c'est un théorème de mathématiques, et les mathématiciens que c'est un fait expérimental. "Everyone is sure of this [that errors are normally distributed], Mr. Lippmann told me one day, since the experimentalists believe that it is a mathematical theorem, and the mathematicians that it is an experimentally determined fact." From Calcul des probabilités (2nd ed., 1912), p. 171.
Dilip Sarwate
1
Here is a mathematical answer. If we have a data matrix of independent variables X and a column vector Y, then if there is a vector b with the property Xb = Y, we have an exact solution. Usually we can't, and we want the b that is 'closest' to an exact solution. As mathematics this is 'easy' to solve: it is the projection of Y onto the column space of X. The notions of projection, perpendicularity, etc. depend on the metric. The usual Euclidean L2 metric is what we are used to, and it gives least squares. The minimizing property of MSE is a restatement of the fact that we have the projection (see the sketch below).
aginensky
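A quick numerical check of this projection view (my addition, assuming numpy; not part of the comment):

```python
# The least squares fit X @ beta_hat equals the orthogonal projection of Y
# onto the column space of X, i.e. P @ Y with P = X (X'X)^{-1} X'.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=50)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least squares solution
P = X @ np.linalg.inv(X.T @ X) @ X.T               # projection onto col(X)

print(np.allclose(X @ beta_hat, P @ Y))            # True
print(np.allclose(P @ (Y - X @ beta_hat), 0))      # residual is orthogonal to col(X)
```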
1
I thought the priority disagreement was between Gauss and Legendre, with Legendre preceding Gauss in publishing, but Gauss preceding Legendre in informal correspondence. I'm also (vaguely) aware that Laplace's proof is considered to be superior. Any reference on these?
PatrickT
31

As an alternative explanation, consider the following intuition:

When minimizing errors, we must decide how to penalize them. Indeed, the most straightforward approach would be a linearly proportional penalty function. With such a function, each deviation from the mean is assigned a proportional penalty: being twice as far from the mean results in twice the penalty.

The more common approach is to consider a squared proportional relationship between deviations from the mean and the corresponding penalty. This will make sure that the further you are away from the mean, the proportionally more you will be penalized. Using this penalty function, outliers (far away from the mean) are deemed proportionally more informative than observations near the mean.

To give a visualisation of this, you can simply plot the penalty functions:

Comparison of MAD and MSE penalty functions
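Here is a minimal plotting sketch of my own (assuming matplotlib) that reproduces the kind of figure referenced above:

```python
# Plot the linear (absolute) penalty against the squared penalty.
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(-3, 3, 601)
plt.plot(e, np.abs(e), label="absolute penalty |e|")
plt.plot(e, e ** 2, label="squared penalty e^2")
plt.xlabel("error e")
plt.ylabel("penalty")
plt.legend()
plt.show()
```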

Now, especially when estimating regressions (e.g. OLS), different penalty functions will yield different results. With the linearly proportional penalty function, the regression assigns less weight to outliers than with the squared penalty function. Minimizing absolute deviations is therefore known to give a more robust estimator. In general, a robust estimator fits most of the data points well but 'ignores' outliers, while a least squares fit is pulled more towards the outliers. Here is a visualisation for comparison:

Comparison of OLS vs a robust estimator
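A sketch of this kind of comparison (mine, using scikit-learn's HuberRegressor as the robust estimator rather than anything from the answer): one extreme outlier drags the OLS fit, while the robust fit stays close to the true line.

```python
# OLS versus a robust (Huber) fit on data containing a single large outlier.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 50)
y[0] += 100                                   # one extreme outlier
X = x.reshape(-1, 1)

ols = LinearRegression().fit(X, y)
robust = HuberRegressor().fit(X, y)

print("OLS   slope/intercept:", ols.coef_[0], ols.intercept_)        # dragged by the outlier
print("Huber slope/intercept:", robust.coef_[0], robust.intercept_)  # should stay near (3, 2)
```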

Now even though OLS is pretty much the standard, different penalty functions are most certainly in use as well. As an example, you can take a look at Matlab's robustfit function which allows you to choose a different penalty (also called 'weight') function for your regression. The penalty functions include andrews, bisquare, cauchy, fair, huber, logistic, ols, talwar and welsch. Their corresponding expressions can be found on the website as well.
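For intuition, here are two of the weight functions named above in the forms they are commonly defined. This is an illustrative sketch of my own; the exact tuning constants and scaling used by Matlab's robustfit may differ.

```python
# Common textbook forms of two robust weight functions.
import numpy as np

def huber_weight(r, k=1.345):
    """Weight 1 near zero, then decaying like k/|r| for large residuals."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= k, 1.0, k / np.abs(r))

def bisquare_weight(r, k=4.685):
    """Tukey's bisquare: smooth weights that reach exactly 0 beyond |r| = k."""
    r = np.asarray(r, dtype=float)
    w = (1 - (r / k) ** 2) ** 2
    return np.where(np.abs(r) < k, w, 0.0)

print(huber_weight([0.5, 2.0, 10.0]))     # [1.     0.6725 0.1345]
print(bisquare_weight([0.5, 2.0, 10.0]))  # approx [0.977  0.669  0.]
```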

I hope that helps you in getting a bit more intuition for penalty functions :)

Update

If you have Matlab, I can recommend playing with Matlab's robustdemo, which was built specifically for the comparison of ordinary least squares to robust regression:

robustdemo

The demo allows you to drag individual points and immediately see the impact on both ordinary least squares and robust regression (which is perfect for teaching purposes!).

Jean-Paul
3

As another answer has explained, minimizing squared error is not the same as minimizing absolute error.

The reason minimizing squared error is preferred is that it guards against large errors better.

Say your employer's payroll department accidentally pays each of a total of ten employees $50 less than required. That's an absolute error of $500. It's also an absolute error of $500 if the department pays just one employee $500 less. But in terms of squared error, it's 25000 versus 250000.

It's not always better to use squared error. If you have a data set with an extreme outlier due to a data acquisition error, minimizing squared error will pull the fit towards the extreme outlier much more than minimizing absolute error. That being said, it's usually better to use squared error.

Atsby
4
"The reason minimizing squared error is preferred is that it guards against large errors better." - then why not cubed?
Daniel Earwicker
@DanielEarwicker Cubed makes errors in the wrong direction subtractive. So it would have to be absolute cubed error, or stick to even powers. There is really no "good" reason that squared is used instead of higher powers (or, indeed, non-polynomial penalty functions). It's just easy to calculate, easy to minimize, and does the job.
Atsby
1
Of course I should have said any higher even power! :)
Daniel Earwicker
This has no upvotes (at the moment) but isn't this saying the same as the answer that (currently) has 15 votes (i.e. outliers have more effect)? Is this not getting votes because it is wrong, or because it misses some key info? Or because it does not have pretty graphs? ;-)
Darren Cook
@DarrenCook I suspect the "modern" approach to stats prefers MAD over OLS, and suggesting that squared error is "usually" better earned me some downvotes.
Atsby
3

In theory you could use any kind of loss function. The absolute and the squared loss functions just happen to be the most popular and the most intuitive ones. According to this Wikipedia entry,

A common example involves estimating "location." Under typical statistical assumptions, the mean or average is the statistic for estimating location that minimizes the expected loss experienced under the squared-error loss function, while the median is the estimator that minimizes expected loss experienced under the absolute-difference loss function. Still different estimators would be optimal under other, less common circumstances.

As also explained in the Wikipedia entry, the choice of loss function depends on how you value deviations from your target. If all deviations are equally bad for you no matter their sign, then you could use the absolute loss function. If deviations become worse for you the farther away you are from the optimum, and you don't care whether the deviation is positive or negative, then the squared loss function is your easiest choice. But if none of the above definitions of loss fit your problem at hand, because e.g. small deviations are worse for you than big deviations, then you can choose a different loss function and try to solve the resulting minimization problem. However, the statistical properties of your solution might be hard to assess.
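As one example of such a different loss (my illustration, not from the answer): a tilted absolute ("pinball") loss penalizes under- and over-estimation asymmetrically, and its minimizer is the corresponding quantile rather than the mean or median.

```python
# The pinball loss at level tau is minimized (over a constant estimate) by the
# tau-quantile of the data.
import numpy as np

def pinball_loss(y, c, tau):
    """Mean pinball loss of the constant estimate c at quantile level tau."""
    d = y - c
    return np.mean(np.where(d >= 0, tau * d, (tau - 1) * d))

rng = np.random.default_rng(4)
y = rng.normal(size=5001)

grid = np.linspace(-3, 3, 6001)
tau = 0.9   # under-estimating is 9x as costly as over-estimating
losses = [pinball_loss(y, c, tau) for c in grid]
print("pinball minimizer:", grid[int(np.argmin(losses))])
print("0.9 quantile     :", np.quantile(y, 0.9))
```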

kristjan
A little detail: "If all deviations are equally bad for you no matter their sign ..": the MAD function penalizes errors in linear proportion. Therefore errors are not 'equally bad' but 'proportionally bad', as twice the error gets twice the penalty.
Jean-Paul
@Jean-Paul: You are right, I meant it that way. What I wanted to say with "equally bad" was that the gradient of the MAD is constant while the gradient of the MSE grows linearly with the error. Hence the difference in penalty between two errors is the same no matter how far away from the optimum you are, while the same is not true for the MSE. I hope that makes what I wanted to say a bit more understandable.
kristjan
-1

Short answers

  1. nope
  2. the mean has more interesting statistical properties than the median
ℕʘʘḆḽḘ
10
It would be great if you could qualify "more interesting statistical properties".
Momo