When we perform a linear regression to fit a set of data points (x1, y1), (x2, y2), ..., (xn, yn), the classic approach minimizes the squared error. I have long been puzzled by a question: will minimizing the squared error give the same result as minimizing the absolute error? If not, why is minimizing the squared error preferable? Is there any reason other than "the objective function is differentiable"?
Squared error is also widely used to evaluate model performance, while absolute error is less popular. Why is squared error more commonly used than absolute error? If taking derivatives is not involved, calculating the absolute error is as easy as calculating the squared error, so why is squared error so prevalent? Is there any unique advantage that explains its prevalence?
Thanks.
Answers:
Minimizing the squared error (MSE) is definitely not the same as minimizing the absolute deviation (MAD) of the errors. MSE provides the mean response of y conditioned on x, while MAD provides the median response of y conditioned on x.
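For a numerical illustration of that distinction, here is a minimal sketch (the sample values are made up, and it assumes NumPy and SciPy are available): the constant that minimizes the sum of squared errors comes out as the sample mean, while the constant that minimizes the sum of absolute errors comes out as the sample median.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical sample; the large value pulls the mean far above the median.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Constant that minimizes the sum of squared errors -> the mean.
c_sq = minimize_scalar(lambda c: np.sum((y - c) ** 2)).x

# Constant that minimizes the sum of absolute errors -> the median.
c_abs = minimize_scalar(lambda c: np.sum(np.abs(y - c))).x

print(c_sq, np.mean(y))      # both ~22.0
print(c_abs, np.median(y))   # both ~3.0
```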
Historically, Laplace originally considered the maximum observed error as a measure of the correctness of a model. He soon moved to considering MAD instead. Because he was unable to solve either problem exactly, he turned to the differentiable MSE. He and Gauss (seemingly independently) derived the normal equations, a closed-form solution for this problem. Nowadays, solving the MAD problem is relatively easy by means of linear programming. As is well known, however, linear programming does not give a closed-form solution.
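As an illustration of the linear-programming route (a sketch on made-up data, not the historical derivation; the variable layout is my own choice), a least-absolute-deviations line can be fitted with scipy.optimize.linprog by introducing one auxiliary variable per residual:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data for a one-predictor model y ~ b0 + b1 * x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=40)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=40)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
n, p = X.shape

# Variables z = [beta (p entries), t (n entries)];
# minimize sum(t) subject to |y - X @ beta| <= t.
c = np.concatenate([np.zeros(p), np.ones(n)])
A_ub = np.block([[ X, -np.eye(n)],
                 [-X, -np.eye(n)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * p + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("LAD coefficients:", res.x[:p])       # roughly (2.0, 0.5)
```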
From an optimization perspective, both correspond to convex functions. However, the MSE is differentiable, which allows for gradient-based methods that are much more efficient than their non-differentiable counterparts; the MAD objective is not differentiable at x = 0.
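For instance, a rough sketch of plain gradient descent on the MSE objective for a simple linear model (synthetic data; the step size and iteration count are arbitrary illustrative choices):

```python
import numpy as np

# Hypothetical data for a simple linear model.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)
lr = 0.01
for _ in range(5000):
    resid = y - X @ beta
    grad = -2.0 * X.T @ resid / len(y)      # gradient of the mean squared error
    beta -= lr * grad

print("gradient-descent estimate:", beta)
print("closed-form (normal equations):", np.linalg.lstsq(X, y, rcond=None)[0])
```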
A further theoretical reason is that, in a Bayesian setting, when assuming uniform priors on the model parameters, minimizing the MSE corresponds to assuming normally distributed errors, which has been taken as a proof of the correctness of the method. Theorists like the normal distribution because they believe it is an empirical fact, while experimentalists like it because they believe it is a theoretical result.
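For reference, a sketch of that connection (a standard result, assuming i.i.d. Gaussian errors with variance σ² and a flat prior on the coefficient vector β):

$$ p(\beta \mid y) \;\propto\; p(y \mid \beta)\,p(\beta) \;\propto\; \prod_{i=1}^{n} \exp\!\left(-\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2}\right), $$

so maximizing the posterior is the same as minimizing \( \sum_i (y_i - x_i^\top \beta)^2 \); under these assumptions the least-squares solution is the maximum a posteriori estimate.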
A final reason why MSE may have gained the wide acceptance it enjoys is that it is based on the Euclidean distance (in fact it is the solution of a projection problem in Euclidean space), which is extremely intuitive given our geometrical reality.
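Concretely (standard notation, with X the design matrix), the least-squares fit minimizes the Euclidean norm of the residual, and the fitted values are the orthogonal projection of y onto the column space of X:

$$ \hat\beta = \arg\min_{\beta}\,\|y - X\beta\|_2^{2}, \qquad \hat y = X(X^\top X)^{-1}X^\top y. $$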
As an alternative explanation, consider the following intuition:
When minimizing an error, we must decide how to penalize these errors. The most straightforward approach would be a linearly proportional penalty function: each deviation from the mean is given a proportional corresponding penalty, so being twice as far from the mean results in twice the penalty. The more common approach is a squared proportional relationship between deviations from the mean and the corresponding penalty. This makes sure that the further you are away from the mean, the proportionally more you are penalized. Using this penalty function, outliers (far away from the mean) are deemed proportionally more informative than observations near the mean. To get a visualisation of this, you can simply plot the two penalty functions, as sketched below:
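For instance, a minimal plotting sketch of the two penalty functions (assuming matplotlib is available):

```python
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(-3, 3, 200)                       # deviation from the mean
plt.plot(e, np.abs(e), label="linearly proportional: |e|")
plt.plot(e, e ** 2, label="squared proportional: e^2")
plt.xlabel("deviation e")
plt.ylabel("penalty")
plt.legend()
plt.show()
```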
Now, especially when estimating regressions (e.g. OLS), different penalty functions will yield different results. Using the linearly proportional penalty function, the regression assigns less weight to outliers than when using the squared proportional penalty function. The absolute-deviation (MAD) criterion is therefore known to give a more robust estimator. In general, a robust estimator fits most of the data points well but 'ignores' outliers, whereas a least squares fit is pulled more towards the outliers (see the sketch after this paragraph for a comparison). Even though OLS is pretty much the standard, different penalty functions are certainly in use as well. As an example, you can take a look at Matlab's robustfit function, which lets you choose a different penalty (also called 'weight') function for your regression. The penalty functions include andrews, bisquare, cauchy, fair, huber, logistic, ols, talwar and welsch. Their corresponding expressions can be found on the website as well.
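If you do not have Matlab, here is a rough open-source analogue of that comparison on made-up data, using scikit-learn's HuberRegressor (the Huber weight function is one of the options robustfit offers) alongside an ordinary least squares fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

# Hypothetical data with one gross outlier at the largest x value.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=30)
y[np.argmax(x)] += 30.0                   # corrupt a single high-leverage point

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)        # robust alternative (Huber penalty)

print("OLS   slope:", ols.coef_[0])       # noticeably pulled by the outlier
print("Huber slope:", huber.coef_[0])     # stays closer to the true slope 0.5
```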
I hope that helps you in getting a bit more intuition for penalty functions :)
Update
If you have Matlab, I can recommend playing with Matlab's robustdemo, which was built specifically for the comparison of ordinary least squares to robust regression:
The demo allows you to drag individual points and immediately see the impact on both ordinary least squares and robust regression (which is perfect for teaching purposes!).
As another answer has explained, minimizing squared error is not the same as minimizing absolute error.
The reason minimizing squared error is preferred is that it guards against large errors better.
Say your employer's payroll department accidentally pays each of ten employees $50 less than required. That's an absolute error of $500. It's also an absolute error of $500 if the department pays just one employee $500 less. But in terms of squared error, it's 25,000 versus 250,000.
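A quick check of that arithmetic:

```python
many_small = [50] * 10   # ten employees underpaid by $50 each
one_large = [500]        # one employee underpaid by $500

print(sum(many_small), sum(one_large))                                  # 500 vs 500
print(sum(e ** 2 for e in many_small), sum(e ** 2 for e in one_large))  # 25000 vs 250000
```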
It's not always better to use squared error. If you have a data set with an extreme outlier due to a data acquisition error, minimizing squared error will pull the fit towards the extreme outlier much more than minimizing absolute error. That being said, it's usually better to use squared error.
In theory you could use any kind of loss function. The absolute and the squared loss functions just happen to be the most popular and the most intuitive ones (see this Wikipedia entry on loss functions).
As also explained in the Wikipedia entry, the choice of loss function depends on how you value deviations from your target. If all deviations are equally bad for you regardless of their sign, then you could use the absolute loss function. If deviations become worse for you the farther away you are from the optimum and you don't care whether the deviation is positive or negative, then the squared loss function is your easiest choice. But if none of the above definitions of loss fits your problem at hand, because e.g. small deviations are worse for you than big deviations, then you can choose a different loss function and try to solve the resulting minimization problem. However, the statistical properties of your solution might be hard to assess.
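As a sketch of that last point, here is an illustrative custom loss, minimized numerically, that charges overestimation twice as much per unit as underestimation (my own example, not one taken from the answer; it assumes NumPy and SciPy):

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # made-up sample

def custom_loss(c):
    resid = y - c
    # Overestimating (c > y, resid < 0) costs twice as much per unit
    # as underestimating (c < y, resid > 0).
    return np.sum(np.where(resid < 0, 2.0 * (-resid), resid))

c_best = minimize_scalar(custom_loss).x
print(c_best)   # lands at a lower constant (about 2) than the median (3)
```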