Is standardization needed before fitting logistic regression?

39

My question is: do we need to normalize the data set so that all variables are on the same scale, within [0,1], before fitting logistic regression? The formula is:

$$\frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}$$

My data set has 2 variables; they describe the same thing for two channels, but the volume is different. Say it's the number of customer visits in two stores, and the target is whether a customer purchases. A customer can visit both stores, or the first store twice and the second store once, before making a purchase, but the total number of customer visits for the first store is 10 times larger than for the second store. When I fit this logistic regression without standardization, coef(store1)=37, coef(store2)=13; if I standardize the data, then coef(store1)=133, coef(store2)=11. Something like this. Which approach makes more sense?

What if I am fitting a decision tree model? I know tree-structured models don't need standardization, since the model itself will adjust for it somehow. But I'm checking with you all.
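To make the tree remark concrete, here is a minimal sketch of a single-split "stump" that searches thresholds by Gini impurity. It is not a full decision tree implementation; the `visits` and `bought` values and the helper names `gini` and `best_split` are made up for illustration. Because min-max scaling preserves the ordering of the values, the same observations end up on each side of the chosen split.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(x, y):
    """Return (threshold, left_indices) minimizing the weighted Gini impurity."""
    values = sorted(set(x))
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo_v, hi_v in zip(values, values[1:]):
        threshold = (lo_v + hi_v) / 2.0
        left = [i for i, v in enumerate(x) if v <= threshold]
        right = [i for i, v in enumerate(x) if v > threshold]
        score = (len(left) * gini([y[i] for i in left])
                 + len(right) * gini([y[i] for i in right])) / len(x)
        if best is None or score < best[0]:
            best = (score, threshold, left)
    return best[1], best[2]


# Hypothetical data: visits to one store and whether the customer bought.
visits = [1, 2, 3, 5, 8, 13, 21, 34]
bought = [0, 0, 0, 1, 0, 1, 1, 1]

# Min-max normalization to [0, 1].
lo, hi = min(visits), max(visits)
scaled = [(v - lo) / (hi - lo) for v in visits]

thr_raw, left_raw = best_split(visits, bought)
thr_scaled, left_scaled = best_split(scaled, bought)

# The threshold value moves with the scale, but the partition does not.
print(left_raw == left_scaled)
```

Only the numeric value of the threshold changes; the grouping of customers, and hence the predictions, stay the same.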

user1946504
10
You don't need to normalize unless your regression is regularized. However, it sometimes helps interpretability and rarely hurts.
alex
3
$\frac{x_i - \bar{x}}{\text{sd}(x)}$?
Peter Flom - Reinstate Monica
1
@Peter, that's what I thought before, but I found an article, benetzkorn.com/2011/11/data-normalization-and-standardization/…, and it seems that normalization and standardization are different things. One makes the mean 0 and the variance 1; the other rescales each variable to a fixed range. That's where I got confused. Thanks for your reply.
user1946504
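The two transformations being contrasted in these comments can be sketched side by side. This is only an illustration: the function names `min_max_normalize` and `standardize` and the `visits` values are assumptions, and the sample standard deviation is used for the z-score.

```python
import statistics


def min_max_normalize(xs):
    """Rescale values to the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Center to mean 0 and scale to (sample) standard deviation 1."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [(x - mean) / sd for x in xs]


visits = [2, 4, 4, 6, 9]  # hypothetical visit counts

normalized = min_max_normalize(visits)
standardized = standardize(visits)

print(min(normalized), max(normalized))
print(statistics.mean(standardized), statistics.stdev(standardized))
```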
7
To me standardization makes interpretation much more difficult.
Frank Harrell
2
To clarify on what @alex said, scaling your data means the optimal regularisation factor C changes. So you need to choose C after standardising the data.
akxlr

Answers:

37

Standardization isn't required for logistic regression. The main goal of standardizing features is to help convergence of the technique used for optimization. For example, if you use Newton-Raphson to maximize the likelihood, standardizing the features makes the convergence faster. Otherwise, you can run your logistic regression without any standardization treatment on the features.

Aymen
Thanks for your reply. Does that mean standardization is preferred? Since we definitely want the model to converge, and when we have millions of variables, it's just easier to implement the standardization logic in the modeling pipeline than to tune the variables one by one as needed. Am I understanding this right?
user1946504
4
That depends on the purpose of the analysis. Modern software can handle pretty extreme data without standardizing. If there is a natural unit for each variable (years, euros, kg, etc.), then I would be hesitant to standardize, though I feel free to change the unit from kg to, for example, tons or grams whenever that makes more sense.
Maarten Buis
19

@Aymen is right, you don't need to normalize your data for logistic regression. (For more general information, it may help to read through this CV thread: When should you center your data & when should you standardize?; you might also note that your transformation is more commonly called 'normalizing', see: How to verify a distribution is normalized?) Let me address some other points in the question.

It is worth noting here that in logistic regression your coefficients indicate the effect of a one-unit change in your predictor variable on the log odds of 'success'. The effect of transforming a variable (such as by standardizing or normalizing) is to change what we are calling a 'unit' in the context of our model. Your raw x data varied across some number of units in the original metric. After you normalized, your data ranged from 0 to 1. That is, a change of one unit now means going from the lowest valued observation to the highest valued observation. The amount of increase in the log odds of success has not changed. From these facts, I suspect that your first variable (store1) spanned 133/37 ≈ 3.6 original units, and your second variable (store2) spanned only 11/13 ≈ 0.85 original units.
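The unit argument can be checked numerically. Below is a minimal sketch, assuming a single predictor plus an intercept, an unregularized fit, and a hand-written Newton-Raphson loop; the data and the helper names `fit_logistic` and `predict` are invented for illustration. The slope fitted on the min-max normalized variable should equal the raw slope times the raw range, while the fitted probabilities stay the same.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient g = X^T (y - p) and matrix H = X^T W X, accumulated by hand.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g, using the explicit 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def predict(b0, b1, x):
    """Fitted probabilities for each value in x."""
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]


# Hypothetical data: visits to one store and whether the customer bought.
visits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bought = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

lo, hi = min(visits), max(visits)
scaled = [(v - lo) / (hi - lo) for v in visits]

b0_raw, b1_raw = fit_logistic(visits, bought)
b0_scaled, b1_scaled = fit_logistic(scaled, bought)

probs_raw = predict(b0_raw, b1_raw, visits)
probs_scaled = predict(b0_scaled, b1_scaled, scaled)

# Ratio of slopes versus the range of the raw variable.
print(b1_scaled / b1_raw, hi - lo)
```

Read in reverse, this is the reasoning applied to the coefficients in the question: dividing the normalized coefficient by the raw one recovers the span of the raw variable.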

gung - Reinstate Monica
17

If you use logistic regression with LASSO or ridge regression (as the Weka Logistic class does), you should. As Hastie, Tibshirani and Friedman point out (page 82 of the PDF, or page 63 of the book):

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving.
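A small numeric sketch of that statement, using the simplest case I can think of: linear ridge regression with one predictor and no intercept, where the closed form is $\hat\beta = \sum x_i y_i / (\sum x_i^2 + \lambda)$. This is linear rather than logistic ridge, and the data, the penalty value, and the helper names `ridge_slope` and `fitted` are assumptions chosen for illustration. Without a penalty, rescaling the input leaves the fitted values unchanged; with a penalty, it does not.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge estimate for y ~ beta * x (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def fitted(x, y, lam):
    """Fitted values beta * x for the ridge estimate with penalty lam."""
    beta = ridge_slope(x, y, lam)
    return [beta * xi for xi in x]


# Hypothetical data; x_rescaled is the same predictor in different units.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
x_rescaled = [xi / 10.0 for xi in x]

# No penalty: the coefficient absorbs the change of units.
ols_raw = fitted(x, y, 0.0)
ols_rescaled = fitted(x_rescaled, y, 0.0)

# Same penalty on both scales: the fitted values now disagree.
ridge_raw = fitted(x, y, 1.0)
ridge_rescaled = fitted(x_rescaled, y, 1.0)

print(ols_raw[0], ols_rescaled[0])
print(ridge_raw[0], ridge_rescaled[0])
```

The same penalty shrinks the small-scale version of the predictor much harder, which is why the penalty strength has to be chosen after the scaling is fixed.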

Also this thread does.

eracle