I have a dataset consisting of four continuous independent variables and one categorical independent variable (three levels). On this dataset I want to perform a multiple linear regression with regularization (specifically Lasso/Ridge/Elastic Net).
Let's assume I use Python with pandas and sklearn as my tools. My sample dataset in Python looks like this:
df = pd.DataFrame(
data=[[4939.095037, 2.669234, 16.215479, 96.020074, 0.023664, 2],
[4386.046313, 5.043865, 40.406483, 201.266814, 0.478503, 2],
[4827.804643, 7.605047, 23.051799, 84.555656, 2.998284, 1],
[4277.870976, 6.447839, 37.703208, 156.311657, 0.569904, 2],
[2187.534673, 0.961723, 27.030330, 57.628101, 1.466355, 2],
[5978.240745, 7.402969, 73.276094, 106.040676, 3.125664, 0],
[8684.959385, 7.930216, 31.960732, 141.064857, 0.693754, 1],
[6533.489282, 3.633354, 34.480927, 134.808051, 4.912898, 0],
[8374.502249, 7.897356, 40.525879, 127.356577, 2.891337, 2],
[6488.086242, 7.520293, 27.731389, 86.830189, 0.560935, 2]],
columns=['a', 'b', 'c', 'd', 'e', 'cat'])
Now I use dummy coding to encode the categorical variable cat with k=3 levels into k−1=2 indicator columns. For this purpose I apply pd.get_dummies, but of course sklearn.preprocessing.OneHotEncoder yields the same results:
df_dc = pd.get_dummies(df, columns=['cat'], drop_first=True)
Now I scale the data by subtracting the mean and scaling to unit variance:
scaler = skl.preprocessing.StandardScaler().fit(df_dc)
df_scld = pd.DataFrame(data=scaler.transform(df_dc), columns=df_dc.columns)
The dummy encoded cat. var. is now quite "obfuscated", but still has unique values per level.
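Regarding question 1, here is a sketch of the alternative: scaling only the continuous columns and passing the dummies through untouched (the column names below are illustrative, mirroring the toy data above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df_dc = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [10.0, 20.0, 30.0, 40.0],
    'cat_1': [0, 1, 0, 0],
    'cat_2': [1, 0, 0, 1],
})

# scale only the continuous columns; the dummies are passed through unchanged
ct = ColumnTransformer(
    [('scale', StandardScaler(), ['a', 'b'])],
    remainder='passthrough',
)
out = ct.fit_transform(df_dc)  # scaled columns come first, then the dummies
```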
For simplicity, feeding this dataset into a polynomial transformation with interaction terms (degree 2 or 3) is omitted here, but I usually make use of this (either before or after standardization; see question 2). Then, depending on the dimensionality of the problem, the data goes into a PCA and finally into the linear regression model with regularization.
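The whole path described above (dummy coding > polynomial transformation > scaling > PCA > regularized regression) can be sketched as a sklearn Pipeline; the data and hyperparameters below are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 5)), columns=['a', 'b', 'c', 'd', 'e'])
X['cat'] = rng.integers(0, 3, size=50)
y = rng.normal(size=50)

# dummy coding happens before the pipeline, the remaining steps inside it
X_dc = pd.get_dummies(X, columns=['cat'], drop_first=True)

pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # keep 95% of the variance
    ('lasso', Lasso(alpha=0.1)),
])
pipe.fit(X_dc, y)
```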
My questions:
 Should I standardize/scale my data WITH or WITHOUT dummy coded cat. variables?
In other words: should the dummy-coded categorical variables be scaled or not? Googling and searching CV, there seem to be different opinions on this, but I can't find any authoritative answer. (Some say retaining the binary 0/1 coding is important; others say scaling doesn't hurt anything except human readability.)
Additional information: I'm talking mainly about standardization by subtracting the mean and scaling to unit variance. Of course min-max scaling won't affect binary variables. 🙂

What is the generally recommended preprocessing order in total?
I currently use either path no. 1 or no. 2. My last source (see below) suggests no. 3, but I highly doubt that…
1. Dummy coding > polynomial transformation > standardization/scaling > fit model
2. Dummy coding > standardization/scaling > polynomial transformation > fit model
3. Polynomial transformation > dummy coding > standardization/scaling > fit model

Is there any advantage/disadvantage of dropping the most frequent level of dummy encoded variables?
In my example this would be dropping level 2. Most algorithms simply drop the first level (here level 0), but I've read many times that dropping the most frequent level should be preferred.
Is dropping a level required at all when using a regularized regression method? General opinion seems to be yes, but reading the sklearn doc for the parameter drop, it seems like only non-regularized methods or neural networks require dropping the first level.
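On that last point, a small sketch of what the sklearn doc suggests: with a penalized model such as Ridge, keeping all k dummy columns does not break the fit, because the penalty resolves the collinearity with the intercept (toy data, illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
cat = rng.integers(0, 3, size=30)
X_full = pd.get_dummies(pd.Series(cat), prefix='cat')  # all k = 3 columns kept
y = cat + rng.normal(size=30)

# Ridge fits without trouble even though the dummies sum to 1
# (perfectly collinear with the intercept)
model = Ridge(alpha=1.0).fit(X_full, y)
```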
Some sources I've been looking up:
 CV: centering and scaling dummy variables
 CV: Significance of categorical predictor in logistic regression
towards data science: Preprocessing with sklearn: a complete and comprehensive guide. Honestly: I don't trust "towards data science". I've read so many false statements and explanations there that my first reaction towards articles on this site is mistrust…
Best Answer
We do standardization/normalization to put our features into a comparable range such as $[0,1]$ or $[-1,1]$. Suppose we are using min-max normalization to put the values in the range $[0,1]$. The answers to your questions are as follows.
Should I standardize/scale my data WITH or WITHOUT dummy coded cat. variables?
There is no clear yes/no answer to this question, but it is not mandatory to scale one-hot-encoded or dummy-encoded features. The intuition behind why is as follows.
Say you have two encoded vectors $A = [0, 1, 0]$ and $B = [1, 0, 0]$. Their magnitudes $\|A\| = \sqrt{0^2+1^2+0^2}$ and $\|B\| = \sqrt{1^2+0^2+0^2}$ are always equal to $1$, and the distance between them is $\sqrt{1^2 + 1^2} = \sqrt{2} \approx 1.41$. This is why you need not standardize: the magnitude of every one-hot-encoded vector is $1$ and the distance between any two distinct levels is $\sqrt{2}$, so the variance within a one-hot-encoded feature is not large enough to warrant standardizing it. When should you consider standardizing? When you have vectors like $[1,1,1,0,1,1]$ and $[0,0,0,0,0,1]$, in which the variability is very high.
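A quick numeric check of the magnitudes and the distance above (a minimal numpy sketch):

```python
import numpy as np

A = np.array([0, 1, 0])
B = np.array([1, 0, 0])

norm_A = np.linalg.norm(A)    # magnitude of A
norm_B = np.linalg.norm(B)    # magnitude of B
dist = np.linalg.norm(A - B)  # distance between the two levels
```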
What is the generally recommended preprocessing order in total?
You should do Dummy coding > polynomial transformation > standardization/scaling > fit model.
The reason for doing the polynomial featurization before standardization is quite simple. If you scale first, your variable will be in the range $[0,1]$ (with min-max scaling), and squaring values in $[0,1]$ makes the polynomial features even smaller, which can hurt the numerical stability of the model.
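A small sketch of that effect with min-max scaling (the values are chosen arbitrarily): squaring after scaling shrinks the feature toward zero, while squaring before scaling keeps the polynomial feature on the same $[0,1]$ scale:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

x = np.array([[1.0], [10.0], [100.0], [1000.0]])

# scale first, then square: the squared values shrink toward 0
x_scaled = MinMaxScaler().fit_transform(x)
sq_after = x_scaled ** 2

# square first, then scale: the squared feature is itself rescaled to [0, 1]
x_sq = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
sq_then_scaled = MinMaxScaler().fit_transform(x_sq)
```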
Your next questions are not clear to me. Please elaborate on them.
Hope this helps!