Machine Learning – Multiple Regression with Mixed Continuous/Categorical Variables

categorical-data, categorical-encoding, machine-learning, multiple-regression, standardization

I have a dataset consisting of 4 continuous independent variables and 1 categorical independent variable (three levels). On this dataset, I want to perform a multiple linear regression with regularization (specifically Lasso/Ridge/Elastic Net).

Let's assume I use Python with pandas and sklearn as my tools. My sample dataset in Python looks like this:

import pandas as pd

df = pd.DataFrame(
    data=[[4939.095037, 2.669234, 16.215479, 96.020074,  -0.023664, 2],
          [4386.046313, 5.043865, 40.406483, 201.266814, -0.478503, 2],
          [4827.804643, 7.605047, 23.051799, 84.555656,   2.998284, 1],
          [4277.870976, 6.447839, 37.703208, 156.311657, -0.569904, 2],
          [2187.534673, 0.961723, 27.030330, 57.628101,   1.466355, 2],
          [5978.240745, 7.402969, 73.276094, 106.040676,  3.125664, 0],
          [8684.959385, 7.930216, 31.960732, 141.064857, -0.693754, 1],
          [6533.489282, 3.633354, 34.480927, 134.808051, -4.912898, 0],
          [8374.502249, 7.897356, 40.525879, 127.356577,  2.891337, 2],
          [6488.086242, 7.520293, 27.731389, 86.830189,   0.560935, 2]],
    columns=['a', 'b', 'c', 'd', 'e', 'cat'])

Now I use dummy coding to encode the categorical variable cat with k=3 levels into k-1=2 columns. For this purpose I apply pd.get_dummies, but of course sklearn.preprocessing.OneHotEncoder yields the same results:

df_dc = pd.get_dummies(df, columns=['cat'], drop_first=True)

Now I standardize the data by subtracting the mean and scaling to unit variance:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(df_dc)
df_scld = pd.DataFrame(data=scaler.transform(df_dc), columns=df_dc.columns)
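If one decides not to scale the dummy columns, a ColumnTransformer can restrict the StandardScaler to the continuous features only (a sketch with a small made-up frame; the column names mimic the example above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df_dc = pd.DataFrame({
    'a': [4939.1, 4386.0, 4827.8],
    'b': [2.67, 5.04, 7.61],
    'cat_1': [0, 0, 1],
    'cat_2': [1, 1, 0],
})

continuous = ['a', 'b']
ct = ColumnTransformer(
    [('scale', StandardScaler(), continuous)],
    remainder='passthrough',  # leave the dummy columns untouched
)
scaled = ct.fit_transform(df_dc)
# output order: scaled continuous columns first, then the passthrough dummies
```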

The dummy-encoded categorical variable is now quite "obfuscated", but each level still maps to unique values.

For simplicity, I omit the polynomial transformation with interaction terms (degree 2 or 3) here, although I usually apply it (either before or after standardization; see question 2).
Then, depending on the dimensionality of the problem, the data goes into a PCA and finally into the linear regression model with regularization.
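The chain just described can be written as a single sklearn Pipeline (a sketch following order no. 1 below; the degree, variance threshold, and alpha are placeholder values, and dummy coding is assumed to happen beforehand, e.g. via pd.get_dummies):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # keep 95% of the variance
    ('model', Lasso(alpha=1.0)),
])

# fit on synthetic stand-in data (6 features, as after dummy coding above)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = rng.normal(size=50)
pipe.fit(X, y)
```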


My questions:

  1. Should I standardize/scale my data WITH or WITHOUT dummy coded cat. variables?

In other words: should the dummy-coded categorical variables be scaled or not? Googling and searching CV turns up differing opinions, but I can't find any "ascertained" consensus on this topic. (Some say retaining binary 0/1 values is important; others say scaling doesn't hurt, except for human readability.)
Additional information: I'm talking mainly about standardization by subtracting the mean and scaling to unit variance. Of course min-max scaling won't affect binary variables. 🙂

  2. What is the generally recommended preprocessing order in total?

    I currently use either path no. 1 or no. 2. My last source (see below) suggests no. 3, but I highly doubt that…

    1. Dummy coding -> polynomial transformation -> standardization/scaling -> fit model
    2. Dummy coding -> standardization/scaling -> polynomial transformation -> fit model
    3. polynomial transformation -> Dummy coding -> standardization/scaling -> fit model
  3. Is there any advantage/disadvantage of dropping the most frequent level of dummy encoded variables?

In my example this would be dropping level 2. Most algorithms simply drop the first level (here level 0), but I've read many times that dropping the most frequent level should be preferred.

  4. Is dropping a level required at all when using a regularized regression method?

General opinion seems to be yes, but reading the sklearn documentation for the drop parameter of OneHotEncoder, it seems that only non-regularized methods or neural networks require dropping a level.


Some sources I've been looking up:

Best Answer

We do standardization/normalization to put our features in the $[0,1]$ or $[-1,1]$ range. Let's suppose we are using min-max normalization to put the values in the range $[0,1]$. The answers to your questions are as follows.

  1. Should I standardize/scale my data WITH or WITHOUT dummy coded cat. variables?

    There is no clear yes/no answer to this question, but it is not mandatory to scale one-hot-encoded or dummy-encoded features. The intuition behind why it is not mandatory is as follows.
    Say you have two encoded vectors $A = [0, 1, 0]$ and $B = [1, 0, 0]$. Their magnitudes $|A| = \sqrt{0^2+1^2+0^2}$ and $|B| = \sqrt{1^2+0^2+0^2}$ always equal $1$, and the distance between them is $\sqrt{1^2 + 1^2} = \sqrt{2} \approx 1.41$. Since the magnitude of every one-hot-encoded vector is $1$ and the pairwise distances are bounded by $\sqrt{2}$, the variance in such a feature is not large enough to warrant standardization. When should you consider standardizing? When you have vectors like $[1,1,1,0,1,1]$ and $[0,0,0,0,0,1]$, in which the variability is very high.
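The norms and the distance above can be checked with a quick NumPy computation (a sketch):

```python
import numpy as np

A = np.array([0, 1, 0])
B = np.array([1, 0, 0])

# one-hot vectors always have unit norm...
assert np.linalg.norm(A) == 1.0 and np.linalg.norm(B) == 1.0
# ...and the distance between two distinct levels is sqrt(2)
dist = np.linalg.norm(A - B)
print(round(dist, 2))  # → 1.41
```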

  2. What is the generally recommended preprocessing order in total?

    You should do dummy coding -> polynomial transformation -> standardization/scaling -> fit model.
    The reason for doing polynomial featurization before standardization is quite simple: if you scale first (into $[0,1]$, per the min-max assumption above), squaring the values makes the polynomial features very small, which can hurt the numerical stability of the model with respect to those features.
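The shrinking effect described above can be illustrated numerically (a sketch assuming min-max scaling into $[0,1]$):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

x = np.array([[1.0], [10.0], [100.0], [1000.0]])

# scale first, then square: intermediate values in [0, 1] collapse toward 0
x_scaled = MinMaxScaler().fit_transform(x)
squared_after = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_scaled)

# square first, then scale: the quadratic term keeps its own spread in [0, 1]
x_squared = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
scaled_after = MinMaxScaler().fit_transform(x_squared)
```

With this input, squaring the scaled value 10 yields roughly 8e-5, orders of magnitude below the original feature's spread.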

Your remaining questions are not clear to me. Please elaborate on them.

Hope this helps!