Machine Learning – Multiple Regression with Mixed Continuous/Categorical Variables

categorical-data, categorical-encoding, machine-learning, multiple-regression, standardization

I have a dataset consisting of 4 continuous independent variables and 1 categorical independent variable (three levels). On this dataset, I want to perform a multiple linear regression with regularization (specifically Lasso/Ridge/Elastic Net).

Let's assume I use Python with pandas and sklearn as my tools. My sample dataset in Python looks like this:

import pandas as pd

df = pd.DataFrame(
    data=[[4939.095037, 2.669234, 16.215479, 96.020074,  -0.023664, 2],
          [4386.046313, 5.043865, 40.406483, 201.266814, -0.478503, 2],
          [4827.804643, 7.605047, 23.051799, 84.555656,   2.998284, 1],
          [4277.870976, 6.447839, 37.703208, 156.311657, -0.569904, 2],
          [2187.534673, 0.961723, 27.030330, 57.628101,   1.466355, 2],
          [5978.240745, 7.402969, 73.276094, 106.040676,  3.125664, 0],
          [8684.959385, 7.930216, 31.960732, 141.064857, -0.693754, 1],
          [6533.489282, 3.633354, 34.480927, 134.808051, -4.912898, 0],
          [8374.502249, 7.897356, 40.525879, 127.356577,  2.891337, 2],
          [6488.086242, 7.520293, 27.731389, 86.830189,   0.560935, 2]],
    columns=['a', 'b', 'c', 'd', 'e', 'cat'])

Now I use dummy coding to encode the categorical variable cat with k=3 levels into k-1=2 columns. For this purpose I apply pd.get_dummies, but of course sklearn.preprocessing.OneHotEncoder yields the same results:

df_dc = pd.get_dummies(df, columns=['cat'], drop_first=True)

Now I standardize the data by subtracting the mean and scaling to unit variance:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(df_dc)
df_scld = pd.DataFrame(data=scaler.transform(df_dc), columns=df_dc.columns)
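If one decides not to scale the dummy columns, a ColumnTransformer can restrict the StandardScaler to the continuous features only (a sketch with a small made-up frame; the column names mimic the example above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df_dc = pd.DataFrame({
    'a': [4939.1, 4386.0, 4827.8],
    'b': [2.67, 5.04, 7.61],
    'cat_1': [0, 0, 1],
    'cat_2': [1, 1, 0],
})

continuous = ['a', 'b']
ct = ColumnTransformer(
    [('scale', StandardScaler(), continuous)],
    remainder='passthrough',  # leave the dummy columns untouched
)
scaled = ct.fit_transform(df_dc)
# output order: scaled continuous columns first, then the passthrough dummies
```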

The dummy-encoded categorical variable is now quite "obfuscated", but each level still maps to unique values.

For simplicity, I omit the polynomial transformation with interaction terms (degree 2 or 3) here, although I usually apply it (either before or after standardization; see question 2).
Then, depending on the dimensionality of the problem, the data goes into a PCA and finally into the linear regression model with regularization.
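The chain just described can be written as a single sklearn Pipeline (a sketch following order no. 1 below; the degree, variance threshold, and alpha are placeholder values, and dummy coding is assumed to happen beforehand, e.g. via pd.get_dummies):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # keep 95% of the variance
    ('model', Lasso(alpha=1.0)),
])

# fit on synthetic stand-in data (6 features, as after dummy coding above)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = rng.normal(size=50)
pipe.fit(X, y)
```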


My questions:

  1. Should I standardize/scale my data WITH or WITHOUT dummy coded cat. variables?

In other words: should the dummy-coded categorical variables be scaled or not? Googling and searching CV turns up differing opinions, but I can't find any "ascertained" consensus on this topic. (Some say retaining binary 0/1 values is important; others say scaling doesn't hurt, except for human readability.)
Additional information: I'm talking mainly about standardization by subtracting the mean and scaling to unit variance. Of course min-max scaling won't affect binary variables. 🙂

  2. What is the generally recommended preprocessing order in total?

    I currently use either path no. 1 or no. 2. My last source (see below) suggests no. 3, but I highly doubt that…

    1. Dummy coding -> polynomial transformation -> standardization/scaling -> fit model
    2. Dummy coding -> standardization/scaling -> polynomial transformation -> fit model
    3. polynomial transformation -> Dummy coding -> standardization/scaling -> fit model
  3. Is there any advantage/disadvantage of dropping the most frequent level of dummy encoded variables?

In my example this would be dropping level 2. Most algorithms simply drop the first level (here level 0), but I've read many times that dropping the most frequent level should be preferred.

  4. Is dropping a level required at all when using a regularized regression method?

General opinion seems to be yes, but reading the sklearn documentation for the drop parameter of OneHotEncoder, it seems that only non-regularized methods or neural networks require dropping a level.


Some sources I've been looking up:

Best Answer

We do standardization/normalization to put our features in the $[0,1]$ or $[-1,1]$ range. Let's suppose we are using min-max normalization to put the values in the range $[0,1]$. The answers to your questions are as follows.

  1. Should I standardize/scale my data WITH or WITHOUT dummy coded cat. variables?

    There is no clear yes/no answer to this question, but it is not mandatory to scale one-hot-encoded or dummy-encoded features. The intuition behind why it is not mandatory is as follows.
    Say you have two encoded vectors $A = [0, 1, 0]$ and $B = [1, 0, 0]$. Their magnitudes $|A| = \sqrt{0^2+1^2+0^2}$ and $|B| = \sqrt{1^2+0^2+0^2}$ always equal $1$, and the distance between them is $\sqrt{1^2 + 1^2} = \sqrt{2} \approx 1.41$. Since the magnitude of every one-hot-encoded vector is $1$ and the pairwise distances are bounded by $\sqrt{2}$, the variance in such a feature is not large enough to warrant standardization. When should you consider standardizing? When you have vectors like $[1,1,1,0,1,1]$ and $[0,0,0,0,0,1]$, in which the variability is very high.
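The norms and the distance above can be checked with a quick NumPy computation (a sketch):

```python
import numpy as np

A = np.array([0, 1, 0])
B = np.array([1, 0, 0])

# one-hot vectors always have unit norm...
assert np.linalg.norm(A) == 1.0 and np.linalg.norm(B) == 1.0
# ...and the distance between two distinct levels is sqrt(2)
dist = np.linalg.norm(A - B)
print(round(dist, 2))  # → 1.41
```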

  2. What is the generally recommended preprocessing order in total?

    You should do dummy coding -> polynomial transformation -> standardization/scaling -> fit model.
    The reason for doing polynomial featurization before standardization is quite simple: if you scale first (into $[0,1]$, per the min-max assumption above), squaring the values makes the polynomial features very small, which can hurt the numerical stability of the model with respect to those features.
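The shrinking effect described above can be illustrated numerically (a sketch assuming min-max scaling into $[0,1]$):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

x = np.array([[1.0], [10.0], [100.0], [1000.0]])

# scale first, then square: intermediate values in [0, 1] collapse toward 0
x_scaled = MinMaxScaler().fit_transform(x)
squared_after = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_scaled)

# square first, then scale: the quadratic term keeps its own spread in [0, 1]
x_squared = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
scaled_after = MinMaxScaler().fit_transform(x_squared)
```

With this input, squaring the scaled value 10 yields roughly 8e-5, orders of magnitude below the original feature's spread.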

Your remaining questions are not clear to me. Please elaborate on them.

Hope this helps!