Regression technique for data composed of categorical explanatory variables and a continuous response variable

categorical-data, model-selection, predictive-models, regression

I suppose one way to characterize data is by the combination of variable types that make it up:

 Continuous/Continuous  |  Continuous/Discrete
 -----------------------|---------------------
 Discrete/Discrete      |  Discrete/Continuous

Each of the four cells in the 2×2 table above consists of two descriptors: one for the Explanatory Variables (EV), which we'll assume are all of the same type, and one for the Response Variable (RV):

Explanatory Variable Type / Response Variable Type

The rows represent the EV type; within each row, the two cells differ in the RV type.

Nearly all of the data I see can be placed in one of the two cells in the first row.

So for instance, OLS Regression is a suitable model type for data in row1/col1; and for data in row1/col2, Logistic Regression is an appropriate model choice.

It's the second row, and in particular row2/col2, that my Question is directed to.

I'm aware of a few regression techniques, like ordinal regression, which handle a particular type of discrete variable (ordinal or ranked: 1st, 2nd, 3rd, ...), but I am interested in techniques for handling discrete data more generally, given that most categorical variables have no implicit ordering among their values.

For instance:

Sex | City_of_Residence | Car_Make&Model | Married? | DUI? | Prior_3P_Claims?
 F  |   Cleveland       |  Chevy Camaro  |   No     |  No  |   Yes

And the Response Variable is continuous, e.g., a price, as when building a model to predict the quotes offered by major auto insurers.

Best Answer

The choice of regression type depends only on the dependent variable. When the dependent variable is continuous, you can consider OLS regression, regardless of whether the independent variables are categorical, continuous, or both. Ordinal independent variables are a bit tricky: sometimes they are treated as continuous, sometimes as categorical.

If the independent variables are categorical, there are a variety of methods including effect coding and dummy coding to deal with them.
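As a rough illustration of the two coding schemes just mentioned, here is a minimal sketch in plain Python. The data (`city`, `quote`), the reference level, and the helper names `dummy_code`, `effect_code`, and `ols` are all hypothetical and chosen only for this example; in practice a statistics package would do the coding and fitting for you.

```python
def dummy_code(values, reference):
    """Dummy coding: intercept plus one 0/1 column per non-reference level."""
    levels = [lv for lv in sorted(set(values)) if lv != reference]
    rows = [[1.0] + [1.0 if v == lv else 0.0 for lv in levels] for v in values]
    return levels, rows


def effect_code(values, reference):
    """Effect coding: as dummy coding, but the reference level is -1 in every column."""
    levels = [lv for lv in sorted(set(values)) if lv != reference]
    rows = []
    for v in values:
        if v == reference:
            rows.append([1.0] + [-1.0] * len(levels))
        else:
            rows.append([1.0] + [1.0 if v == lv else 0.0 for lv in levels])
    return levels, rows


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [a / A[col][col] for a in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical data: one categorical predictor, continuous response.
city = ["Cleveland", "Cleveland", "Akron", "Akron", "Toledo", "Toledo"]
quote = [1200.0, 1300.0, 900.0, 1000.0, 1100.0, 1000.0]

levels, X_dummy = dummy_code(city, reference="Akron")
b_dummy = ols(X_dummy, quote)    # intercept = mean of the reference group

_, X_effect = effect_code(city, reference="Akron")
b_effect = ols(X_effect, quote)  # intercept = mean of the group means

print(levels, b_dummy, b_effect)
```

With dummy coding, each coefficient is the difference between a level's mean and the reference level's mean; with effect coding, each coefficient is the difference between a level's mean and the unweighted mean of the group means. Both codings give the same fitted values.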

Of course, OLS makes assumptions beyond the idea that the dependent variable is continuous (or nearly so). For example, it assumes that the residuals are independent and normally distributed with mean zero and constant variance, $\sim N(0, \sigma^2)$.
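A quick, informal way to look at the residuals when the only predictor is categorical is sketched below. It relies on the fact that, with an intercept and a fully coded single factor, the OLS fitted value for each observation is its group mean. The data and variable names are hypothetical, and this is only a rough sanity check, not a formal test.

```python
from statistics import mean, pvariance

# Hypothetical data: one categorical predictor, continuous response.
city = ["Cleveland", "Cleveland", "Akron", "Akron", "Toledo", "Toledo"]
quote = [1200.0, 1300.0, 900.0, 1000.0, 1100.0, 1000.0]

# Group the response by level of the categorical variable.
groups = {}
for c, q in zip(city, quote):
    groups.setdefault(c, []).append(q)

# For a single categorical predictor, the OLS fitted value is the group mean.
fitted = {c: mean(qs) for c, qs in groups.items()}

# Residual = observed minus fitted.
residuals = [q - fitted[c] for c, q in zip(city, quote)]
overall_mean = mean(residuals)

# Compare residual spread across groups; very unequal values would be a warning sign.
spread_by_group = {
    c: pvariance([q - fitted[c] for q in qs]) for c, qs in groups.items()
}

print(overall_mean, spread_by_group)
```

In a real analysis you would also plot the residuals (for example a histogram or a normal Q-Q plot) rather than rely on summary numbers alone.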