Let's think about ordinary linear regression, and to make it concrete, say we are trying to predict people's heights. When you regress height against just an intercept term and no predictors, the estimated intercept will be the height averaged over all the people in your sample. Let's call this term $\beta_0^{\text{no predictor}}$.
Now we want to add a predictor for sex, so we create an indicator variable that takes the value 0 when the sampled person is male and 1 when the person is female. When we fit this model, we get estimates for the intercept, $\beta_0^{\text{male reference}}$, and for the coefficient of the sex variable, $\beta_1^{\text{male reference}}$. The estimated intercept is no longer the average height of everybody but the average height of males, and the coefficient of the sex variable is the difference in average height between females and males.
Now consider coding the indicator variable differently, so that the sex variable takes the value 0 if the person is female and 1 if the person is male. In this specification of the model we get the intercept and coefficient estimates $\beta_0^{\text{female reference}}$ and $\beta_1^{\text{female reference}}$. Now the intercept $\beta_0^{\text{female reference}}$ is the average height of females, and the coefficient is the difference in average height between males and females. So
$$
\begin{align}
\beta_1^{\text{male reference}} &= -\beta_1^{\text{female reference}}\\
\beta_0^{\text{male reference}} + \beta_1^{\text{male reference}} &= \beta_0^{\text{female reference}}\\
\beta_0^{\text{female reference}} + \beta_1^{\text{female reference}} &= \beta_0^{\text{male reference}}
\end{align}
$$
So by changing how we coded the indicator variable, we changed the values of both the intercept term and the coefficient term, and this is exactly what we should want. With a multi-level categorical variable you will see the same kinds of changes as you specify different reference levels, i.e. different choices of which level the indicators code as 0.
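A minimal sketch in Python can verify the relationships above numerically. The heights below are made-up illustrative values, not real data; with a single 0/1 indicator, the OLS intercept equals the reference-group mean and the slope equals the difference in group means, so the two codings satisfy the equations exactly.

```python
male_heights = [175.0, 180.0, 178.0]      # hypothetical sample, in cm
female_heights = [162.0, 165.0, 168.0]    # hypothetical sample, in cm

def ols_simple(x, y):
    """Least-squares fit of y = b0 + b1*x for a single predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

heights = male_heights + female_heights

# Coding 1: male = 0 (reference), female = 1
x_male_ref = [0] * len(male_heights) + [1] * len(female_heights)
b0_male, b1_male = ols_simple(x_male_ref, heights)

# Coding 2: female = 0 (reference), male = 1
x_female_ref = [1] * len(male_heights) + [0] * len(female_heights)
b0_female, b1_female = ols_simple(x_female_ref, heights)

# In each coding the intercept is the reference group's mean height,
# and the slope flips sign when the reference level flips.
```

Running this, `b0_male` comes out as the male mean, `b0_female` as the female mean, and the three displayed identities hold up to floating-point error.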
In the binary indicator case the p-value of the $\beta_1$ term does not change with the coding, but with a multi-level categorical variable it will, because the p-value depends on the size of the effect, and the average difference between a group and the reference group will generally change with the choice of reference group. For example, suppose we have three groups: babies, teenagers, and adults. The average height difference between adults and teenagers is smaller than that between adults and babies, so the p-value for the coefficient comparing adults to teenagers should be larger than the p-value for the coefficient comparing adults to babies.
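The reference-level point can be sketched with a toy calculation. The group means below are invented for illustration; with dummy coding, each coefficient is simply that group's mean minus the reference group's mean, so the "adult" effect size depends entirely on which group is omitted.

```python
# Made-up group mean heights in cm, purely for illustration.
group_means = {"baby": 70.0, "teenager": 160.0, "adult": 175.0}

def dummy_coefficients(means, reference):
    """Intercept and dummy coefficients when `reference` is the omitted level.

    With one-hot (treatment) coding, the intercept is the reference group's
    mean and each dummy coefficient is that group's mean minus the intercept.
    """
    intercept = means[reference]
    coefs = {g: m - intercept for g, m in means.items() if g != reference}
    return intercept, coefs

b0_teen_ref, coefs_teen_ref = dummy_coefficients(group_means, "teenager")
b0_baby_ref, coefs_baby_ref = dummy_coefficients(group_means, "baby")

# The "adult" coefficient is 15 cm against the teenager reference but
# 105 cm against the baby reference: a much larger effect, and hence
# (other things equal) a much smaller p-value.
```

This is only the point-estimate side of the story; the p-value also depends on the group variances and sample sizes, but the shifting effect size is the dominant reason it changes with the reference level.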
Yes, you need to separate the categories into 0/1 variables, omitting one of them. In R, this would be done with as.factor(paymentmode). In Stata, it is done with i.paymentmode (which may have to be prefixed by xi: in older versions). Some people believe in different coding schemes for these categorical variables, but really it is just a matter of how you are going to read your output, and it has no effect on the estimation procedure itself.
Q1: For Y/N variables you can, but it won't make any difference except to give you control over whether Y or N is the base category in the default model fitting. For a 40-category variable your model matrix will end up pretty big, it's true. More importantly, it will require a lot of data to fit: combinatorially speaking, you need information about all combinations of the independent variables, and even with the data you have, there will be a lot of interpolation and reliance on model assumptions.
Q2: The machine-learning folk may have some ideas here. I dimly remember something about chi-squared and mutual-information measures for selecting variables. It is also possible you could get the model-fitting process to do the selection for you by using the Lasso (L1 regularization, a.k.a. a Laplace prior) on the coefficients, although I'm not sure how well current implementations scale.
Q3: If you take a biased sample then you can do a classic rare-events design analysis. King and Zeng (2001) is a good resource for how to do so: it is very simple and amounts to a simple intercept correction. So yes, this is a good idea - just don't forget to correct for the sampling scheme.
Q4: User ids are potential grouping variables, so you could, if you wanted, aggregate the data by user (or other group). That could also make the estimation problem easier by moving from Bernoulli to Binomial assumptions about the dependent variable.
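The aggregation idea can be sketched as follows: collapse per-event 0/1 outcomes (Bernoulli) into per-user (successes, trials) counts (Binomial). The user ids and click outcomes below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, outcome), where outcome is 1 for a
# "success" (e.g. a click) and 0 otherwise.
events = [
    ("u1", 1), ("u1", 0), ("u1", 0),
    ("u2", 1), ("u2", 1),
    ("u3", 0),
]

counts = defaultdict(lambda: [0, 0])  # user_id -> [successes, trials]
for user, outcome in events:
    counts[user][0] += outcome
    counts[user][1] += 1

aggregated = {user: tuple(c) for user, c in counts.items()}
# Each user now contributes one Binomial observation (successes, trials)
# instead of several Bernoulli rows, shrinking the dataset.
```

A GLM fit on these (successes, trials) pairs with a logit link gives the same coefficient estimates as the row-level Bernoulli fit, just on far fewer records.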
Q5: Any classification model will do, frankly: support vector machines, decision trees, or anything else should work, provided they scale to the size of your data and/or you can apply the rare-events correction to them. Regularized logistic regression is a good starting point, though. You might also find the literature on text classification a useful place to look.