Solved – In binary logistic regression, which method is better: enter, or one of the forward or backward elimination methods?

logisticregression

I am analysing a set of data where I try to predict an outcome (Level of women’s nutrition knowledge; whether it is High or Low) by using certain covariates (demographic characteristics of the sample). I have already done Chi-square analysis and now I am progressing to binary logistic regression.

To avoid an overly complicated presentation of the results through the inclusion of a large number of non-significant variables, only the demographic factors found in the Chi-square tests to be significantly associated (p<0.05) with women’s knowledge were entered into the logistic regression analysis. These factors are: Prior pregnancies, Planned pregnancy, Education level, Household income, First language, and Having a health and/or nutrition related qualification.

  1. Dependent variable = Total nutrition knowledge score of pregnant women, coded as:
    a. Low: 0
    b. High: 1

  2. Independent variables:
    a. Prior pregnancies has 3 levels, coded as: Two or more: 0; None: 1; One: 2
    b. Planned pregnancy has 2 levels, coded as: No: 0; Yes: 1
    c. Education level has 4 levels, coded as: Some high school or less: 0; High school completed: 1; TAFE: 2; Tertiary education: 3
    d. Household income has 3 levels, coded as: < $25000/yr: 0; $25000-50000/yr: 1; > $50000/yr: 2
    e. First language has 2 levels, coded as: No: 0; Yes: 1
    f. Having a health and/or nutrition related qualification has 2 levels, coded as: No: 0; Yes: 1

(As seen above, the levels of the categorical independent variables were coded 0 for the category of lowest interest and 1 for the category of greatest interest in the case of dichotomous variables, and similarly for the other variables.)

My Questions:
1- Which category should I designate as the reference category: the first one (the one with the lowest value) or the last one?

2- Consider a variable such as Age group, which has 4 classes, where the first class has the lowest value but the last class does not have the highest value. For example, the age groups are as follows:
Under 20 yrs (has the lowest nutrition knowledge score),
20-29 yrs,
30-39 yrs (has the highest nutrition knowledge score) and
40 yrs and above (the knowledge score decreases)
How can I choose the category with the highest value as the reference category when I run the binary logistic regression analysis, given that it is neither the first nor the last category?

3- In binary logistic regression, which method is better: enter, or one of the forward or backward elimination methods? What is the difference between them? On what basis should I choose the method?

Best Answer

You have engaged in dichotomania. Categorizing age, education, knowledge, and other continuous or ordinal variables will result in a host of problems. What is the rawest form of your variables?

Neither forward selection nor backward elimination works as advertised, and you did not provide any motivation for the use of variable selection. It doesn't solve any problem for you and creates new problems such as meaningless $P$-values and confidence intervals. There is nothing wrong with having "insignificant" variables in a model.

What do you mean by "progressing to binary logistic regression"? Did the $\chi^2$ analysis inform model specification? This would be even worse than forward variable selection.
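
To make the concern about $\chi^2$ screening concrete, here is a rough simulation sketch in plain Python. It is not based on your data: the sample size, the number of candidate predictors, and the function names (`chi2_pvalue_2x2`, `screening_false_positive_rate`) are all assumptions chosen for illustration. Every predictor is pure noise, unrelated to the outcome, and the code counts how often at least one of them still passes a p<0.05 screen.

```python
import math
import random


def chi2_pvalue_2x2(x, y):
    """Pearson chi-square p-value (1 df, no continuity correction)
    for two equal-length lists of 0/1 values."""
    n = len(x)
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A margin is empty, so the test is undefined; treat as no evidence.
        return 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(stat / 2))


def screening_false_positive_rate(n_obs=200, n_noise=20, n_reps=300,
                                  alpha=0.05, seed=0):
    """Fraction of simulated studies in which at least one pure-noise
    binary predictor passes the univariate chi-square screen."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        # Outcome (e.g. High/Low knowledge), generated independently of all predictors.
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        passed = 0
        for _ in range(n_noise):
            x = [rng.randint(0, 1) for _ in range(n_obs)]
            if chi2_pvalue_2x2(x, y) < alpha:
                passed += 1
        if passed > 0:
            hits += 1
    return hits / n_reps


rate = screening_false_positive_rate()
print(f"Studies where noise passed the screen: {rate:.2f}")
```

With 20 independent tests at the 0.05 level, you would expect roughly $1 - 0.95^{20} \approx 0.64$ of such studies to "find" at least one association. Any variable carried into the logistic model this way has already been selected for a small $P$-value, so its $P$-value and confidence interval in the final model can no longer be read at face value.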