How to regress household income on three factors

regression

I have to regress family income (faminc; in dollars) on husband's educational attainment (he; in years), wife's educational attainment (we; in years), and the number of children under 6 years old in the household (kl6), using Stata.

(The file contains only these four variables.)

I use OLS to estimate a model of the form:
$$faminc = b_1 + b_2 \cdot he + b_3 \cdot we + b_4 \cdot kl6 + \epsilon$$

      Source |       SS       df       MS              Number of obs =     430
-------------+------------------------------           F(  3,   426) =   28.77
       Model |  1.4002e+11     3  4.6673e+10           Prob > F      =  0.0000
    Residual |  6.9100e+11   426  1.6221e+09           R-squared     =  0.1685
-------------+------------------------------           Adj R-squared =  0.1626
       Total |  8.3102e+11   429  1.9371e+09           Root MSE      =   40275

------------------------------------------------------------------------------
      faminc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          he |   3185.882   795.4493     4.01   0.000     1622.388    4749.376
          we |   4637.415   1059.177     4.38   0.000     2555.551    6719.279
         kl6 |  -8372.704   4343.059    -1.93   0.055     -16909.2    163.7893
       _cons |  -5998.224   11161.51    -0.54   0.591    -27936.72    15940.27

I have some questions:

1) The regression yields $b_4 < 0$. Is this plausible? That is, does a family with more young children really earn less income?

2) Is this model good enough? Should I use a natural-log transformation or add a dummy variable to improve it?

Best Answer

The fact that the p-value is 5.5% only means that the coefficient on kl6 is not statistically significant at the 5% level, but it is significant at the 6% level, and more so at the 10% level. The "5% rule" has no scientific justification whatsoever; it has historical justification and perhaps social justification, but that is another matter and a very large discussion.
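To make the point about significance levels concrete, here is a minimal Python sketch that recomputes the t statistic for kl6 from the coefficient and standard error in the table and compares an approximate p-value with a few thresholds. It uses a standard normal approximation, which is an assumption on my part: Stata uses the t distribution with 426 degrees of freedom, so the figure differs slightly from the reported 0.055.

```python
from statistics import NormalDist

# Values copied from the kl6 row of the regression table in the question.
coef_kl6 = -8372.704
se_kl6 = 4343.059

# t statistic = coefficient / standard error.
t_stat = coef_kl6 / se_kl6

# Two-sided p-value under a standard normal approximation.
# Assumption: with 426 residual degrees of freedom the t distribution is
# close to the normal, so this only approximates Stata's reported p-value.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.2f}, approximate two-sided p = {p_approx:.3f}")

# The same evidence gives a different verdict depending on the chosen level.
for alpha in (0.05, 0.06, 0.10):
    verdict = "reject" if p_approx < alpha else "do not reject"
    print(f"alpha = {alpha:.2f}: {verdict} H0: b4 = 0")
```

Nothing about the data changes between the three lines of output; only the conventional cutoff does.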

Interpretation-wise, the negative coefficient gives us the marginal effect of the number of young children on household income after the effect of education has been controlled for (through the other two regressors). So what does it say? That more young children tend to reduce household income. This may appear counter-intuitive, because one could think that "more children provide stronger incentives to earn more income in order to provide for the larger family". Yes, but more children also place greater demands on the parents' time, which must be devoted to the children, leaving less time available to work and earn income.

I would suggest trying a regression that also includes kl6 squared. If the squared regressor obtains a negative coefficient and the plain kl6 a positive one, then you are possibly looking at a non-monotonic relation (i.e. there is an income-maximizing number of young children, below or above which income tends to be lower).
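As a rough illustration of how to read such a quadratic specification, suppose the model includes the terms $b_4 \cdot kl6 + b_5 \cdot kl6^2$ (the name $b_5$ is my own notation for the new coefficient). The marginal effect of kl6 is then $b_4 + 2 b_5 \cdot kl6$, and it changes sign at $kl6 = -b_4 / (2 b_5)$. The Python sketch below evaluates both; the coefficient values are hypothetical placeholders, not estimates from the data.

```python
def marginal_effect(b_lin: float, b_sq: float, kl6: float) -> float:
    """Derivative of b_lin*kl6 + b_sq*kl6**2 with respect to kl6."""
    return b_lin + 2 * b_sq * kl6


def turning_point(b_lin: float, b_sq: float) -> float:
    """Value of kl6 at which the marginal effect equals zero."""
    if b_sq == 0:
        raise ValueError("no turning point without a quadratic term")
    return -b_lin / (2 * b_sq)


# Hypothetical coefficients for illustration only (NOT estimated from the data):
# positive linear term, negative squared term, i.e. an inverted U.
b_lin, b_sq = 4000.0, -2000.0

peak = turning_point(b_lin, b_sq)
print(f"income-maximizing kl6 under these made-up values: {peak}")

# The marginal effect is positive below the turning point and negative above it.
for k in range(4):
    print(f"kl6 = {k}: marginal effect = {marginal_effect(b_lin, b_sq, k):.0f}")
```

If the estimated turning point falls outside the range of kl6 observed in the sample, the relation is effectively monotonic over the data, even when the two coefficients have opposite signs.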

PS: "How can I keep a regressor in a regression?" is the mother question that leads to data tampering in those ingenious ways only statistics can offer. I would suggest not asking yourself such a question again. The regression results are what they are. Statistics should not be the brush with which we paint the world in the colors we want.
