Solved – How to regress household income on three factors

regression

I have to regress family income (faminc; in dollars) onto husband's educational attainment (he; in years), wife's educational attainment (we; in years), and number of children less than 6 years old in household (kl6) using Stata.

(The file contains only the four variables above.)

I use OLS to estimate a model in the form:
$$\text{faminc} = b_1 + b_2 \cdot \text{he} + b_3 \cdot \text{we} + b_4 \cdot \text{kl6} + \epsilon$$

      Source |       SS       df       MS              Number of obs =     430
-------------+------------------------------           F(  3,   426) =   28.77
       Model |  1.4002e+11     3  4.6673e+10           Prob > F      =  0.0000
    Residual |  6.9100e+11   426  1.6221e+09           R-squared     =  0.1685
-------------+------------------------------           Adj R-squared =  0.1626
       Total |  8.3102e+11   429  1.9371e+09           Root MSE      =   40275

------------------------------------------------------------------------------
      faminc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          he |   3185.882   795.4493     4.01   0.000     1622.388    4749.376
          we |   4637.415   1059.177     4.38   0.000     2555.551    6719.279
         kl6 |  -8372.704   4343.059    -1.93   0.055     -16909.2    163.7893
       _cons |  -5998.224   11161.51    -0.54   0.591    -27936.72    15940.27

I have some questions:

1) The regression yields $b_4<0$. Is this plausible? That is, does a family with more young children really earn less income?

2) Is this model good enough? Should I use a natural-logarithm transformation or add a dummy variable to improve it?

Best Answer

The fact that the p-value is 5.5% only means that the coefficient of kl6 is not statistically significant at the 5% level. It is, however, significant at the 6% level, and therefore also at the 10% level. The "5% rule" has no scientific justification whatsoever: it has historical justification and perhaps social justification, but that is another matter and a very large discussion.
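As a rough check of these numbers, the t statistic and an approximate p-value can be recomputed from the kl6 row of the output. This is only a sketch in Python using the standard library. It substitutes the standard normal for the t distribution with 426 degrees of freedom (an assumption made to avoid extra dependencies), so the p-value comes out slightly below the one Stata reports.

```python
import math

# Values copied from the kl6 row of the Stata output above
coef = -8372.704
std_err = 4343.059

# The t statistic is the coefficient divided by its standard error
t_stat = coef / std_err

# Two-sided p-value under a standard normal approximation (assumption:
# with 426 residual degrees of freedom the t distribution is very close
# to the normal; Stata's exact figure uses the t distribution)
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.2f}, approximate p = {p_approx:.3f}")
for alpha in (0.05, 0.06, 0.10):
    print(f"significant at {alpha:.0%}: {p_approx < alpha}")
```

The same coefficient is thus "insignificant" or "significant" depending only on which threshold one picks, which is the point made above.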

Interpretation-wise, the negative coefficient gives us the marginal effect of the number of young children on household income after the effect of education has been controlled for (through the presence of the other two regressors). So what does it say? That more young children tend to reduce household income. This may appear counter-intuitive, because one could think "more children provide stronger incentives to earn more income in order to provide for the larger family". Yes, but more children also mean greater demands on the parents' time, which must be devoted to the children, and so less time is available to work and earn income.

I would suggest trying a regression that additionally includes kl6 squared. If this squared regressor obtains a negative coefficient and the plain kl6 obtains a positive coefficient, then you are possibly looking at a non-monotonic relation (i.e. there is an income-maximizing number of young children, below or above which income tends to be lower).
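To make the suggestion concrete, here is a sketch of the extended model and the turning point it implies. The coefficient name $b_5$ is introduced here purely for illustration; it does not appear in the output above.

$$\text{faminc} = b_1 + b_2 \cdot \text{he} + b_3 \cdot \text{we} + b_4 \cdot \text{kl6} + b_5 \cdot \text{kl6}^2 + \epsilon$$

Holding he and we fixed, the marginal effect of kl6 and the point where it vanishes are

$$\frac{\partial\, \text{faminc}}{\partial\, \text{kl6}} = b_4 + 2 b_5 \cdot \text{kl6} = 0 \quad\Longrightarrow\quad \text{kl6}^{*} = -\frac{b_4}{2 b_5}$$

With $b_4 > 0$ and $b_5 < 0$, $\text{kl6}^{*}$ is positive and marks a maximum of expected income. Whether such a turning point exists, and whether it falls within the observed range of kl6, is something only the re-estimated regression can show.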

PS: "How can I keep a regressor in a regression?" is the mother-question that leads to data-tampering in those ingenious ways only statistics can offer. I would suggest not asking yourself such a question again. The regression results are what they are. Statistics should not be the brush with which we paint the world in the colors we want.

Related Solutions

Logistic Regression – How to Regress When Logit is Infinity for y=1

The question characterizes logistic regression as

$$\text{logit}(y) = \beta_0 + \beta_1 x + \varepsilon$$

and proposes to fit this model using least squares. It points out that because $y$ is a binary ($0$-$1$) variable, $\text{logit}(y)$ is undefined (or should be considered infinite), which is--to say the least--problematic!

The resolution of this conundrum is to avoid taking the logit of $y$ but instead apply its inverse, the logistic function

$$f(x) = \frac{1}{1 + \exp(-x)},$$

to the right hand side. Because $y$ on the left hand side still is a random variable with possible outcomes $0$ and $1$, it must be a Bernoulli variable: that is, what we need to know about it is the chance that $y=1$, written $\Pr(y=1).$ Therefore we make another attempt in the form

$$\Pr(y=1) = f(\beta_0 + \beta_1 x).$$

This is an example of a generalized linear model. Its parameters $\beta_0$ and $\beta_1$ are typically (but not necessarily) found using Maximum Likelihood.
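For readers who want to see what "found using Maximum Likelihood" can look like in practice, here is a minimal Python sketch that maximizes the Bernoulli log-likelihood by Newton-Raphson for a single predictor. It is not the method any particular package is confirmed to use; the function name `fit_logistic`, the iteration count, and the toy data are all hypothetical and chosen only for illustration.

```python
import math

def logistic(z):
    """Inverse of the logit: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iterations=25):
    """Estimate (b0, b1) in Pr(y=1) = logistic(b0 + b1*x) by
    Newton-Raphson on the Bernoulli log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [logistic(b0 + b1 * xi) for xi in x]
        # Gradient (score) of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian, with weights p * (1 - p)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system for the update
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: y is mostly 1 for small x and mostly 0 for large x,
# with some overlap so that the maximum likelihood estimate is finite
x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8]
y = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

b0, b1 = fit_logistic(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Note that $y$ itself is never transformed: the logistic function is applied only to the linear predictor, so the infinite logit of the original question never arises.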

To understand this better, many people find it instructive to create synthetic datasets according to this model (instead of analyzing actual data, where the true model is unknown). We will look at how that might be coded in R, which is well suited to expressing and simulating statistical models. First, though, let's inspect its results.

The data are shown as jittered points (they have been randomly shifted slightly in the horizontal direction to resolve overlaps). The true underlying probability function is plotted in solid red. The probability function fit using Maximum Likelihood is plotted in dashed gray.

You can see that where the red curve is high--which means the chance of $y=1$ is high--most of the data are $1$'s, whereas where the red curve drops to low levels, most of the data are $0$'s. The height of the curve stipulates the chance that the response will be a $1$. In logistic regression, the curve usually has the sigmoidal shape of the logistic function, while the data are always either at $y=1$ or $y=0$.

Reading over the code, which is written for expressive clarity, will help make these descriptions precise.

#
# Synthesize some data.
#
set.seed(17)                        # Allows results to be reproduced exactly
n <- 8                              # Number of distinct x values
k <- 4                              # Number of independent obs's for each x
x <- rep(1:n, k)                    # Independent values
beta <- c(3, -1)                    # True parameters
logistic <- function(x) 1 / (1 + exp(-x))
probability <- function(x, b) logistic(b[1] + b[2]*x)
y <- rbinom(n*k, size=1, prob=probability(x, beta))   # Simulated data
#
# Fit the data using a logistic regression.
#
summary(fit <- glm(y ~ x, family=binomial(link="logit")))
#
# Plot the data, the true underlying probability function, and the fitted one.
#
jitter <- runif(n*k, -1/3, 1/3)     # Displaces points to resolve overlaps
plot(x+jitter, y, type="p", xlab="x", ylab="y", main="Data with true and fitted models")
curve(probability(x, beta), col="Red", lwd=2, add=TRUE)
curve(probability(x, coef(fit)), col="Gray", lwd=2, lty=2, add=TRUE)
