Solved – Please help me choose variables for a multiple linear regression analysis

multiple regression, predictor

I'm shaky on statistics and I could really use some help.

I have a set of data at the zip code level. This data includes the distance from each zip code to the nearest pediatric surgeon. My goal is to determine what factors are associated with this distance. I have data at the zip code level on race (latino/hispanic, non-lat/hsp white, non-lat/hsp black etc), median income, percent uninsured children, rurality index.

My question is, if I were to run a multiple linear regression, which of these variables would I include? Off the top of my head, I expect people farther from surgeons to be poorer, more rural, less insured. But I have no idea how to tackle race. Also, do I run two different regressions, one for race, and one for these economic factors, or can I run one regression for all these variables?

Your help is much much appreciated.

My current Minitab output:

Term                                  Coef   SE Coef  T-Value  P-Value   VIF
Constant                             59.23      1.08    55.01    0.000
Unisured Perc                       0.1699    0.0265     6.42    0.000  1.08
MeanHouseIncome                    -0.000248  0.000010   -24.11    0.000  1.33
Percent; HISPANIC OR LATINO AND    -0.0128    0.0181    -0.71    0.479  1.19
Perc_NonHispLat - BlackAAalone     -0.4059    0.0175   -23.18    0.000  1.15
Perc_NonHispLat - Asian alone      -0.9212    0.0599   -15.39    0.000  1.31
Perc_NonHispLat - American Indi     0.4887    0.0352    13.87    0.000  1.04
PercRural                          0.39551   0.00731    54.13    0.000  1.52

New output after removing the constant: (image not available)

Best Answer

Let $y_i$ be our outcome variable, distance to a surgeon.

For simplicity, let's imagine there are only three races: white, black, and asian.

Let $s^w_i$ be the share of region $i$ that is white, $s^b_i$ be the share of region $i$ that is black, and $s^a_i$ be the share of region $i$ that is asian. Let's imagine our data is pretty sensible so that for each region, all the shares add up (note they may not add up precisely because of rounding):

$$s^w_i + s^b_i + s^a_i \approx 1$$

Let's say we want to estimate four parameters $\beta_0$, $\beta_w$, $\beta_b$, and $\beta_a$ in the following linear model:

$$ y_i = \beta_0 + \beta_w s^w_i + \beta_b s^b_i + \beta_a s^a_i + \epsilon_i $$

We actually can't do that!

What's wrong? Intuitively, the problem is that we only have THREE free parameters, not FOUR. Since the shares sum to 1 for every region, the constant column is exactly the sum of the three share variables (perfect multicollinearity), so the data cannot tell the baseline effect $\beta_0$ apart from a common shift in the race effects $\beta_w$, $\beta_b$, and $\beta_a$.
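
To spell this out, assume the shares sum to exactly 1 in every region. Then for any number $c$, shifting the constant up by $c$ and every race coefficient down by $c$ leaves the right-hand side unchanged:

$$ \beta_0 + \beta_w s^w_i + \beta_b s^b_i + \beta_a s^a_i = (\beta_0 + c) + (\beta_w - c) s^w_i + (\beta_b - c) s^b_i + (\beta_a - c) s^a_i $$

because the extra terms are $c - c\,(s^w_i + s^b_i + s^a_i) = c - c = 0$. Infinitely many coefficient sets therefore fit the data equally well, and least squares has no unique answer.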

Your stats package may not even let you run that regression. And if it does let you run the regression, it will either toss out one of the races (i.e. give you solution (2) below) or give you entirely bizarre results.
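
A quick way to see this on a computer is to check the rank of the design matrix. The sketch below uses simulated shares (the Dirichlet draw, sample size, and variable names are assumptions for illustration, not the asker's data) and also shows what happens when the shares are rounded and only sum to approximately 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated race shares for n regions; every row sums to 1.
# Columns: white, black, asian (illustrative data only).
shares = rng.dirichlet([5.0, 2.0, 1.0], size=n)

# Constant AND all three shares: four columns, but only three are independent.
X_full = np.column_stack([np.ones(n), shares])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one share (white) leaves three independent columns.
X_drop = np.column_stack([np.ones(n), shares[:, 1:]])
rank_drop = np.linalg.matrix_rank(X_drop)

# Shares rounded to two decimals no longer sum to exactly 1, so the
# regression runs, but the design matrix is close to singular.
X_round = np.column_stack([np.ones(n), np.round(shares, 2)])
cond_round = np.linalg.cond(X_round)
cond_drop = np.linalg.cond(X_drop)

print("rank with constant + all shares:", rank_full)
print("rank with constant, one share dropped:", rank_drop)
print("condition number, rounded shares + constant:", cond_round)
print("condition number, one share dropped:", cond_drop)
```

The four-column matrix has rank 3, which is the "three unknowns, not four" problem in matrix form. With rounded shares the fit is technically possible, but the much larger condition number signals unstable coefficients, which is one way the "bizarre results" can arise.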

Solution 1: Remove the constant

You could estimate the regression model:

$$ y_i = \beta_w s^w_i + \beta_b s^b_i + \beta_a s^a_i + \epsilon_i $$

In this case $\beta_w$ would be the expected distance to a surgeon if a region was all white. $\beta_b$ would be the expected distance to a surgeon if a region was all black, etc...

Solution 2: Exclude 1 race and estimate effects relative to that race.

This is probably a better idea. Run the regression:

$$ y_i = \beta_0 + \beta_{1} s^b_i + \beta_{2} s^a_i + \epsilon_i $$

In this case, $\beta_0$ would be the mean distance to a surgeon if a region was neither black nor asian (i.e. all white). $\beta_{1}$ would be the additional distance to a surgeon for an all-black region compared with an all-white region, and $\beta_{2}$ would be the same comparison for an all-asian region.

How are solution 1 and solution 2 linked?

Approaches (1) and (2) are the same model written two ways. If you ran both regressions (and $s_i^w + s^b_i + s^a_i = 1$ holds precisely in the data for each region $i$), you would get the same fitted values, and the estimated coefficients would satisfy:

$$\beta_w = \beta_0 \quad \quad \beta_b = \beta_0 + \beta_1 \quad \quad \beta_a = \beta_0 + \beta_2$$
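
These identities can be checked numerically. The sketch below fits both regressions by ordinary least squares on simulated data; the "true" distances, noise level, and sample size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated shares (white, black, asian); each row sums to 1.
shares = rng.dirichlet([5.0, 2.0, 1.0], size=n)

# Hypothetical expected distances for all-white, all-black, all-asian regions.
true_means = np.array([20.0, 12.0, 8.0])
y = shares @ true_means + rng.normal(scale=2.0, size=n)

# Solution 1: all three shares, no constant.
b_nc, *_ = np.linalg.lstsq(shares, y, rcond=None)
beta_w, beta_b, beta_a = b_nc

# Solution 2: constant plus black and asian shares (white is the reference).
X2 = np.column_stack([np.ones(n), shares[:, 1], shares[:, 2]])
b_c, *_ = np.linalg.lstsq(X2, y, rcond=None)
beta_0, beta_1, beta_2 = b_c

# Compare the two parameterizations.
lhs = np.array([beta_w, beta_b, beta_a])
rhs = np.array([beta_0, beta_0 + beta_1, beta_0 + beta_2])
coefficients_match = np.allclose(lhs, rhs)
print("Solution 1 coefficients:", lhs)
print("Rebuilt from solution 2:", rhs)
print("Match:", coefficients_match)
```

The two sets of numbers agree up to floating-point error, so the choice between the approaches is about interpretation and convenience, not about fit.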

Summary:

You want to include all your economic variables and either (1) include all your racial groups and don't include a constant or (2) exclude 1 racial group and run a standard regression.

(2) is probably the more robust approach.
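
As a rough sketch of the summary, the code below adds two economic covariates to both versions of the model. The data are simulated, and the names `income` and `rural` are placeholders rather than the asker's actual columns. The point is that the coefficients on the economic variables come out the same whichever way race is coded.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Simulated inputs (illustrative only).
shares = rng.dirichlet([5.0, 2.0, 1.0], size=n)   # white, black, asian
income = rng.normal(60.0, 15.0, size=n)           # median income, in thousands
rural = rng.uniform(0.0, 100.0, size=n)           # percent rural
econ = np.column_stack([income, rural])

# Made-up data-generating process for distance to a surgeon.
y = (shares @ np.array([30.0, 22.0, 18.0])
     - 0.25 * income + 0.4 * rural
     + rng.normal(scale=3.0, size=n))

# Option (1): all race shares, no constant, plus economic variables.
X1 = np.column_stack([shares, econ])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Option (2): constant, white share dropped, plus economic variables.
X2 = np.column_stack([np.ones(n), shares[:, 1:], econ])
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)

# The last two coefficients in each fit belong to income and rural.
same_econ = np.allclose(b1[3:], b2[3:])
same_fit = np.allclose(X1 @ b1, X2 @ b2)
print("income, rural (option 1):", b1[3:])
print("income, rural (option 2):", b2[3:])
print("same economic coefficients:", same_econ, "| same fitted values:", same_fit)
```

So a single regression with all variables is fine; only the race coefficients change meaning (levels in option 1, differences from the reference group in option 2).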
