Solved – How to do multiple regression with limited experience and (hopefully) excel

excelgeneralized linear modellogisticmultiple regressionmultivariate analysis

I am doing a study of how legal need relates to a number of predictors.

Outcome Variable: Legal Need (Yes or No)
Possible Predictors: Age, Gender, Race, Ethnicity, Language, Clinic, Insurance Status.
Predictors based on bivariate analysis: Age, Language, Clinic, Insurance Status. All ORs were around 1.6 and were significant, typical 95CI between 1.2-2.0.
Number of subjects: 8803

I suspect that the one variable that is going to fall out of a multiple regression analysis would be Clinic, as the Asthma Clinic (vs the well-visit clinic) has in general, older children who don't speak English and are more on Medicaid (the other predictors from bivariate analysis).

Knowing all of this, I still can't figure out how to run a multiple regression in Excel.

I keep getting Pearson's coefficients like 0.06, which it then says is significant.

Either way I think I need to go beyond this, maybe a stepwise forward or backward analysis?

It's difficult for me to grasp when the outcome is a 0 (no) or 1 (yes), this doesn't really result in anything linear when the percent "yes's" are typically in the 15-30% range.

Am I missing something? Besides years of stats training?

Best Answer

Number of subjects: 8803 Outcome Variable: Legal Need (Yes or No)

See the end, but I will deal with side issues first.

How to do multiple regression with ... excel?

The easiest way to do multiple regression in Excel is to install the Data Analysis Toolpack, which should have come with Excel but by default isn't installed.

I keep getting Pearson's coefficients like 0.06, which it then says is significant

Why does this surprise you?

(And by a 'Pearson coefficient' are you referring to $R$ or $R^2$?)

Either way I think I need to go beyond this, maybe a stepwise forward or backward analysis?

Why? What are you trying to achieve?

It's difficult for me to grasp when the outcome is a 0 (no) or 1 (yes), this doesn't really result in anything linear when the percent "yes's" are typically in the 15-30% range. Am I missing something?

You should start with asking about this issue. It's by far the most critical.

Multiple regression is not the right tool for this problem. You likely need to use logistic regression, which would model a function of the probability of a 'yes' in terms of a linear function of the predictors.

So where multiple regression would fit a model like $Y_i = x_i^T\beta + \epsilon_i$ with independent, identically distributed $\epsilon_i \sim N(0,\sigma^2)$, a logistic regression would model the $Y_i$ as independent Bernoulli random variables, where if $p_i = P(Y_i=1)$, then the probability is modelled as a function of predictors like so: $\log(\frac{p_i}{1-p_i}) = x_i^T\beta$.

See Bernoulli distribution (and also Binomial distribution of which the Bernoulli is a special case).

Logistic regression is a special case of Generalized Linear Models (GLMs).

As an example of the resources that are available, John Fox has some online course materials here that discuss linear models (such as multiple regression) and then simple logit models (simple case of what we're talking about) before going onto the multiple-regression version. Then later, his materials go on to GLMs more generally.

If you want a text that covers the territory, maybe this one, though there are lots of good ones out there; different books suit different people.

You're not going to manage to implement logistic regression in vanilla Excel unless you know a lot more about things than you do (though there are no doubt Add-In packages you can buy).

You're best off using something actually designed for this sort of analysis.

It will take some time and effort to get up to speed (if you don't understand multiple regression well, you'll need to get some background before you'll find logistic regression models making much sense), and then you'll need to learn a package that will fit the model.

R is free and does this sort of thing pretty easily. If you google logistic regression in R you can even find a few videos as well as various beginner* level documents.

* beginner to logistic regression, not beginner to statistics

In R a multiple regression model could be fitted like this:

 lm(outcome ~ age + gender + race)    

(or whatever variables are in your model, for some outcome suited to multiple regression)

Whereas a logistic regression model could be fitted like so:

 glm(legal.need ~ age + gender + race, family=binomial)    

- as you see, it's not actually much different to specify the model.

If you decide to go the R route, note that there are lots of beginner resources for R (google that italicized phrase for example). R also has its own tag on stackoverflow

Finally, there are some other questions here on stats.stackexchange.com that might help a little, such as this one:

Logistic vs linear regression

If you have some specific followup questions, I'd be happy to try to tackle them.