Solved – Coefficient p-values for linear regression on large dataset

anova, categorical data, hypothesis testing, regression

I am running an experiment. There are two features: Gender and Area.

Gender has two levels (Male/Female) and Area has four (North/South/East/West).

The dependent variable is height. My sample size is really big (about a million people).

My goal is to find out whether/how much of the difference in height between men and women is due to the area they live in.

Because the groups are of different sizes, and I can't guarantee normality of the distribution or homogeneity of variance, I didn't think I could use ANOVA. Instead, I was trying to use linear regression with dummy variables and check whether the interaction terms were significant. I got this idea by reading https://en.wikipedia.org/wiki/Dummy_variable_%28statistics%29#Interactions_among_dummy_variables.

Is this the right approach, and if so, how do I find out whether the individual coefficients are significant when the data set is really large? I have read some examples involving popular statistics packages that gave a std. error, t-statistic and p-value, but given how much data I am using I wasn't sure they would be able to handle it. I know there are faster ways to get just the coefficients (or at least to estimate them accurately, e.g. with stochastic gradient descent), but I don't know how to get the standard errors.

Best Answer

There are other approaches, but without random assignment you are going to have a hard time establishing causality no matter what statistical approach you use. What you can find out is the degree of relationship between area lived in and height. Using regression (in this context) does not free you from the assumption of normality of residuals. Given that your predictors are categorical, ANOVA may be simpler (although a regression with dummy variables can perform the same analysis as an ANOVA). If you are very worried about homogeneity of variance, you can always go back and run post-hoc tests that do not make that assumption (e.g. Welch's t tests with appropriate p-value corrections for multiple comparisons). With the sample size you are talking about, I wouldn't be worried about statistical power (that is, about making Type II errors), because practically significant effects on height are likely to survive even brutal p-value corrections.
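As a rough illustration of the post-hoc idea, here is a minimal sketch of Welch's unequal-variance t test with a Bonferroni correction, using only the Python standard library. The height lists are made up, the helper names (`welch_t`, `two_sided_p_normal`, `bonferroni`) are hypothetical, and the p-value uses a normal approximation, which is an assumption that is only reasonable for large samples such as the one in the question, not for the tiny lists used here to keep the example short.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t test on two samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def two_sided_p_normal(t):
    """Two-sided p-value from the normal approximation (assumes large df)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up heights in cm, purely illustrative
north_m = [178.0, 181.5, 175.2, 180.1, 177.4, 179.9]
north_f = [165.3, 168.0, 162.7, 166.1, 164.9, 167.2]
south_m = [176.1, 179.0, 174.8, 177.7, 178.3, 175.5]
south_f = [164.0, 166.5, 163.2, 165.8, 162.9, 164.4]

results = {
    "North: M vs F": welch_t(north_m, north_f),
    "South: M vs F": welch_t(south_m, south_f),
}
raw_p = [two_sided_p_normal(t) for t, _ in results.values()]
adj_p = bonferroni(raw_p)
for (name, (t, df)), p in zip(results.items(), adj_p):
    print(f"{name}: t={t:.2f}, df={df:.1f}, Bonferroni p={p:.3g}")
```

With real data, a statistics package would normally supply the exact t distribution instead of the normal approximation used above.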

You can use dummy variables and look at the interaction. Each coefficient should have a significance value attached to it, which tells you about the comparison between that group and the groups you coded as 0 (the reference levels). For example, if Area 1 was coded as 0 and Male was coded as 0, then the coefficient for Gender x Area2 would tell you whether the effect of gender in Area 2 differed from the effect of gender in Area 1, and so on. To answer all of your potential questions, you may need to refit the model a few times with different dummy codings, that is, with different reference levels.
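To make the interpretation of the interaction coefficient concrete, the sketch below relies on a standard fact: when every Gender x Area interaction is included, the model is saturated, so the least-squares coefficients are just differences of cell means. The data are made up, only two of the four areas are used to keep it short, and all variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# Made-up (gender, area, height_cm) rows, purely illustrative
rows = [
    ("Male", "North", 180.0), ("Male", "North", 178.0),
    ("Female", "North", 166.0), ("Female", "North", 164.0),
    ("Male", "South", 177.0), ("Male", "South", 175.0),
    ("Female", "South", 167.0), ("Female", "South", 165.0),
]

# Group heights by cell; in a saturated model the fitted values
# are the cell means.
cells = defaultdict(list)
for gender, area, height in rows:
    cells[(gender, area)].append(height)
m = {cell: mean(values) for cell, values in cells.items()}

# Coding 1: reference levels (coded 0) are Male and North
intercept = m[("Male", "North")]                # mean height of males in North
b_female = m[("Female", "North")] - intercept   # gender gap within North
b_south = m[("Male", "South")] - intercept      # South vs North for males
# Interaction: how the gender gap in South differs from the gap in North
b_female_x_south = (m[("Female", "South")] - m[("Male", "South")]) - b_female

# Coding 2: same data, but South is now the reference area
b_female_ref_south = m[("Female", "South")] - m[("Male", "South")]
b_female_x_north = (m[("Female", "North")] - m[("Male", "North")]) - b_female_ref_south

print(intercept, b_female, b_south, b_female_x_south)
print(b_female_ref_south, b_female_x_north)
```

Changing the reference area flips the sign of the interaction and changes what the plain Gender coefficient means (the gap in South rather than in North), which is why different codings answer different questions.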

A million cases might seem like a lot, but non-student versions of most mainline statistical packages will be able to handle a million cases without any problem (citations: SPSS, SAS, R). I know that doesn't really address the tail end of your question there, but for now my answer is that the premise is wrong.
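One way to see why the row count is not a computational obstacle here: in the fully interacted dummy model, the least-squares coefficients and their usual standard errors depend only on each cell's count, mean and sum of squared deviations, and those can be accumulated in a single pass. The following is a minimal sketch under that assumption, with synthetic data; `stream_rows` and `interaction` are hypothetical helpers, and the standard error is the ordinary OLS one, so it assumes a common residual variance across cells.

```python
import math
import random
from collections import defaultdict


def stream_rows(n, seed=0):
    """Yield synthetic (gender, area, height_cm) rows one at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        gender = rng.choice(("Male", "Female"))
        area = rng.choice(("North", "South", "East", "West"))
        base = 178.0 if gender == "Male" else 165.0  # no true interaction
        yield gender, area, rng.gauss(base, 7.0)


# Per-cell [count, running mean, sum of squared deviations] (Welford's update)
stats = defaultdict(lambda: [0, 0.0, 0.0])
for gender, area, height in stream_rows(200_000):
    s = stats[(gender, area)]
    s[0] += 1
    delta = height - s[1]
    s[1] += delta / s[0]
    s[2] += delta * (height - s[1])

n_total = sum(s[0] for s in stats.values())
sse = sum(s[2] for s in stats.values())
# Residual variance of the saturated model: one parameter per cell
sigma2 = sse / (n_total - len(stats))


def interaction(area, ref_area="North"):
    """Female x area coefficient, its OLS standard error, and t statistic."""
    keys = [("Female", area), ("Male", area), ("Female", ref_area), ("Male", ref_area)]
    (nfa, mfa, _), (nma, mma, _), (nfr, mfr, _), (nmr, mmr, _) = (stats[k] for k in keys)
    coef = (mfa - mma) - (mfr - mmr)  # difference of gender gaps
    se = math.sqrt(sigma2 * (1 / nfa + 1 / nma + 1 / nfr + 1 / nmr))
    return coef, se, coef / se


for area in ("South", "East", "West"):
    coef, se, t = interaction(area)
    print(f"Female x {area}: coef={coef:.3f}, se={se:.3f}, t={t:.2f}")
```

Memory use stays constant no matter how many rows are streamed, since only eight cells are tracked.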