Bootstrapping with population data (dependent variable is not normally distributed)

bootstrap

Thank you in advance for help! I am conducting a study using General Linear Modeling on the distribution of financial aid at a college. I am not looking to project my findings onto a larger population. I only want to make statements/conclusions about this one school. The dataset has over 1000 students, each of whom was offered a scholarship.

My dependent variable (the scholarship award) is not normally distributed. The residuals of the model are also non-normal, and I have tried taking the log of my dependent variable… still not normally distributed.

Bootstrapping was suggested to me as a possible solution. Can someone help me understand if bootstrapping might change the nature of my data? Can I interpret the output of the GLM parameter estimates the same way?

Thank you very much.

Best Answer

I'm not sure what you mean by "change the nature of your data"; bootstrapping won't make your response normally distributed, if that's what you mean. What it does is give you an alternative method of finding confidence intervals, if you don't have an easy way to construct confidence intervals directly or you don't trust the confidence intervals that your method is spitting out for some reason (e.g., one of the assumptions that affects the size of the confidence interval is not met).

In bootstrapping, you create some large number $B$ (10,000, say) of new datasets, each drawn with replacement from your original dataset and of the same size. Each new dataset therefore contains only observations from your original dataset, but some observations may appear several times and others not at all. Then you run your method on each of these new datasets. This gives you a sort of empirical "bootstrap distribution" of what your parameter estimates will look like for datasets like your original one; you then use this bootstrap distribution to create confidence intervals.
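As a rough illustration of the resampling step, here is a minimal Python sketch. It assumes the general linear model is fit by ordinary least squares, and everything in it is made up for the example: the function name `bootstrap_coefficients`, the synthetic `income` and `award` variables, and the choice of 2,000 resamples (kept small so it runs quickly). It is not your data or your model, just the shape of the procedure.

```python
import numpy as np

def bootstrap_coefficients(X, y, n_boot=2000, seed=0):
    """Case-resampling bootstrap of least-squares coefficients.

    X: (n, p) design matrix (include a column of ones for the intercept).
    y: (n,) response.
    Returns an (n_boot, p) array: one row of coefficient estimates per resample.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    estimates = np.empty((n_boot, p))
    for b in range(n_boot):
        # Draw n row indices with replacement: some rows repeat, some are left out.
        idx = rng.integers(0, n, size=n)
        # Refit the same model on the resampled rows.
        estimates[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    return estimates

# Synthetic stand-in data (hypothetical): 1000 students, one predictor,
# and deliberately skewed, non-normal errors with mean zero.
rng = np.random.default_rng(42)
n = 1000
income = rng.normal(50.0, 15.0, size=n)
noise = rng.exponential(2.0, size=n) - 2.0
award = 10.0 - 0.1 * income + noise

X = np.column_stack([np.ones(n), income])
boot = bootstrap_coefficients(X, award)

# Each column of `boot` is the bootstrap distribution of one coefficient.
print(boot.shape)
print(boot.mean(axis=0))
```

Each column of `boot` plays the role of the "bootstrap distribution" for one parameter: the intercept in column 0 and the slope in column 1.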

What this means for you is that you ignore the confidence intervals in the output of your GLM. You interpret the parameter estimates the same way, but you build confidence intervals from the parameter estimates of these 10,000 bootstrapped datasets rather than using the theoretical ones, which may not be justified if your assumptions are violated.
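One common way to turn the bootstrap estimates into an interval is the percentile method; this is an assumption on my part about which interval to use, since there are others (basic, BCa, and so on). Writing $\hat\theta^{*}_{1}, \dots, \hat\theta^{*}_{B}$ for the $B$ bootstrap estimates of a single parameter and $\hat\theta^{*}_{(q)}$ for their empirical $q$-quantile, the $100(1-\alpha)\%$ percentile interval is

$$
\left[\, \hat\theta^{*}_{(\alpha/2)},\ \hat\theta^{*}_{(1-\alpha/2)} \,\right]
$$

With $B = 10{,}000$ and $\alpha = 0.05$, that amounts to sorting the 10,000 estimates and taking roughly the 250th and 9,750th values as the endpoints.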
