Solved – Linear Regression with a small sample size

regression, small-sample

For part of my Master's project I want to analyse the relationship between the age of some log piles and their invertebrate diversity.

I think that I will need to do some sort of regression analysis for this, but my sample sizes for each log pile age are rather small and uneven. I have 4 piles that are 1 year old, 8 that are 2 years old, and 17 that are 3 years old.

I'm not very experienced with statistics, so a lot of the nuances of the various methods are lost on me. I know that sample size is important for regression analysis, but I'm stuck with these very small sample sizes. Given this, is there anything I should keep in mind going forward, or is there an alternative that won't be affected so much by the small sample sizes?

Best Answer

To test the hypothesis that "The presence of log piles improves biodiversity at the site" you need to have something not mentioned in your question: sites without log piles. The null hypothesis that you would subject to statistical testing would then be that invertebrate diversity is unaffected by the presence of a log pile at a site.

For testing that null hypothesis, it's the number of sites with and without log piles that will provide power, not the number of log piles per se. So if each site with a log pile has exactly 1 pile, you already have 29 sites with log piles. If you have 29 appropriately matched sites without log piles, then you could be in pretty good shape. The hypothesis might be adequately assessed by a simple t-test for the two types of site. It would in principle be possible to include other factors like the age of the pile or (if a site might have more than 1 pile) the number of piles per site, with a multiple regression.
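As a rough illustration of the two-group comparison, here is a minimal sketch of the t statistic in plain Python. It uses Welch's variant of the t-test (which does not assume equal variances in the two groups) rather than the pooled-variance version. The diversity scores are made up purely for illustration, and the names `welch_t`, `with_piles` and `without_piles` are hypothetical.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    # Sample variances (n - 1 denominator), each divided by its group size.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df

# Made-up diversity scores, one value per site. NOT real data.
with_piles = [12.0, 15.0, 11.0, 14.0, 16.0, 13.0]
without_piles = [9.0, 10.0, 12.0, 8.0, 11.0, 10.0]

t_stat, dof = welch_t(with_piles, without_piles)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; a routine such as `scipy.stats.ttest_ind` with `equal_var=False` does both steps at once.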

You do need, however, to consider the nature of your response variable and whether that will allow you to meet the usual statistical testing assumptions. For example, your "effective species number" is necessarily non-negative; depending on the numbers of species, it might not be possible for the errors in the regression model to show the normal distribution that is assumed by many standard statistical tests. That's not necessarily a problem, as you could use some type of resampling of the data to provide a test that doesn't depend on such assumptions.
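One common form of resampling is a permutation test: shuffle the group labels many times and see how often the shuffled difference in means is at least as large as the one you observed. The sketch below assumes a simple two-group comparison; the data are invented and the function name `permutation_test` and its parameters are hypothetical.

```python
import random
import statistics

def permutation_test(sample_a, sample_b, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(statistics.fmean(sample_a) - statistics.fmean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Shuffling the pooled values is equivalent to shuffling the labels.
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        )
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_resamples + 1)

# Made-up diversity scores, one value per site. NOT real data.
sites_with_piles = [12.0, 15.0, 11.0, 14.0, 16.0, 13.0]
sites_without_piles = [9.0, 10.0, 12.0, 8.0, 11.0, 10.0]

p_value = permutation_test(sites_with_piles, sites_without_piles)
print(f"permutation p-value: {p_value:.4f}")
```

The test makes no normality assumption, but it does assume that the sites are exchangeable under the null hypothesis, so appropriate matching of the sites still matters.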

If the "effective species number" is necessarily an integer, you might consider a Poisson regression, often appropriate for count data.
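To make the idea concrete, here is a bare-bones sketch of a Poisson regression with a log link, log(E[count]) = b0 + b1 * age, fitted by Newton-Raphson. It assumes the response is a plain species count per pile; the counts are invented, and `fit_poisson` is a hypothetical helper, not a library function. A statistics package would normally do this fit for you and also report standard errors.

```python
import math

def fit_poisson(x, y, n_iter=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    # Start from the intercept-only fit: b0 = log(mean count), b1 = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system by hand.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Made-up species counts for piles aged 1, 2 and 3 years. NOT real data.
ages = [1, 1, 2, 2, 2, 3, 3, 3, 3]
species_counts = [3, 5, 6, 4, 7, 8, 10, 7, 9]

b0, b1 = fit_poisson(ages, species_counts)
# exp(b1) is the multiplicative change in expected count per extra year.
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, rate ratio per year = {math.exp(b1):.3f}")
```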

You also should consider whether your "effective species number" is the best measurement of biodiversity for your project. Other types of diversity index capture population size differences among the species in addition to the number of species. If you do choose to use a different measure, you will have to evaluate whether the assumptions of using t-tests or standard linear regression tests will be appropriate. For example, some measures of diversity are restricted to values between 0 and 1, in which case a beta regression might be called for.
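To see how these measures differ, the sketch below computes a few of them from per-species abundance counts for a single pile. I'm assuming here that "effective species number" means the exponential of the Shannon entropy, which is one common definition; if your project defines it differently, adjust accordingly. The abundances are invented and `diversity_measures` is a hypothetical helper.

```python
import math

def diversity_measures(abundances):
    """Compute several diversity measures from per-species counts."""
    counts = [c for c in abundances if c > 0]  # ignore absent species
    total = sum(counts)
    proportions = [c / total for c in counts]
    shannon = -sum(p * math.log(p) for p in proportions)
    return {
        "richness": len(counts),                 # number of species present
        "shannon": shannon,                      # Shannon entropy (nats)
        "effective_species": math.exp(shannon),  # assumed definition
        # Gini-Simpson index, always between 0 and 1.
        "gini_simpson": 1.0 - sum(p * p for p in proportions),
    }

# Two made-up piles with the same 4 species but different evenness.
even_pile = diversity_measures([10, 10, 10, 10])
uneven_pile = diversity_measures([37, 1, 1, 1])

for name, result in (("even", even_pile), ("uneven", uneven_pile)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

Both piles have a richness of 4, but the uneven pile scores much lower on the other measures, which is the sense in which they capture population size differences as well as the number of species.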

Your supervisor should be familiar with these issues and able to provide advice on the best way to proceed based on the specifics of your project. Also, do take the advice provided by James Phillips in a comment: it's best to look at the data directly before you go too far down the road toward fancy statistical testing.