Solved – Machine learning models for regression on small data sets

machine-learning, predictive-models, regression, small-sample

What are the "best" models to be used for simple regression of 1 numerical variable using only a small data set of e.g. 250 samples and up to 10 features?

I understand that the data set is very small (even smaller once one applies e.g. a 60%/40% train-test split) and that this carries a high risk of over-fitting, especially with complex models such as neural networks.

What would be a reasonable model to use in such a case, and what would be the best way to avoid over-fitting? Note that I do not know whether the relationships are linear or whether all features are necessarily helpful.

Best Answer

Small data sets with few features are a domain where traditional statistical models tend to do very well, and they have the added benefit of letting you actually interpret the importance of your features.

I'm assuming by "simple regression" you mean predicting a real-valued, continuous variable y from your input variables. You mention that you suspect you may have non-linear relationships, and you don't know much about the importance of the features.

My instinct in this case would be to use a generalized additive model (GAM), such as the one implemented in the mgcv package for R. mgcv has very good default methods for choosing the more arcane parameters of a GAM, such as how many knots to use and where to place them.

Maybe you have three predictors, x1, x2, and x3, where x1 and x2 are continuous and x3 is a categorical variable. In this case you could do (in R):

library(mgcv)
# Make sure the categorical predictor is treated as a factor
x3 <- as.factor(x3)
# s() fits a penalized smooth for each continuous predictor;
# the factor x3 enters as ordinary dummy variables
my.model <- gam(y ~ s(x1) + s(x2) + x3, method = "REML")
# Effective degrees of freedom and approximate term significance
summary(my.model)
# Fitted smooths with confidence bands, all on one page
plot(my.model, shade = TRUE, pages = 1)
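
Since you also mention not knowing whether all features are helpful: mgcv can shrink uninformative terms out of the model for you. A minimal sketch, assuming the same variables as above, using the select = TRUE double-penalty approach:

# select = TRUE adds an extra penalty per smooth, so terms that
# carry no signal can be shrunk to (near) zero effective df
my.model.sel <- gam(y ~ s(x1) + s(x2) + x3,
                    method = "REML", select = TRUE)
summary(my.model.sel)  # edf near 0 suggests a term adds little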

Using method = "REML" in the calls above is personal preference. The method argument controls how the smoothing parameters are estimated, which in turn sets how "wiggly" the non-linear curves are allowed to be. The default uses, if I recall correctly, generalized cross-validation (GCV), which works fine but in my experience tends to give wigglier curves.
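
If you want to see the difference yourself, a quick sketch: fit the same model without specifying method (mgcv's default is GCV-based) and compare the effective degrees of freedom of the smooths with the REML fit above.

# Default (GCV-based) smoothness selection; higher edf values
# in the summary mean wigglier curves than the REML fit
my.model.gcv <- gam(y ~ s(x1) + s(x2) + x3)
summary(my.model.gcv)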
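
Finally, on avoiding over-fitting with only 250 samples: rather than a single 60%/40% split, k-fold cross-validation uses the data more efficiently. A minimal 5-fold sketch, assuming (hypothetically) that y, x1, x2, and x3 live in a data frame called dat:

set.seed(1)
# Assign each row to one of 5 folds at random
folds <- sample(rep(1:5, length.out = nrow(dat)))
cv.rmse <- sapply(1:5, function(k) {
  # Fit on the other folds, predict the held-out fold
  fit <- gam(y ~ s(x1) + s(x2) + x3,
             data = dat[folds != k, ], method = "REML")
  pred <- predict(fit, newdata = dat[folds == k, ])
  sqrt(mean((dat$y[folds == k] - pred)^2))
})
mean(cv.rmse)  # estimated out-of-sample RMSE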