I know that R has the gam and mgcv packages for generalized additive models, but I am having difficulty finding their counterparts in the Python ecosystem (statsmodels only has a prototype in the sandbox). Is anyone aware of existing Python libraries? If not, who knows, this might be a good project to develop for, or contribute to, scikit-learn.
Solved – Generalized Additive Model Python Libraries
generalized-additive-model
Related Solutions
As mentioned in the comments, a propensity to overfit is a limitation of GAMs. Another limitation is that the model loses predictive power when the smoothed variables take values outside the range of the training dataset. Essentially, you are sacrificing predictability outside of your data range for precision within your data range.
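One way to guard against the extrapolation problem is to check new covariate values against the range seen in training before trusting a prediction. The following is a minimal sketch in plain Python; the function names `training_range` and `outside_range` and the temperature values are made up for illustration and do not come from any GAM library.

```python
def training_range(values):
    """Return the (min, max) of a covariate as observed in the training data."""
    return min(values), max(values)


def outside_range(new_values, lo, hi):
    """Return the new values that would force a smooth term to extrapolate."""
    return [v for v in new_values if v < lo or v > hi]


# Hypothetical training temperatures and new observations to predict for.
train_temp = [2.5, 7.1, 11.8, 15.0, 19.4, 23.9]
new_temp = [5.0, 18.2, 31.7, -4.0]

lo, hi = training_range(train_temp)
flagged = outside_range(new_temp, lo, hi)
print(flagged)  # [31.7, -4.0]
```

Predictions for the flagged values rest entirely on how the smooth happens to behave beyond the data, so they deserve extra scepticism.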
`k` is the number of basis functions to use for each smooth term before any identifiability constraints are applied. Typically, for 1-d bases, there will be one fewer basis function than implied by `k` because of a lack of identifiability with the model intercept, but it may be even lower if other constraints apply.
The number of basis functions used to represent a smooth function puts an upper limit on the wiggliness of that function. The smoothness penalty (or penalties) for a term determines the actual wiggliness of the smooth; `k` is best thought of as providing information as to the maximum allowed wiggliness of each smooth term.
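The split between basis size and penalty can be illustrated with a small penalized regression in plain Python. This is only a sketch under stated assumptions: it uses piecewise-linear basis functions and a second-difference penalty on the coefficients (in the spirit of P-splines), not mgcv's default basis, it applies no identifiability constraint, and the smoothing parameter `lam` is fixed by hand, whereas mgcv estimates it from the data. All function names are hypothetical.

```python
import math
import random


def hat_basis(x, k, lo, hi):
    """Evaluate k evenly spaced piecewise-linear basis functions at x."""
    step = (hi - lo) / (k - 1)
    return [max(0.0, 1.0 - abs(x - (lo + j * step)) / step) for j in range(k)]


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def roughness(beta):
    """Sum of squared second differences of the coefficients."""
    return sum((beta[j] - 2 * beta[j + 1] + beta[j + 2]) ** 2
               for j in range(len(beta) - 2))


def fit(xs, ys, k, lam):
    """Minimise the residual sum of squares plus lam * roughness(beta)."""
    lo, hi = min(xs), max(xs)
    rows = [hat_basis(x, k, lo, hi) for x in xs]
    # Normal equations: (B'B + lam * D'D) beta = B'y
    lhs = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    stencil = (1.0, -2.0, 1.0)  # second-difference weights
    for j in range(k - 2):
        for p in range(3):
            for q in range(3):
                lhs[j + p][j + q] += lam * stencil[p] * stencil[q]
    return solve(lhs, rhs)


# Simulated data: one sine cycle plus noise (seeded for reproducibility).
rng = random.Random(0)
xs = [i / 59 for i in range(60)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]

wiggly = fit(xs, ys, k=10, lam=0.0)    # penalty off: as wiggly as k allows
smooth = fit(xs, ys, k=10, lam=100.0)  # same k, heavy penalty
print(roughness(wiggly), roughness(smooth))
```

Both fits use the same `k`, so they share the same ceiling on wiggliness; only the penalty decides how much of that flexibility is actually used.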
For seasonal effects, you might want to use a cyclic spline basis; two are available in mgcv:
- a cyclic cubic regression spline basis (`bs = "cc"`), and
- a cyclic P-spline basis (`bs = "cp"`).
If I had sufficient data, I would start with `k = 10` or `k = 15` for a term used to model the seasonal cycle, but this would only be required if I were including a smooth of Day of Year, for example. It doesn't seem like you are doing this.
Just because you have daily data doesn't mean that smooth functions of covariates are cyclic or need to have high numbers of basis functions to capture some sine-wave-like behaviour. For example, temperature varies seasonally both within and between days, but the effect of temperature on the response is unlikely to be cyclical.
So, for the terms you show in the model in the question, it doesn't seem like you need `k = 12` or to use a cyclic basis.
For terms like `s(temp)`, `s(hum)`, `s(sea_pressure)`, and `s(wind_speed)`, it is more likely that there is some effect as you increase / decrease any of these covariates, but that the effect may saturate (the rate of increase / decrease gets slower as the covariate values increase).
The smooth `s(time)` might require more specialised behaviour. If this is time of day, then a cyclic smoother will enable 00:00 and 24:00 to have the same estimated effect; if you don't assume there to be a discontinuity at midnight then a cyclic spline will ensure this.
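To see what "same estimated effect at 00:00 and 24:00" means in practice, here is a minimal sketch that uses sine/cosine harmonics as a stand-in for a cyclic spline basis (it is not the basis mgcv uses, but it shares the wrap-around property). The function names and the coefficient values are hypothetical.

```python
import math


def cyclic_features(hour, n_harmonics=2, period=24.0):
    """Sine/cosine features that repeat every `period` hours."""
    angle = 2 * math.pi * hour / period
    feats = []
    for h in range(1, n_harmonics + 1):
        feats.append(math.sin(h * angle))
        feats.append(math.cos(h * angle))
    return feats


def effect(hour, coefs):
    """Effect of time of day for a given coefficient vector."""
    return sum(c * f for c, f in zip(coefs, cyclic_features(hour)))


# Hypothetical coefficients; in practice these would be estimated from data.
coefs = [0.8, -0.3, 0.1, 0.25]

start, end = effect(0.0, coefs), effect(24.0, coefs)
print(math.isclose(start, end, abs_tol=1e-9))
```

Whatever coefficients are estimated, the two ends of the day are tied together by construction, which is exactly the behaviour you do not want for a long-term trend.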
If `s(time)` is related to the date of observation (so to capture a longer term trend in the data) then you may not want a cyclic smoother; it would force the two end-points of the `time` covariate to have equal effect, which is not what you would want if there is an increasing or decreasing trend over time.
You seem to have several indicator variables for the day of week. These are best recorded as a single factor variable in R, say `day_of_week`, with levels:
c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday')
R will then work out the correct dummy variables (or other contrast coding if required/specified).
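For readers coming from Python, the coding that R derives from a factor can be sketched by hand. The example below assumes treatment contrasts (R's default for unordered factors), where the first level acts as the baseline and gets no column of its own; `treatment_dummies` is a hypothetical helper, not a library function.

```python
LEVELS = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
          'Thursday', 'Friday', 'Saturday']


def treatment_dummies(day, levels=LEVELS):
    """Code one factor value as 0/1 indicators, with levels[0] as the baseline."""
    if day not in levels:
        raise ValueError(f"unknown level: {day!r}")
    return [1 if day == level else 0 for level in levels[1:]]


print(treatment_dummies('Sunday'))   # [0, 0, 0, 0, 0, 0]
print(treatment_dummies('Tuesday'))  # [0, 1, 0, 0, 0, 0]
```

Seven levels yield six columns; supplying all seven indicators alongside an intercept would make the model unidentifiable.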
A final comment: don't include the name of the data frame containing the data, `D`, in the model formula. If you want to predict, the `predict()` function will look for variables with names of the form `D$foo` in the new data supplied. As the `data` argument already allows you to indicate where the covariates come from, you will save yourself a lot of problems down the line if you exclude the `D$` bits from all your terms.
Best Answer
I've written a Python implementation of GAMs using penalized B-splines.
Check it out here: https://github.com/dswah/pyGAM
I've included lots of link functions, distributions and features.