Solved – Generalized Additive Model (k value)

Tags: generalized-additive-model, r, regression, statistical-significance

I am trying to fit a GAM in R.
I am using the mgcv package, and my code is as follows.

[Image: screenshot of the model code; not reproduced here]

However, I do not understand what the k value is for.
If I want to see the seasonal effect over time, should I set k = 12? (My data are daily observations from 2000.01.01 to 2003.12.31.)
Please help me. 🙁

Best Answer

k is the number of basis functions to use for each smooth term before any identifiability constraints are applied. Typically, for 1d bases, there will be one fewer basis function than implied by k because of a lack of identifiability with the model intercept, but it may be even lower if other constraints apply.

The number of basis functions used to represent a smooth function puts an upper limit on the wiggliness of that function. The smoothness penalty (or penalties) for the term then determine the actual wiggliness of the fitted smooth.

k is best thought of as providing information as to the maximum allowed wiggliness of each smooth term.
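As a minimal sketch of that distinction, the following fits a single smooth to simulated data (not the data from the question) with k = 10 and compares the number of coefficients the basis provides with the effective degrees of freedom the penalty actually leaves in use. The variable names are illustrative only.

```r
library(mgcv)

set.seed(1)
n <- 400
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # simulated data, an assumption for illustration

# k = 10 requests 10 basis functions for s(x); the sum-to-zero
# identifiability constraint removes one, leaving 9 coefficients.
m <- gam(y ~ s(x, k = 10), method = "REML")

n_smooth_coefs <- length(coef(m)) - 1   # drop the model intercept
smooth_edf <- summary(m)$edf            # effective degrees of freedom of s(x)

print(n_smooth_coefs)   # upper limit on wiggliness set by k
print(smooth_edf)       # wiggliness actually used after penalisation
```

If the effective degrees of freedom sit very close to the upper limit, that is usually a hint that k may be too small for the term.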

For seasonal effects, you might want to use a cyclic spline basis; two are available in mgcv:

  1. a cyclic cubic regression spline basis, and
  2. a cyclic P-spline basis.

If I had sufficient data, I would start with k = 10 or k = 15 for a term used to model the seasonal cycle, but this would only be required if I were including a smooth of Day of Year, for example. It doesn't seem like you are doing this.
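For illustration, a seasonal term of the kind described above could be sketched as follows. The data are simulated over the same date range as in the question, the variable doy (day of year) is an assumption and not something in the asker's model, and k = 12 is just one starting value. In mgcv, bs = "cc" selects the cyclic cubic regression spline and bs = "cp" the cyclic P-spline.

```r
library(mgcv)

set.seed(2)
dates <- seq(as.Date("2000-01-01"), as.Date("2003-12-31"), by = "day")
doy <- as.numeric(format(dates, "%j"))   # day of year, 1..366
y <- 10 + 3 * sin(2 * pi * doy / 365.25) + rnorm(length(dates))  # simulated response
D <- data.frame(y = y, doy = doy)

# Cyclic cubic regression spline for the seasonal cycle.
# The boundary knots at 0.5 and 366.5 make the end of December
# join smoothly onto the start of January.
m <- gam(y ~ s(doy, bs = "cc", k = 12), data = D,
         knots = list(doy = c(0.5, 366.5)), method = "REML")

# The fitted values at the two boundary knots coincide by construction
ends <- predict(m, newdata = data.frame(doy = c(0.5, 366.5)))
print(ends)
```

Swapping bs = "cc" for bs = "cp" would use the cyclic P-spline basis instead; the rest of the call can stay the same.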

Just because you have daily data doesn't mean that the smooth functions of the covariates are cyclic or need to have high numbers of basis functions to capture some sine-wave-like behaviour. For example, temperature varies seasonally both within and between days, but the effect of temperature on the response is unlikely to be cyclical.

So, for the terms you show in the model in the question, it doesn't seem like you need k = 12 or to use a cyclic basis.

For terms like s(temp), s(hum), s(sea_pressure), and s(wind_speed), it is more likely that there is some effect as you increase / decrease any of these covariates but that the effect may saturate (the rate of increase / decrease gets slower as the covariate values increase).

The smooth s(time) might require more specialised behaviour. If this is time of day, then a cyclic smoother will force 00:00 and 24:00 to have the same estimated effect; if you don't expect a discontinuity at midnight, a cyclic spline will ensure there isn't one.

If s(time) is related to the date of observation (so to capture a longer term trend in the data) then you may not want a cyclic smoother; it would force the two end-points of the time covariate to have equal effect, which is not what you would want if there is an increasing or decreasing trend over time.
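A rough sketch of the second case, using simulated data with an upward trend: the long-term trend gets an ordinary (non-cyclic) smooth of time, while any seasonal cycle goes into a separate cyclic smooth. The variables time (days since the first observation) and doy (day of year) are assumptions made for this example.

```r
library(mgcv)

set.seed(3)
dates <- seq(as.Date("2000-01-01"), as.Date("2003-12-31"), by = "day")
D <- data.frame(
  time = as.numeric(dates - dates[1]),       # days since the first observation
  doy  = as.numeric(format(dates, "%j"))     # day of year, 1..366
)
# Simulated response: slow upward trend plus a seasonal cycle plus noise
D$y <- 0.002 * D$time + 2 * sin(2 * pi * D$doy / 365.25) + rnorm(nrow(D))

# Non-cyclic smooth for the trend, cyclic smooth for the seasonal cycle
m <- gam(y ~ s(time) + s(doy, bs = "cc", k = 12), data = D,
         knots = list(doy = c(0.5, 366.5)), method = "REML")

# Predictions at the first and last day, holding day of year fixed:
# the non-cyclic trend is free to differ between the two end-points.
trend_ends <- predict(m, newdata = data.frame(time = range(D$time), doy = 180))
print(trend_ends)
```

Had s(time) been given a cyclic basis here, the two end-point predictions would have been forced to match, hiding the trend.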

You seem to have several indicator variables for the day of week. These are best recorded as a single factor variable in R, say day_of_week, with levels:

c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday')

R will then work out the correct dummy variables (or other contrast coding if required/specified).
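One way to build such a factor from a vector of dates is sketched below. The dates vector is made up to match the date range in the question; as.POSIXlt()$wday is used rather than weekdays() so that the result does not depend on the locale.

```r
dates <- seq(as.Date("2000-01-01"), as.Date("2003-12-31"), by = "day")

day_names <- c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday')

# as.POSIXlt()$wday codes Sunday as 0 through Saturday as 6
day_of_week <- factor(as.POSIXlt(dates)$wday, levels = 0:6, labels = day_names)

# With the default treatment contrasts R creates an intercept plus
# six dummy variables, using the first level (Sunday) as the reference
X <- model.matrix(~ day_of_week)
print(colnames(X))
```

The factor can then be added to the model formula as a single term, day_of_week, in place of the separate indicator variables.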

A final comment: don't include the name of the data frame containing the data, D, in the model formula. If you want to predict, the predict() function will look for variables with names of the form D$foo in the new data supplied. As the data argument already allows you to indicate where the covariates come from, you will save yourself a lot of problems down the line if you exclude the D$ bits from all your terms.
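As a small sketch of the recommended pattern, the model below refers to covariates by bare name and passes the data frame through the data argument, so predict() can match the same names in new data. The data are simulated with a saturating temperature effect, and new_data is a hypothetical set of values to predict at.

```r
library(mgcv)

set.seed(5)
D <- data.frame(temp = runif(300, 0, 30))
D$y <- 5 * (1 - exp(-D$temp / 8)) + rnorm(300, sd = 0.5)  # simulated saturating effect

# Bare variable names in the formula; the data frame goes in `data`
m <- gam(y ~ s(temp), data = D, method = "REML")

# New data only needs columns named like the covariates in the formula
new_data <- data.frame(temp = c(5, 15, 25))
p <- predict(m, newdata = new_data)
print(p)
```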