FYI, this might be more appropriate for SE.DataScience, but for the time being, I'll answer it here.
It seems to me like you might be in a situation where you will have no choice but to write a script to implement your solutions. Never having worked with splines, my knowledge of them is strictly theoretical, so please bear with me and let me know if there is anything I'm not seeing.
Broadly speaking, it appears that you have a couple of different items that you will have to resolve in order to implement this.
1.) Determining the model parameters in a dynamic fashion. You have previously mentioned that you've used a combination of domain knowledge and univariate measures. That seems like something you should be able to handle heuristically. You will have to agree at the outset on a set of rules for your program to implement. This may or may not be a trivial task, as you will have to think hard about the potential implications of those rules. It may require you to revisit every step of your process and catalogue not just the decisions, but also the reasons behind them.
2.) Actually implementing your program. To make your performance testing properly dynamic and easy to maintain and modify going forward, you will have to think about how to structure it. You will likely want some sort of loop for the main model predictive performance estimation, preferably with a user-definable length to allow for greater flexibility. You will also want to write a separate function for each action your program takes, as this makes it easier to test functionality and to maintain and modify the program later. At a minimum, you will likely need functions for dataset selection (i.e. only time periods that have "gone by" at the moment of backtesting), for cleaning and validation (which you'll really have to think about, as data munging is a critical part of model building), for setting model training parameters, and for model prediction and for collecting and storing performance measures.
Your question about outlier detection and handling also falls under those two concerns, and I would go about implementing it by writing smaller loops within your main program loop that continue to "clean" and refit the model until it has reached a point where you are happy with it (which, again, you'll have to define yourself).
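To make the structure above concrete, here is a minimal rolling-origin backtest sketch in R. Every function name, the toy data, and the stand-in lm() model are hypothetical placeholders for your own process, not a definitive implementation:

```r
## Hypothetical rolling-origin backtest skeleton; replace each stub with your own logic.
set.seed(1)
full_data <- data.frame(t = 1:100, y = cumsum(rnorm(100)))  # toy time series

select_window  <- function(data, origin) data[data$t <= origin, ]  # only "gone by" periods
clean_data     <- function(data) data[!is.na(data$y), ]            # stub cleaning/validation
fit_model      <- function(data) lm(y ~ t, data = data)            # stand-in for your model
score_forecast <- function(model, newdata) {
  pred <- predict(model, newdata = newdata)
  abs(pred - newdata$y)                                            # absolute forecast error
}

origins <- 80:99                          # user-definable backtest length
errors  <- numeric(length(origins))
for (i in seq_along(origins)) {
  train     <- clean_data(select_window(full_data, origins[i]))
  m         <- fit_model(train)
  test_row  <- full_data[full_data$t == origins[i] + 1, ]
  errors[i] <- score_forecast(m, test_row)
}
mean(errors)  # overall backtest performance measure
```

The outlier-handling sub-loop you describe would sit between clean_data() and fit_model(), refitting until your own stopping rule is met.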
If this sounds like a big task, it's because it is; people have written entire software libraries (sometimes very lucratively) in order to perform this sort of task. Beyond that, it's hard to offer any more specific advice without knowing more about your processes, data structure, and the programming language you've done your work in thus far.
If any of this is useful to you and you'd like me to expand on any of it, leave a comment and I'd be more than happy to do so.
library("ggplot2")
theme_set(theme_bw())
library("mgcv")
df <- data.frame(one   = 1 * (1:20),
                 two   = c(c(1,6,2,8,7,4,9,8,5,4), c(1,6,2,8,7,4,3,8,5,4) / 2),
                 three = 3 * (1:20))
m <- gam(three ~ s(one, by = two, k = 8, pc = 10), data = df)
Q1
What your model is fitting is a centred smooth effect of one multiplied by two. The spline itself is quite smooth:
Next note that the estimates of the spline for the first n elements are all negative and quite similar in magnitude, especially the first few, where you see the anticorrelated behaviour:
> head(fitted(m))
[1] 25.936853 4.209230 23.373273 5.188732 12.415398 22.525351
> head(predict(m, type = "terms")[,1])
1 2 3 4 5 6
-5.118848 -26.846471 -7.682427 -25.866969 -18.640302 -8.530349
What you see here is the relatively similar contribution from the smooth being multiplied by the quite different values of two for these first two elements. As the spline effects are negative, the resulting contribution is somewhat negative for the first value and much more strongly negative for the second value, and so on.
This is all due to the model you specified: estimate a spline, evaluate the spline at the ith location, and multiply that effect by the ith value of the numeric by variable.
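You can verify this decomposition directly from the fitted model: the fitted values are just the intercept plus the per-term contributions returned by predict(..., type = "terms"). A sketch, assuming the model from the question:

```r
library("mgcv")

df <- data.frame(one   = 1 * (1:20),
                 two   = c(c(1,6,2,8,7,4,9,8,5,4), c(1,6,2,8,7,4,3,8,5,4) / 2),
                 three = 3 * (1:20))
m <- gam(three ~ s(one, by = two, k = 8, pc = 10), data = df)

# Per-term contributions; these already include multiplication by `two`
term_mat <- predict(m, type = "terms")
recon    <- attr(term_mat, "constant") + rowSums(term_mat)
all.equal(unname(fitted(m)), unname(recon))  # TRUE
```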
Q2
Knots are not inflexions of the resulting fitted values. In fact, there are as many knots as data points, as you used a thin plate spline here. Here k represents the dimension of the basis expansion (in this case the dimension is 7, because you lose 1 due to the centring constraint imposed on the spline), and mgcv employs a neat trick to select 7 basis functions from the n basis functions in a traditional thin plate spline basis. (The basis functions are subject to an eigen decomposition and the eigenvectors associated with the k-1 largest eigenvalues are used as a new basis. This concentrates most of the information in the full thin plate spline basis into a smaller set of orthogonal basis functions for model fitting.)
The estimated coefficients clearly indicate that only 7 basis functions were used in the model:
> coef(m)
(Intercept) s(one):two.1 s(one):two.2 s(one):two.3 s(one):two.4 s(one):two.5
31.05570043 0.70385580 -0.65193244 -0.01680226 -0.21851339 0.14293203
s(one):two.6 s(one):two.7
-0.52163338 5.27573517
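You can confirm the basis dimension from the model object itself (a sketch assuming the gam fit from the question):

```r
library("mgcv")

df <- data.frame(one   = 1 * (1:20),
                 two   = c(c(1,6,2,8,7,4,9,8,5,4), c(1,6,2,8,7,4,3,8,5,4) / 2),
                 three = 3 * (1:20))
m <- gam(three ~ s(one, by = two, k = 8, pc = 10), data = df)

m$smooth[[1]]$bs.dim                      # 8, the requested k
sum(grepl("s\\(one\\)", names(coef(m))))  # 7 smooth coefficients after the constraint
length(coef(m))                           # 8 in total, including the intercept
```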
Q3
As the plot of the spline shows, the spline does pass through x = 10 as you requested.
You are confusing the fitted values of the model with the estimated spline.
Best Answer
I think for a more complete answer I will wait for somebody more experienced with GAMs to correct what I say here, but this is what I know of derivatives in GAMs and how the m argument works with them. For starters, recall that the first derivative is the slope of a smooth function at a specific point, obtained as the limit of the difference quotient as the increment becomes infinitesimally small (Larson & Edwards, 2023, p.103):$$ f'(x) = \lim_{\Delta{x} \to 0}\frac{f(x + \Delta{x})-f(x)}{\Delta{x}} $$
where $x$ is an input and $\Delta{x}$ is the change in distance from that $x$ on the $x$ axis. Basically, at any given point on a smooth function, we can approximate its slope with a linear function. The second derivative, conversely, is the change in slope of a smooth function, and is simply the derivative of the first derivative (Larson & Edwards, 2023, p.128):
$$ f''(x) = \frac{d}{dx}\,f'(x) $$
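Both limit definitions are easy to check numerically: with a small step h, the finite-difference quotients approach the exact derivatives. A quick sketch for f(x) = sin(x), whose derivatives are known in closed form:

```r
f <- function(x) sin(x)
h <- 1e-5
x <- 1

# Forward difference approximates f'(x) = cos(x)
first_fd  <- (f(x + h) - f(x)) / h
# Central second difference approximates f''(x) = -sin(x)
second_fd <- (f(x + h) - 2 * f(x) + f(x - h)) / h^2

first_fd  - cos(x)  # close to 0
second_fd + sin(x)  # close to 0
```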
Note that how these derivatives are defined in the context of GAMs, as well as their penalties, can be a little different, given that a GAM function is typically defined by a number of basis functions rather than a single function (Wood, 2017, p.162):
$$ f(x) = \sum\limits_{j=1}^{k}b_j(x)\beta_j $$
where $j$ indexes the basis functions. The actual penalization of this function by default involves finite differences with respect to the second derivative (see a relevant discussion here). In any case, to summarize the derivatives, we have two things: our slopes at a given point, and our rates of change. These can actually be visualized for a fitted GAM model using the gratia package. Here is an example pulled directly from the GitHub page. If we draw our base model, we get these smooths:
To obtain the derivatives, we use the following function and use draw once more. Now if you compare the first and second crop of plots, you can see the first smooth's slope does not change much from about 0 to about .25. As the curve approaches the middle of the plot, its slope decreases from a large positive value to an ever smaller one. This change is captured by the second plot, which shows the values from 0 to about .25 as flat until the slope starts decreasing, where you see the relationship also decrease in the plot. This decrease in slope continues until you see the flat line at the end, where the slope no longer changes in the original function.
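If gratia is not available, the same first-derivative curve can be approximated by hand with finite differences of the model's predictions. A sketch on simulated data, using only mgcv; gamSim() and the model here are stand-ins for the plots above:

```r
library("mgcv")
set.seed(2)
dat <- gamSim(1, n = 200, verbose = FALSE)  # standard mgcv example data
m   <- gam(y ~ s(x2), data = dat)

eps <- 1e-5
x   <- seq(min(dat$x2), max(dat$x2), length.out = 100)
p0  <- predict(m, newdata = data.frame(x2 = x))
p1  <- predict(m, newdata = data.frame(x2 = x + eps))
deriv1 <- (p1 - p0) / eps                   # finite-difference first derivative

plot(x, deriv1, type = "l",
     main = "Finite-difference first derivative of s(x2)")
```

This is the same finite-difference idea gratia::derivatives() uses internally, just done manually.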
Alternatively, the second derivatives are shown as such:
You can see they look "blocky." This is because of the nature of second derivatives. If for example we simplify things a lot by defining a function as $f(x) = x^3$, we can find that
$$ f'(x) = 3x^2 $$
$$ f''(x) = 6x $$
Notice that for this polynomial the second derivative is a linear function, which contributes to the zig-zaggy behavior of the plot for df2.

What does this mean within the context of GAMs? We know that TPRS splines and some similar splines by default penalize the second derivative, using m=2. If we want to compare our model with one that penalizes the first derivative instead, we can do so directly with another gratia function called compare_smooths, which shows how the choice changes our respective fits. Here I explicitly set m to 1 or 2 to make it clear what I'm doing. We can now directly compare the model fit1, which penalizes the first derivative, and fit2, which penalizes the second derivative. Notice that for some smooths this has simplified them, and for others it has made them more wiggly. Now for the million dollar question, for which I don't have an exact answer: what does this mean for us? Well, in some cases this may make the model fit the data better, and that is enough. Penalizing the right derivative can also lend itself to some other useful properties, as noted in Pedersen et al., 2019:
The best I can say is that you should try to fit your data with the functional relationship best defined by the theory and data driving that function, and try when possible not to overfit the data, regardless of the selection of $m$. From my own experience tinkering with $m$, it can sometimes create functions that wildly overfit the data where data points are sparse, and that is something you would of course want to avoid. Here is an extreme artificial example below, where the sp and k arguments are set to cause a lot of bends in the curve. You can see that because there are not a lot of data points, this can contribute to some very bad misspecification with a penalized first derivative. Predictions from this model would probably be poor if the actual functional relationship is smoother:
It sounds like you already know how your data is supposed to behave, and that should give you the right answer anyway about what is an adequate fit to your data.
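If you want to experiment with this yourself, here is a minimal sketch fitting the same smooth with a first-derivative penalty (m = 1) and the default second-derivative penalty (m = 2); the simulated data and comparison metrics are hypothetical stand-ins for your own checks:

```r
library("mgcv")
set.seed(3)
dat <- gamSim(1, n = 200, verbose = FALSE)

fit1 <- gam(y ~ s(x2, bs = "tp", m = 1), data = dat)  # penalize first derivative
fit2 <- gam(y ~ s(x2, bs = "tp", m = 2), data = dat)  # penalize second derivative (default)

# One rough way to compare the two penalties: effective degrees of freedom and AIC
c(edf_m1 = sum(fit1$edf), edf_m2 = sum(fit2$edf))
AIC(fit1, fit2)
```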
Citations