Solved – Building a model that can estimate the equation of a parabola, trained on sample ‘trajectories’

curve fitting, machine learning, r, splines

I discovered a parabolic relationship between time and a quantity in my time series data that looks like the one below:

How do I go about building a model that can learn the shape of these parabolas from thousands of samples and estimate the equation of the curve from the initial trajectory of a new series (e.g., t=0 to 5)?

I am not very technical and I have no idea if this is a reasonable request.

The quantity is viewers per minute of a video stream. Trajectory is metaphorical. At the end of the day I want to be able to input some array of t(3,5,8..n) of viewers for n minutes and get a better and better projection of the curve the more data points I give it, such that I can estimate max height and area under the curve, and do backtesting.

Some resources I found that seem relevant:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.smooth.spline.html and
Equation of a fitted smooth spline and its analytical derivative

I can fit a spline to trajectory but I don't know where to go from there:

d.spl <- with(d, smooth.spline(t, viewers))
d.spl

Call:
smooth.spline(x = t, y = viewers)

Smoothing Parameter  spar= 0.2297785  lambda= 0.0000001153716 (12 iterations)
Equivalent Degrees of Freedom (Df): 60.23867
Penalized Criterion (RSS): 10483.72
GCV: 175.7969

Best Answer

This is a quick answer: not robust, but a good starting place, and I hope it is helpful. I'll add to it should I have time. If you like the approach and want to keep going in that direction, I would look into Nearest Neighbor Classifiers.

Since the data looks rather comically consistent -- that's quite a pattern -- I would start by choosing a simple functional form, and a parabola sounds like a great idea. Simple b - ax^2 sort of shape. You can do this in R with

fit <- lm(y~poly(x,2,raw=TRUE))

where y are your viewers, and x is time, for EACH video. You'll get back the polynomial coefficients.
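As a rough sketch of that per-video fit: the data below are simulated, and the data frame `d` with columns `video`, `t`, and `viewers` is an assumption for illustration, not the asker's actual layout. Note that `poly(x, 2, raw = TRUE)` returns three coefficients (intercept, linear, quadratic), so the fitted shape is slightly more general than b - ax^2.

```r
# Simulated data: 3 videos, each roughly c0 + c1*t + c2*t^2 plus noise.
# Column names (video, t, viewers) are assumptions for illustration.
set.seed(1)
d <- do.call(rbind, lapply(1:3, function(id) {
  t <- 0:60
  data.frame(video = id, t = t,
             viewers = 100 * id + 10 * t - 0.15 * t^2 +
               rnorm(length(t), sd = 5))
}))

# Fit one quadratic per video and return its raw polynomial coefficients.
fit_one <- function(df) {
  fit <- lm(viewers ~ poly(t, 2, raw = TRUE), data = df)
  setNames(coef(fit), c("c0", "c1", "c2"))
}

# One row per video, one column per coefficient.
coefs <- t(sapply(split(d, d$video), fit_one))
coefs
```

Each row of `coefs` is then a compact description of one past video, which is what the later comparison and mixing steps work with.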

Then, now that you have something where hopefully you've removed some of the "noise" from the shapes, I would choose a simple distance metric (say Euclidean to start) and measure the distance between your video-to-predict (VtP), and the first k points (k minutes of viewers) of all of the videos you've seen already. (To do this, you'll need to generate k points from the curves you've fit, to compare to the k points you have from your VtP. You COULD just compare points-to-points directly, but I think there will be more over-fitting, so this might be an important regularization.)
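Here is a minimal sketch of that distance step. The coefficient matrix `coefs` and the observed values `vtp` are made-up numbers, and k = 5 is an arbitrary choice; in practice `coefs` would come from the per-video fits.

```r
# Hypothetical fitted coefficients for three past videos (one row each),
# describing the curve c0 + c1*t + c2*t^2.
coefs <- rbind(v1 = c(100, 10, -0.15),
               v2 = c(200, 12, -0.20),
               v3 = c(300,  8, -0.10))

# Generate k points from each fitted curve (minutes 0 .. k-1).
k <- 5
t_obs <- 0:(k - 1)
X <- cbind(1, t_obs, t_obs^2)   # k x 3 design matrix
curves <- coefs %*% t(X)        # one row per past video, k columns

# First k minutes observed for the video-to-predict (made-up numbers).
vtp <- c(195, 210, 222, 236, 245)

# Euclidean distance from the VtP to each fitted curve.
dists <- sqrt(rowSums(sweep(curves, 2, vtp)^2))
sort(dists)
```

The smallest entry of `dists` identifies the past video whose fitted curve starts out most like the new one.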

THEN, you do one of a few things. Either, you just choose the curve that's closest, and assume things will go like that until the end. You're done! But what if the curve isn't a great fit? Well, then you could choose the 5 nearEST neighbors, and average their parameters (weighted by their closeness, maybe -- lots and lots of tweaks possible), then predict with THAT curve, using

weighted_mean(b_close) - weighted_mean(a_close) * x^2
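The neighbor-averaging step could look something like the sketch below. The coefficients and distances are invented, inverse-distance weighting is just one of the many possible tweaks, and the three-coefficient form c0 + c1*t + c2*t^2 is assumed because that is what `lm` with `poly(x, 2, raw = TRUE)` returns. The peak height and the area under the curve (the two quantities the question asks for) then follow from the blended coefficients.

```r
# Hypothetical coefficients (c0, c1, c2) of past videos and their distances
# to the video-to-predict; all values are made up for illustration.
coefs <- rbind(v1 = c(100, 10, -0.15),
               v2 = c(200, 12, -0.20),
               v3 = c(300,  8, -0.10),
               v4 = c(210, 11, -0.18))
dists <- c(v1 = 220, v2 = 3, v3 = 160, v4 = 12)

# Keep the n_nb nearest neighbors and weight them by inverse distance.
n_nb <- 2
nb <- names(sort(dists))[seq_len(n_nb)]
w <- 1 / dists[nb]
w <- w / sum(w)

# Weighted mean of each coefficient gives the blended curve.
blend <- colSums(coefs[nb, , drop = FALSE] * w)
predict_curve <- function(t) blend[1] + blend[2] * t + blend[3] * t^2

# Peak of the downward-opening parabola.
t_peak <- -blend[2] / (2 * blend[3])
peak <- predict_curve(t_peak)

# Area under the curve from t = 0 to the later root (where viewers hit zero).
t_end <- (-blend[2] - sqrt(blend[2]^2 - 4 * blend[3] * blend[1])) / (2 * blend[3])
area <- integrate(predict_curve, 0, t_end)$value
```

Re-running this as more minutes of the new video arrive (and the distances are recomputed) gives the "better and better projection" the question is after.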

The biggest problem with this approach is that assuming a parabola is quite strict, but conceptually it's easier to think about mixing several polynomials, than it is to think about mixing several splines, and the mixing will be important, probably. You'll want to express each new curve as some combination of the old curves, and it's necessary to have a mix-able way of expressing them, to do that.

You could just try to average the five nearest-neighbor-curves directly, as well, but it will be harder to extrapolate to trajectories unlike those you've seen before -- with a parametric form, that would be easier.

Related Solutions

Solved – Spline fitting in R – how to force passing two data points

Rather than use smooth.spline() in the stats package, there is a function cobs() in the cobs package that allows you to do exactly the sort of thing you want. COBS stands for Constrained B-splines. Possible constraints include going through specific points, setting derivatives to specified values, monotonicity (increasing or decreasing), concavity, convexity, periodicity, etc.

In your case, use

cobs(x, y, pointwise=rbind(c(0,-100,-1),c(0,100,1)))

Solved – Back-testing or cross-validating when the model-building process was interactive

FYI, this might be more appropriate for SE.DataScience, but for the time being, I'll answer it here.

It seems to me like you might be in a situation where you will have no choice but to write a script that will implement your solutions. Never having worked with splines, my knowledge of them is strictly theoretical so please bear with me and let me know if there is anything I'm not seeing.

Broadly speaking, it appears that you have a couple of different items that you will have to resolve in order to implement this.

1.) Determining the model parameters in a dynamic fashion. You have previously mentioned that you've used a combination of domain knowledge and univariate measures. That seems to me like something that you should be able to handle heuristically. You will have to agree at the outset on a set of rules which your program will implement. This may or may not be a trivial task, as you will have to do some hard thinking about the potential implications of those rules. This may require you to revisit every step of your process and catalog not just the decisions, but also the reasons behind those decisions.

2.) Actually implementing your program. In order to make your performance testing properly dynamic and easy to maintain and modify going forward, you will have to think about how you're going to structure it. You will likely want to use some sort of loop for your main model predictive performance estimation, preferably with a user-definable length in order to allow for greater flexibility going forward. You will also likely want to write separate functions for each action that you want your program to take as this will make it easier to test functionality, and to maintain and modify your program going forward. You will, at a minimum, likely need functions for dataset selection (i.e. only time periods that have "gone by" at the moment of backtesting), cleaning and validation (which you'll really have to think about, as data munging is a critical part of model building), functions for model training parameters, and functions for model prediction and performance measure collection and storage.
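To make the structure in point 2 concrete, here is a bare-bones sketch of such a loop. Everything in it is a placeholder: the simulated series, the helper names (`select_data`, `train_model`, `predict_next`), and the linear model stand in for your own data, rules, and spline-based model.

```r
set.seed(42)
# Simulated series (hypothetical): one observation per time period.
dat <- data.frame(t = 1:60, y = cumsum(rnorm(60, mean = 0.5)))

# Dataset selection: only periods that have "gone by" at the cutoff.
select_data <- function(dat, cutoff) dat[dat$t <= cutoff, ]

# Model training; swap in your own rules and model here.
train_model <- function(train) lm(y ~ t, data = train)

# One-step-ahead prediction.
predict_next <- function(model, t_next) {
  unname(predict(model, newdata = data.frame(t = t_next)))
}

n_steps <- 20                      # user-definable backtest length
start <- nrow(dat) - n_steps       # first cutoff
results <- data.frame(cutoff = numeric(n_steps),
                      pred   = numeric(n_steps),
                      actual = numeric(n_steps))

for (i in seq_len(n_steps)) {
  cutoff <- start + i - 1
  model <- train_model(select_data(dat, cutoff))
  results$cutoff[i] <- cutoff
  results$pred[i]   <- predict_next(model, cutoff + 1)
  results$actual[i] <- dat$y[cutoff + 1]
}

# Performance measure collected over the whole backtest.
rmse <- sqrt(mean((results$pred - results$actual)^2))
```

Keeping selection, training, and prediction in separate functions means the cleaning and outlier rules can later be added inside `select_data` or `train_model` without touching the loop.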

Your question about outlier detection and handling also falls under those two concerns and I would go about implementing by writing smaller loops within your main program loop that would continue to "clean" and refit the model until it's reached a point where you would be happy with it (which again, you'll have to define yourself).

If this sounds like a big task, it's because it is; people have written entire software libraries (sometimes very lucratively) in order to perform this sort of task. Beyond that, it's hard to offer any more specific advice without knowing more about your processes, data structure, and the programming language you've done your work in thus far.

If any of this is useful to you and you'd like me to expand on any of it, leave a comment and I'd be more than happy to do so.
