Solved – Building a model that can estimate the equation of a parabola, trained on sample ‘trajectories’

Tags: curve-fitting, machine-learning, r, splines

I discovered a parabolic relationship between time and a quantity in my time series data that looks like the one below:
[plot: viewers vs. time, showing a parabolic shape]

How do I go about building a model that can learn the shape of these parabolas from thousands of samples, and then estimate the equation of the curve from the initial trajectory of a new series (e.g., t = 0 to 5)?

I am not very technical and I have no idea if this is a reasonable request.

The quantity is viewers per minute of a video stream; "trajectory" is metaphorical. At the end of the day I want to be able to input an array of viewer counts for the first n minutes (e.g., t = 3, 5, 8, ..., n) and get a better and better projection of the curve the more data points I give it, so that I can estimate the maximum height and the area under the curve, and do backtesting.

Some resources I found that seem relevant:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.smooth.spline.html and
Equation of a fitted smooth spline and its analytical derivative

I can fit a spline to trajectory but I don't know where to go from there:

d.spl <- with(d, smooth.spline(t, viewers))
d.spl

Call: smooth.spline(x = t, y = viewers)

Smoothing Parameter  spar= 0.2297785  lambda= 0.0000001153716 (12 iterations)
Equivalent Degrees of Freedom (Df): 60.23867
Penalized Criterion (RSS): 10483.72
GCV: 175.7969
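As a next step from the spline fit, you can evaluate the fitted curve on a fine grid with predict() and read off the two quantities you care about: the peak and the area under the curve (via the trapezoidal rule). This is a minimal sketch; the data frame d is simulated here as a stand-in, so substitute your own t/viewers columns.

```r
# Simulated stand-in for the questioner's data: a noisy parabola
# peaking around t = 30. Replace with your real data frame.
set.seed(1)
d <- data.frame(t = 0:60)
d$viewers <- 500 - 0.5 * (d$t - 30)^2 + rnorm(nrow(d), sd = 10)

d.spl <- with(d, smooth.spline(t, viewers))

# Evaluate the fitted spline on a fine grid
grid <- seq(min(d$t), max(d$t), length.out = 1000)
pred <- predict(d.spl, grid)$y

# Estimated peak height and its time
peak_height <- max(pred)
peak_time   <- grid[which.max(pred)]

# Area under the curve by the trapezoidal rule
auc <- sum(diff(grid) * (head(pred, -1) + tail(pred, -1)) / 2)
```

With the simulated data above, peak_time lands near 30 and auc near the analytic value of 21000, which is a quick sanity check that the spline is tracking the underlying parabola.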


Best Answer

A quick answer: not robust, but a good starting place, which I'll add to should I have time (I hope it's helpful). And if you like the approach and want to keep going that way, I would look into nearest-neighbor methods (k-NN).

Since the data look rather comically consistent (that's quite a pattern), I would start by choosing a simple functional form, and a parabola sounds like a great idea: a simple b - ax^2 sort of shape. You can do this in R with

fit <- lm(y ~ poly(x, 2, raw = TRUE))

where y is your viewers series and x is time, for EACH video. You'll get back the polynomial coefficients.
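Looping that fit over a collection of videos might look like the sketch below, which collects one row of coefficients (b0, b1, b2) per video. The videos list is simulated here as an assumption; in practice it would hold one data frame per video with columns x (time) and y (viewers).

```r
# Simulated stand-in: 100 videos, each a noisy parabola with a random peak.
set.seed(2)
make_video <- function() {
  x <- 0:40
  peak <- runif(1, 15, 25)
  data.frame(x = x, y = 300 - 0.4 * (x - peak)^2 + rnorm(length(x), sd = 5))
}
videos <- replicate(100, make_video(), simplify = FALSE)

# Fit y = b0 + b1*x + b2*x^2 to each video; one row of coefficients per video.
coefs <- t(sapply(videos, function(v) {
  coef(lm(y ~ poly(x, 2, raw = TRUE), data = v))
}))
colnames(coefs) <- c("b0", "b1", "b2")
```

Every b2 should come out negative here (the curves open downward), which is a cheap check that the fits are sensible.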

Then, now that you have something where hopefully you've removed some of the "noise" from the shapes, I would choose a simple distance metric (say Euclidean to start) and measure the distance between your video-to-predict (VtP), and the first k points (k minutes of viewers) of all of the videos you've seen already. (To do this, you'll need to generate k points from the curves you've fit, to compare to the k points you have from your VtP. You COULD just compare points-to-points directly, but I think there will be more over-fitting, so this might be an important regularization.)
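The distance step above can be sketched as follows: evaluate each stored video's fitted quadratic at the first k minutes, then take the Euclidean distance to the new video's first k observations. Both the coefficient matrix and the vtp vector are simulated stand-ins here.

```r
# Simulated stand-ins: coefficients for 100 stored videos, and the first
# k = 5 minutes of the video-to-predict (VtP).
set.seed(3)
coefs <- cbind(b0 = runif(100, 100, 200),
               b1 = runif(100, 10, 20),
               b2 = runif(100, -0.6, -0.2))
vtp <- c(150, 165, 178, 189, 198)
k <- length(vtp)
x <- 0:(k - 1)

# k fitted points for every stored video: rows = videos, cols = minutes.
# This compares smoothed curves to the VtP, not raw points to raw points.
fitted_pts <- coefs %*% rbind(1, x, x^2)

# Euclidean distance from each stored curve to the VtP's observed points
dists <- sqrt(rowSums((fitted_pts - matrix(vtp, nrow(coefs), k, byrow = TRUE))^2))
nearest <- order(dists)[1:5]
```

nearest now indexes the 5 stored videos whose early trajectories best match the new one.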

THEN, you do one of a few things. Either you just choose the curve that's closest and assume things will go like that until the end. You're done! But what if the closest curve isn't a great fit? Well, then you could choose the 5 nearEST neighbors and average their parameters (weighted by their closeness, maybe; lots and lots of tweaks possible), then predict with THAT curve, using

weighted_mean(b_close) - weighted_mean(a_close) * x^2
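The weighted averaging can be sketched in a few lines: weight each neighbor's coefficients by inverse distance, blend, and predict with the blended quadratic. The neighbor coefficients and distances below are made-up stand-ins for the values the k-NN step would produce.

```r
# Simulated stand-ins: full quadratic coefficients (b0, b1, b2) of the
# 3 nearest neighbors, and their distances to the VtP.
coefs_close <- rbind(c(480, 32, -0.50),
                     c(510, 30, -0.55),
                     c(495, 31, -0.48))
d_close <- c(4.2, 5.1, 7.3)

# Inverse-distance weights, normalized to sum to 1 (a weighted_mean)
w <- 1 / d_close
w <- w / sum(w)

# Blend the neighbors' coefficients, then predict with the blended curve
blend <- colSums(w * coefs_close)
predict_curve <- function(x) blend[1] + blend[2] * x + blend[3] * x^2
```

Because the weights sum to 1, each blended coefficient stays inside the range spanned by the neighbors, so the predicted curve is a sensible compromise between them.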

The biggest problem with this approach is that assuming a parabola is quite strict, but conceptually it's easier to think about mixing several polynomials than mixing several splines, and the mixing will probably be important. You'll want to express each new curve as some combination of the old curves, and that requires a mixable way of expressing them.

You could just try to average the five nearest-neighbor-curves directly, as well, but it will be harder to extrapolate to trajectories unlike those you've seen before -- with a parametric form, that would be easier.
