Solved – Problems with modeling a cumulative dependent variable

predictive-models, regression

I am building a .NET program. One of its functions is to provide a predictive model for a vehicle's life-to-date maintenance costs: basically, what is the cumulative cost (Y) for a vehicle at a specific year (X)? I decided to use a second-degree polynomial least squares fit, and for the most part it does a good job. Sometimes, though, the curve peaks and starts trending downward, which doesn't make sense for life-to-date cost since it's a cumulative value: Y at year X should never be lower than Y at year X-1.

This negative trend happens when the increase in cost from, say, year 2 to year 3 is smaller than the increase from year 1 to year 2. Some sample data that gives me a negative trend:

(1,328.76)
(2,1133.12)
(3,1366.07)
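
To see why these three points produce a downward trend, here is a minimal sketch in Python (the question's program is .NET; Python is used here only for brevity). With exactly three points at distinct years, a second-degree least squares fit passes through all of them, so the coefficients can be computed directly. The helper names `quadratic_through` and `predict` are made up for this illustration.

```python
# Sample data from the question: (year, cumulative cost).
points = [(1, 328.76), (2, 1133.12), (3, 1366.07)]


def quadratic_through(p1, p2, p3):
    """Return (a, b, c) of y = a*x^2 + b*x + c through three points.

    Uses the Lagrange interpolation form, expanded into coefficients.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d1 = (x1 - x2) * (x1 - x3)
    d2 = (x2 - x1) * (x2 - x3)
    d3 = (x3 - x1) * (x3 - x2)
    a = y1 / d1 + y2 / d2 + y3 / d3
    b = -(y1 * (x2 + x3) / d1 + y2 * (x1 + x3) / d2 + y3 * (x1 + x2) / d3)
    c = y1 * x2 * x3 / d1 + y2 * x1 * x3 / d2 + y3 * x1 * x2 / d3
    return a, b, c


a, b, c = quadratic_through(*points)

# A parabola with a < 0 opens downward and peaks at its vertex.
peak_year = -b / (2 * a)


def predict(x):
    """Evaluate the fitted quadratic at year x."""
    return a * x * x + b * x + c


print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}")
print(f"curve peaks at year {peak_year:.2f}")
print(f"predicted cumulative cost at year 4: {predict(4):.2f}")
```

Because the year-on-year increase shrinks, the leading coefficient comes out negative, the parabola peaks a little before year 3, and the year 4 prediction falls below the observed year 3 value.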

My solution for now is to check for a negative trend and, if it's found, use a linear best fit instead, but I feel like that's a messy fix. I've thought about implementing some sort of minimum value for the change from year to year, essentially turning the curve into a straight line at a certain X value, but that seems complicated to implement. Does anyone see a better way of doing this or a better model to use? I'm not very knowledgeable about statistics, so go easy on me :-p
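
For reference, the fallback described above could look roughly like the following Python sketch. It is an illustration under assumptions, not the actual .NET code: the function names `polyfit`, `evaluate`, and `fit_cumulative` and the default `horizon` of 5 years are invented here. The check relies on the slope of a quadratic being linear in X, so if the slope is non-negative at both ends of the range of interest, it is non-negative everywhere in between.

```python
def polyfit(xs, ys, degree):
    """Least squares polynomial fit via the normal equations.

    Returns [c0, c1, ..., c_degree] for y = c0 + c1*x + ... + c_degree*x^degree.
    """
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - s) / m[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate a polynomial given as [c0, c1, ...] at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def fit_cumulative(xs, ys, horizon=5):
    """Quadratic fit, falling back to a straight line if the quadratic
    decreases anywhere between the first observed year and `horizon`
    years past the last observed year."""
    quad = polyfit(xs, ys, 2)
    _, c1, c2 = quad
    lo, hi = min(xs), max(xs) + horizon
    # Slope is c1 + 2*c2*x; checking both endpoints covers the whole range.
    if c1 + 2 * c2 * lo >= 0 and c1 + 2 * c2 * hi >= 0:
        return quad
    return polyfit(xs, ys, 1)


# The sample data from the question triggers the linear fallback.
model = fit_cumulative([1, 2, 3], [328.76, 1133.12, 1366.07])
print(len(model) - 1, "degree model, year 8 estimate:", round(evaluate(model, 8), 2))
```

Checking the slope over the whole forecast window, rather than only at the last data point, catches curves that are still rising at the final observation but turn over within the extrapolation range.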

Edit

Each vehicle has a varying amount of data depending on how long it's been in service, with a soft max at 15 years. So the last data point for each vehicle is for the most recent year (2011 in this case), and we are really only interested in extrapolating 5 years beyond that point. As we use the model from year to year, we will get more data for the vehicles, which will require the model to be updated. That's why I chose the polynomial least squares fit: it's easy to just run the new data back through that function and get a new equation.

Best Answer

Using a squared term to capture a curvilinear relationship in your data is fine, but it's worth noting that this can only capture one exact shape of curvature. This can be very limiting in practice. What you want typically is to use smoothing splines. I wrote about this here: What are the advantages / disadvantages of using splines, smoothed splines, and Gaussian process emulators?, which may be helpful to read. However, these techniques are somewhat advanced, and I doubt they can be done in Excel. Moreover, extrapolation is very hard to do well (here's a humorous example), and extrapolation is especially likely to go awry when using splines. Most likely, you will need to work with a statistical consultant.
