Regression – Calculating Confidence Intervals for Log-Log Regression in Python

logarithm, modeling, python, regression

I've attempted to build a simple regression model with the following data points (scatterplot omitted),

where x is some metric related to a product, and y is the sales of that product over a year. The aim is to build a model I can use to make a very rough estimate of a product's returns in its first year.

You'll notice both axes are on a log scale. I fitted a linear log-log regression model using SciPy's curve_fit like so:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

x = np.array(data.metric)
y = np.array(data.revenue)

plt.figure(figsize=(15, 7))
plt.plot(x, y, 'x')

# Power-law model: y = exp(p2) * x**p1, fit by least squares on the original y scale
def func(x, p1, p2):
    return np.exp(p1 * np.log(x) + p2)

popt, pcov = curve_fit(func, x, y)
fittedYData = func(x, popt[0], popt[1])

plt.plot(x, fittedYData, 'r-')

plt.xscale('log')
plt.yscale('log')
plt.show()

Question 1

Have I picked the right model?

The fit seems pretty good as far as I can tell, although the line of best fit looks a bit high. Given the skewed nature of financial data, where the mean is pulled towards higher revenues, I was wondering whether a median-based model such as quantile regression might be better than the mean-based least-squares fit used here. Or something completely different? A minimal sketch of what I mean is below.
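For reference, here's a minimal sketch of the quantile-regression alternative I have in mind, using statsmodels' QuantReg on the log-transformed data (assuming x and y are the arrays from above):

import statsmodels.api as sm

# Median (0.5-quantile) regression of log(y) on log(x)
X = sm.add_constant(np.log(x))
median_fit = sm.QuantReg(np.log(y), X).fit(q=0.5)
print(median_fit.params)  # intercept and slope on the log-log scale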

Question 2

How would I best introduce confidence bands so I can ask questions such as: "If I have a product with a metric value of X, what's the range within which I can confidently expect its returns after a year to lie?"

Pointers to resources also welcome!

Best Answer

A scatterplot won't convince anyone that you have (or haven't) fit the right model. That said, the so-called line of best fit should, in fact, be a best fit. It should be noted that the eye is typically a bad judge of the "best fit" in terms of minimized squared residuals: the line our eye prefers tends to be a bit flatter than OLS, because squared errors weight large residuals much more heavily than we expect. Even allowing for that, I can plainly see the fitted line here is a poor fit.

There are two regressions that linearly relate the log of x to the log of y. The first is a non-linear least squares fit that minimizes the squared residuals on the untransformed y scale; that's what you've fit here. The other is an OLS of log y on log x, which minimizes the squared errors on the log-y scale.
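To see the difference concretely, here is a minimal sketch of both fits (assuming x and y are the arrays from the question):

import numpy as np
from scipy.optimize import curve_fit

# Fit 1: non-linear least squares on the original y scale (the question's fit)
popt, _ = curve_fit(lambda x, p1, p2: np.exp(p1 * np.log(x) + p2), x, y)

# Fit 2: OLS of log(y) on log(x), minimizing squared errors on the log scale
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)

print("NLS on original scale: slope %.3f, intercept %.3f" % (popt[0], popt[1]))
print("OLS on log scale:      slope %.3f, intercept %.3f" % (slope, intercept))

The two lines will generally differ because each method minimizes a different loss.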

Because the log is a concave function, the non-linear least squares approach tends to produce a line that looks "too high" when plotted on the log scale: the positive residuals have a larger impact on the original scale. That is why your line of best fit looks bad when plotted in log-x, log-y coordinates.

If you want standard confidence bands and a nice-looking line of best fit, run an OLS of the log of y on the log of x.
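A minimal sketch with statsmodels (again assuming x and y from the question); get_prediction gives both confidence bands for the mean and wider prediction intervals for a single new product, which is what your Question 2 is really asking about:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# OLS of log(y) on log(x)
X = sm.add_constant(np.log(x))
res = sm.OLS(np.log(y), X).fit()

# Evaluate the bands on a grid of new metric values
x_new = np.linspace(x.min(), x.max(), 100)
pred = res.get_prediction(sm.add_constant(np.log(x_new)))
bands = pred.summary_frame(alpha=0.05)  # 95% intervals

plt.plot(x, y, 'x')
plt.plot(x_new, np.exp(bands['mean']), 'r-')
plt.fill_between(x_new, np.exp(bands['obs_ci_lower']),
                 np.exp(bands['obs_ci_upper']), alpha=0.2)
plt.xscale('log')
plt.yscale('log')
plt.show()

Note that exponentiating the fitted mean of log y gives (under roughly normal errors on the log scale) the conditional median of y rather than its mean, which also speaks to the median-versus-mean concern in your Question 1.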
