Solved – Improvement of regression model

model selection, multiple regression, polynomial, r, regression

I am just learning R. I have developed a regression model with six predictor variables. While developing it, I found that the relationships are not very linear, which may be why my model's predictions are not accurate.

Here are the headers of my data set:

1. bouncerate (to be predicted)
2. avgServerResponseTime
3. avgServerConnectionTime
4. avgRedirectionTime
5. avgPageDownloadTime
6. avgDomainLookupTime
7. avgPageLoadTime

Sample rows:

28.57142857,4.132,0.234,0,0.505,0,14.168
42.85714286,3.356777778,0.090777778,0.077333333,0.459,0.105444444,14.78644444
0,3.372,0.1105,0.0015,0.425,0.1305,34.3425
33.33333333,3.583,0.218,0,0.385,0.649,11.816
66.66666667,2.438,0.234,0,0.3405,0,8.645
100,2.805,0.179666667,3.203666667,0.000333333,0.11,13.47066667
66.66666667,0.977,0,0.003,0,0,12.847
0,2.776,0,7.888,0,0,14.393
100,2.59,0.261,0,0.517,0,6.216

Here is the summary of my model:

Call:
lm(formula = y ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6)

Residuals:
     Min       1Q   Median       3Q      Max 
-125.302  -26.210    0.702   26.261  111.511 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 48.62944    0.27999 173.684  < 2e-16 ***
x_1         -0.67831    0.08053  -8.423  < 2e-16 ***
x_2          0.07476    0.49578   0.151 0.880143    
x_3         -0.22981    0.06489  -3.541 0.000399 ***
x_4          0.01845    0.09070   0.203 0.838814    
x_5          3.76952    0.67006   5.626 1.87e-08 ***
x_6          0.07698    0.01565   4.919 8.75e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 33.76 on 19710 degrees of freedom
Multiple R-squared: 0.006298,   Adjusted R-squared: 0.005995 
F-statistic: 20.82 on 6 and 19710 DF,  p-value: < 2.2e-16

Plots of bouncerate against each single predictor are below:
[scatterplots: bouncerate vs avgServerConnectionTime, avgServerResponseTime, avgRedirectionTime, avgDomainLookupTime, and avgPageLoadTime]

I have a few questions about this model:

  1. Is there any way to improve the accuracy of this model?
  2. Which of these values is most useful for choosing the best model: residual standard error, degrees of freedom, multiple R-squared, adjusted R-squared, the F-statistic, or the p-values?
  3. Is it appropriate to use polynomial transformations with these data?
  4. If I do use polynomial terms in my model, which degree is most appropriate?

Best Answer

@Roland is correct that it's hard to say much without knowing what you're doing, substantively speaking. However, there are a few remarks we can still make. They fall into three categories: diagnosing why the model isn't good, making it better, and demonstrating that it is better.

Diagnostics

R has good linear model diagnostics. Apply them, and read up enough to know what they are telling you. To see all the available ones:

model <- lm(formula = y ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6)
plot(model, which = 1:6) ## all six diagnostic plots

Each addresses a possible failing. You might check for linearity and interactions first, because you have enough data to do something about them.

Making it better

You have lots of data. This means that if there is non-linearity you can potentially learn its form from the data. A generalised additive model (GAM) would be a good start and will probably work better than some random set of polynomials. If you don't want to, or can't, do that, then at least some splines might be helpful.
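For concreteness, here is a minimal sketch using the mgcv package, assuming your data sit in a data frame called dat with the column names from your question; the spline alternative uses the splines package inside an ordinary lm.

library(mgcv)    ## GAMs
library(splines) ## natural splines for use inside lm()

## GAM: let the data choose the shape of each term
gam_fit <- gam(bouncerate ~ s(avgServerResponseTime) + s(avgServerConnectionTime) +
                 s(avgRedirectionTime) + s(avgPageDownloadTime) +
                 s(avgDomainLookupTime) + s(avgPageLoadTime),
               data = dat)
summary(gam_fit)          ## effective df well above 1 suggests non-linearity
plot(gam_fit, pages = 1)  ## the estimated smooths, all on one page

## Spline alternative: natural splines inside an ordinary lm()
spline_fit <- lm(bouncerate ~ ns(avgServerResponseTime, df = 3) +
                   ns(avgPageLoadTime, df = 3) + avgServerConnectionTime +
                   avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime,
                 data = dat)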

Also, work your way through the interactions that make sense. These will generate apparent non-linearity and spoil predictions if not modeled. Read up about R's formula interface to see how to specify them.
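For example, with an arbitrary pairing of variables just to show the syntax (again assuming a data frame dat):

## '*' expands to both main effects plus their interaction; ':' would give the interaction alone
lm(bouncerate ~ avgPageLoadTime * avgServerResponseTime +
     avgServerConnectionTime + avgRedirectionTime +
     avgPageDownloadTime + avgDomainLookupTime,
   data = dat)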

Polynomials can work, but without knowing what your data actually are it's hard to say whether they'd be a good idea. Also hard to say, for the same reasons, is whether your predictor variables might be usefully transformed (logged, etc.).
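If you do try them, poly() and log-type transformations are the usual tools; which variables to transform and what degree to use are guesses here, not recommendations:

## quadratic term via poly(); log1p() copes with the zeros in the timing columns
lm(bouncerate ~ poly(avgServerResponseTime, 2) + log1p(avgPageLoadTime) +
     avgServerConnectionTime + avgRedirectionTime +
     avgPageDownloadTime + avgDomainLookupTime,
   data = dat)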

Confirming it's better

Since your only task is to make the model predict better, the only quantity worth working with is held-out prediction error. Do whatever you do on a subset of the data, then try it out on the held-out set. (Done repeatedly over different splits, this is cross-validation.) You have to decide what counts as predicting 'better' in the context of your problem, but a common choice is root mean squared error. Here again I'm assuming that you actually do have data that is potentially conditionally normal, as your choice of lm implies.

Practically, this would involve writing a function to compute that quantity (or one suitably like it) from a set of predictions and a set of held-out data points. Then do your fiddling around and model optimisation on the other part of the data, use predict to get predictions on the held-out set, and apply the function.
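A minimal sketch of that workflow, with an 80/20 split and RMSE as arbitrary but common choices, again assuming a data frame called dat:

## root mean squared error of predictions against held-out observations
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

set.seed(1)                                            ## reproducible split
train_idx <- sample(nrow(dat), floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit  <- lm(bouncerate ~ ., data = train)  ## fit on the training part only
pred <- predict(fit, newdata = test)      ## predict the held-out rows
rmse(test$bouncerate, pred)               ## held-out prediction error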

Note that performance on held-out data is not any of the quantities you are wondering about. Those are all in-sample measures and will typically overestimate prediction performance on new data.

Caveats

Finally, note that prediction may just be hard. You may not have the right variables: most likely some important ones are missing, and you can do nothing more about that without knowing what they are.

And that's about as much generic advice as can be given for a bunch of variables called $y, x_1, \ldots, x_6$...