There are two issues at play here: The mathematics of statistics, and the conventions of communication of statistics. You're right that it's unconventional to report $R^2$ for a correlation, at least in most fields. But there's nothing wrong with it mathematically.
You can see this more clearly if you consider the case of simple univariate linear regression (a regression model with one continuous outcome and one continuous predictor). To demonstrate, I'll use the iris
dataset, which comes built into R. Here are the first six lines:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
I can calculate the correlation between Sepal.Length and Sepal.Width:
> cor(iris$Sepal.Length, iris$Sepal.Width)
[1] -0.1175698
I'll square that correlation and save it as Rsq for comparison with the regression output:
> r <- cor(iris$Sepal.Length, iris$Sepal.Width)
> Rsq <- r^2
A simple linear regression predicting Sepal.Length from Sepal.Width:
> summary(lm(Sepal.Length ~ Sepal.Width, data = iris))
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
Residuals:
    Min      1Q  Median      3Q     Max
-1.5561 -0.6333 -0.1120  0.5579  2.2226

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.5262     0.4789   13.63   <2e-16 ***
Sepal.Width  -0.2234     0.1551   -1.44    0.152
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8251 on 148 degrees of freedom
Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
> Rsq
[1] 0.01382265
Note that the Multiple R-squared statistic reported is exactly the same as the squared correlation between the two variables. Of course, this works just as well if you reverse which variable is the predictor and which is the outcome in the regression model:
> summary(lm(Sepal.Width ~ Sepal.Length, data = iris))
Call:
lm(formula = Sepal.Width ~ Sepal.Length, data = iris)
Residuals:
    Min      1Q  Median      3Q     Max
-1.1095 -0.2454 -0.0167  0.2763  1.3338

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.41895    0.25356   13.48   <2e-16 ***
Sepal.Length -0.06188    0.04297   -1.44    0.152
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4343 on 148 degrees of freedom
Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
When you have more than one predictor in a regression model, then $R^2$ is the squared multiple correlation (that is, the squared correlation between the observed outcome and the model's fitted values) instead of just the squared bivariate correlation. But the idea behind it is very much the same.
The conventions around reporting statistics often obscure how similar many of our tests and measures are; $r$ and $R^2$ are a great example of that.
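The multiple-correlation identity is easy to check by hand in any language. Here is a minimal pure-Python sketch with made-up data (two predictors, least squares solved via the centered normal equations) confirming that $R^2$ equals the squared correlation between the outcome and the fitted values:

```python
import statistics as st

# Made-up data: one outcome y, two predictors x1 and x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y  = [1.2, 1.9, 3.4, 3.1, 5.2, 4.8]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Center everything so the intercept drops out of the normal equations.
c1 = [v - st.mean(x1) for v in x1]
c2 = [v - st.mean(x2) for v in x2]
cy = [v - st.mean(y) for v in y]

# Two-predictor least squares via Cramer's rule on the 2x2 normal equations.
det = dot(c1, c1) * dot(c2, c2) - dot(c1, c2) ** 2
b1 = (dot(c1, cy) * dot(c2, c2) - dot(c2, cy) * dot(c1, c2)) / det
b2 = (dot(c1, c1) * dot(c2, cy) - dot(c1, c2) * dot(c1, cy)) / det
yhat = [st.mean(y) + b1 * u + b2 * v for u, v in zip(c1, c2)]

# R^2 from its definition ...
ss_tot = dot(cy, cy)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
r_squared = 1 - ss_res / ss_tot

# ... versus the squared correlation between y and the fitted values.
ch = [f - st.mean(yhat) for f in yhat]
multiple_r = dot(cy, ch) / (dot(cy, cy) * dot(ch, ch)) ** 0.5

print(abs(r_squared - multiple_r ** 2) < 1e-9)  # True: they are the same quantity
```

With a single predictor the fitted values are just a linear rescaling of that predictor, which is why the multiple correlation collapses to the ordinary bivariate $r$.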
The coefficient of determination is defined for the model as a whole, not for individual variables. However, there is a technique called analysis of variance (ANOVA) which can roughly be thought of as breaking $R^2$ into contributions from each variable.
Recall that the coefficient of determination is defined in terms of the sums of squares of residuals:
$$
\begin{align}
R^2 & = 1 - \frac{SS_\text{res}}{SS_\text{tot}} \\
SS_\text{tot} & = \sum_{i=1}^n (y_i - \bar{y})^2 \\
SS_\text{res} & = \sum_{i=1}^n (y_i - \hat{y}_i)^2
\end{align}
$$
where $\hat{y}$ is the vector of the model's predictions. Since we can't compute $\hat{y}$ without considering all of the variables in the model, $R^2$ is a property of the model as a whole.
But look at the equation for $SS_\text{tot}$ again. It has exactly the same form as $SS_\text{res}$ would for a trivial model with only an intercept term; such a model would predict $\hat{y}_i = \bar{y}$ for all $i$. This suggests that we are not comparing one model to some platonic ideal, but actually comparing two different models. This insight can be generalized into a chain of models.
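To make that concrete, here is a tiny pure-Python sketch (with made-up numbers) confirming that $SS_\text{tot}$ is literally the residual sum of squares of the intercept-only model:

```python
import statistics as st

y = [3.1, 4.0, 2.5, 5.2, 4.4]   # made-up outcome values
ybar = st.mean(y)

# SS_tot from its definition
ss_tot = sum((yi - ybar) ** 2 for yi in y)

# SS_res of the intercept-only model, which predicts yhat_i = ybar for all i
yhat = [ybar] * len(y)
ss_res_intercept_only = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))

print(ss_tot == ss_res_intercept_only)  # True: they are the same sum
```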
Starting from the intercept-only model (for which $SS_0 = SS_\text{tot}$) and adding one variable at a time, the residual sum of squares $SS_j$ shrinks at each step. The quantity $\frac{SS_{j-1} - SS_j}{SS_\text{tot}}$ can be interpreted as the "amount of variance explained" by the $j$-th variable, and these contributions, together with the unexplained remainder, telescope to one:

$$ \frac{SS_0 - SS_1}{SS_\text{tot}} + \frac{SS_1 - SS_2}{SS_\text{tot}} + \cdots + \frac{SS_{k-1} - SS_k}{SS_\text{tot}} + \frac{SS_k}{SS_\text{tot}} = 1 $$

As a concrete example, here is the output of the anova() function applied to a linear model fit to the built-in airquality dataset:
Analysis of Variance Table

Response: Ozone
           Df Sum Sq Mean Sq F value    Pr(>F)
Solar.R     1  14780   14780 33.9704 6.216e-08 ***
Wind        1  39969   39969 91.8680 5.243e-16 ***
Temp        1  19050   19050 43.7854 1.584e-09 ***
Month       1   1701    1701  3.9101   0.05062 .
Day         1    619     619  1.4220   0.23576
Residuals 105  45683     435
This is called the "sequential" analysis of variance. The Sum Sq column sums to the total sum of squares of the entire dataset, so we can see that Wind explains roughly twice as much variance as Temp. This interpretation is subject to many caveats: it is sensitive to the order in which variables are added, the F statistics and associated p-values are only meaningful under the usual linear-model assumptions, and so on. Nevertheless, if we take that Sum Sq column and divide by the total sum of squares:
Solar.R   0.12
Wind      0.33
Temp      0.16
Month     0.01
Day       0.01
Residuals 0.38
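That division is just arithmetic on the printed Sum Sq values; a short Python sketch (numbers copied from the ANOVA table above) reproduces it:

```python
# Sum Sq values copied from the sequential ANOVA table above.
sum_sq = {"Solar.R": 14780, "Wind": 39969, "Temp": 19050,
          "Month": 1701, "Day": 619, "Residuals": 45683}
total = sum(sum_sq.values())              # total sum of squares

for name, ss in sum_sq.items():
    print(f"{name:<9} {ss / total:.2f}")  # proportion of variance
```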
We get a table where every line item is roughly analogous to an "$R^2$" for each variable (plus one line item for the unexplained residual), although that terminology is never used as far as I know; people talk about the proportion of variance explained instead.
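The chain-of-models bookkeeping is easy to replicate outside R as well. Here is a pure-Python sketch with made-up data (two predictors added in the order x1 then x2), showing that the per-variable shares plus the residual share telescope to one:

```python
import statistics as st

# Made-up data; predictors are added in the order x1, then x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y  = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Center everything so the intercept can be ignored in each fit.
c1 = [v - st.mean(x1) for v in x1]
c2 = [v - st.mean(x2) for v in x2]
cy = [v - st.mean(y) for v in y]

# SS_0: residual sum of squares of the intercept-only model (= SS_tot).
ss0 = dot(cy, cy)

# SS_1: residual SS after adding x1 (simple regression, closed-form slope).
b = dot(c1, cy) / dot(c1, c1)
ss1 = sum((yv - b * u) ** 2 for yv, u in zip(cy, c1))

# SS_2: residual SS after also adding x2 (two-predictor fit, Cramer's rule).
det = dot(c1, c1) * dot(c2, c2) - dot(c1, c2) ** 2
b1 = (dot(c1, cy) * dot(c2, c2) - dot(c2, cy) * dot(c1, c2)) / det
b2 = (dot(c1, c1) * dot(c2, cy) - dot(c1, c2) * dot(c1, cy)) / det
ss2 = sum((yv - b1 * u - b2 * v) ** 2 for yv, u, v in zip(cy, c1, c2))

share_x1  = (ss0 - ss1) / ss0   # variance "explained by" x1
share_x2  = (ss1 - ss2) / ss0   # additional variance explained by x2
share_res = ss2 / ss0           # unexplained remainder

print(abs(share_x1 + share_x2 + share_res - 1.0) < 1e-9)  # True: telescopes to 1
```

Reordering the predictors changes the individual shares (that is the order-sensitivity caveat above) but never their total.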
Here are some additional resources if you want to read further:
- https://math.stackexchange.com/questions/1792351/sequential-anova-r
- https://astrostatistics.psu.edu/su07/R/html/stats/html/anova.lm.html
- http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/HseqRegnSsq/seqRegnSsq4.html
Best Answer
I notice that for these sorts of questions there is always a lot of pedantry in the community about the use of the term "correlation". We non-statisticians use the term to generally mean "relationship", but some people might not read it that way. So, as others have told you, the correlation coefficient only measures linear association, so it isn't the right summary for a non-linear relationship such as a quadratic one. However, you can compute the root mean squared error and adjusted R-squared, which will tell you about the "goodness of fit" of your model. You can also do an F-test, which will tell you how much better your model is compared to a degenerate model consisting of only a constant term. All of these measures can be computed in Matlab using the function fitnlm. I know it's been a while since this question was posted, so you have probably figured this out, but this could still help others. Best of luck.
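For readers who want to see the mechanics rather than Matlab's fitnlm, here is a pure-Python sketch with made-up data that fits a quadratic by least squares (treating $x$ and $x^2$ as two predictors) and computes RMSE, R-squared, adjusted R-squared, and the F-statistic against the constant-only model:

```python
import statistics as st

# Made-up data with a clearly nonlinear (roughly quadratic) trend.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
y = [8.9, 3.8, 1.2, 0.1, 1.1, 4.2, 8.8, 16.1]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Treat x and x^2 as two predictors; center them so the intercept drops out.
xsq = [v * v for v in x]
c1 = [v - st.mean(x) for v in x]
c2 = [v - st.mean(xsq) for v in xsq]
cy = [v - st.mean(y) for v in y]

# Least-squares quadratic fit y ~ 1 + x + x^2 via Cramer's rule.
det = dot(c1, c1) * dot(c2, c2) - dot(c1, c2) ** 2
b1 = (dot(c1, cy) * dot(c2, c2) - dot(c2, cy) * dot(c1, c2)) / det
b2 = (dot(c1, c1) * dot(c2, cy) - dot(c1, c2) * dot(c1, cy)) / det
resid = [yv - b1 * u - b2 * v for yv, u, v in zip(cy, c1, c2)]

n, p = len(y), 3                  # n observations, p fitted parameters
ss_tot = dot(cy, cy)              # residual SS of the constant-only model
ss_res = dot(resid, resid)
r_squared = 1 - ss_res / ss_tot
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p)
rmse = (ss_res / n) ** 0.5
# F-statistic comparing the quadratic model to the constant-only model
f_stat = ((ss_tot - ss_res) / (p - 1)) / (ss_res / (n - p))

print(f"RMSE={rmse:.3f}  R2={r_squared:.3f}  adjR2={adj_r_squared:.3f}  F={f_stat:.1f}")
```

The F-statistic here plays exactly the role described above: it compares the fitted model against the degenerate constant-only model.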