Solved – Different regression coefficients in R and Excel

Tags: excel, linear regression, regression coefficients

I'm asking this as a separate question because, whilst this has been answered for polynomial regression, that solution doesn't work for me.

I'm performing a simple linear regression. Both R and Excel give the same p-values, $R^2$ values and F statistics, but they give different coefficients (and therefore different equations to put on the graph?).

This is the data:

suva Std dev heather sedge sphagnum
12.880545 0.061156645 0.15 0.525 0
7.098873327 0.026878039 0.2275 0 0
8.660688381 0.04037841 0.425 0.25 0
7.734546932 0.021618446 0.225 0.3875 0
16.70696048 0.103626684 0.15 0.075 0
9.763315183 0.013387158 0.25 0.075 0
12.91735434 0.008076468 0.22 0.22 0
19.94153851 0.150798057 0.0375 0.35 0.225
17.25115559 0.052229596 0.0625 0.2625 0.225
15.38596941 0.05429447 0.1125 0.45 0.025
15.53714185 0.05933884 0.1625 0.525 0.0625
14.11551229 0.064579437 0.1875 0.35 0.1375
14.88575569 0.0189853 0.3375 0.3 0
12.32229733 0.043085602 0.0875 0.1375 0
17.23861185 0.071705699 0.15 0.1375 0
11.50832463 0.1125 0.0875 0.075
14.4810484 0.078476821 0.0375 0.125 0.0625
9.110262652 0.077306938 0.145 0.35 0.0125
10.8571733 0.02681341 0.0375 0.525 0
9.589339421 0.01892435 0.2275 0 0
7.260373588 0.014538237 0.425 0.25 0
11.11099161 0.022802578 0.225 0.3875 0
10.81488848 0.047587818 0.15 0.075 0
8.224131957 0.031126904 0.25 0.075 0
8.818607863 0.002855409 0.22 0.22 0
11.53999863 0.031465613 0.0375 0.35 0.225
14.92784964 0.069998663 0.0625 0.2625 0.225
9.666480932 0.02387741 0.1125 0.45 0.025
12.51000758 0.016960259 0.1625 0.525 0.0625
13.32611463 0.033670382 0.1875 0.35 0.1375
16.76535191 0.029613698 0.3375 0.3 0
11.24615281 0.008440059 0.0875 0.1375 0
10.60564875 0.003930792 0.15 0.1375 0
11.82909125 0.036017582 0.1125 0.0875 0.075
18.2337185 0.143451512 0.0375 0.125 0.0625
10.6226222 0.020561242 0.145 0.35 0.0125

Excel gives the following output:

Regression Statistics

  Multiple R         0.420431825
  R Square           0.176762919
  Adjusted R Square  0.152550064
  Standard Error     3.054032651
  Observations       36

ANOVA

              df   SS            MS           F            Significance F
  Regression   1   68.0914386    68.0914386   7.300374813  0.010676279
  Residual    34   317.1219248   9.327115435
  Total       35   385.2133634

Coefficients

             Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
  Intercept   14.69082425   0.996953237      14.7357205     2.50758E-16    12.66477151   16.71687699
  Heather%   -13.78771857   5.102932589     -2.701920579    0.010676279   -24.15812531   -3.417311838

Whilst R gives the following output:

HS <- lm(heather ~ suva, data=data1)
summary(HS)

Call:
lm(formula = heather ~ suva, data = data1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.149941 -0.063070 -0.009812  0.038370  0.225804 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.326633   0.060733   5.378 5.54e-06 ***
suva        -0.012820   0.004745  -2.702   0.0107 *  

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09313 on 34 degrees of freedom
Multiple R-squared:  0.1768,    Adjusted R-squared:  0.1526
F-statistic: 7.3 on 1 and 34 DF,  p-value: 0.01068

Why are they different in terms of their coefficients? Which one is correct?

Best Answer

The coefficients differ because the relation between x and y is reversed in one of the two cases.

Note that

  • in your R output the slope is the coefficient of 'suva' (heather regressed on suva),
  • whereas in your Excel output the slope is the coefficient of 'Heather%' (suva regressed on heather).

The following code shows how R can fit both cases:

lm(suva ~ heather, data = as.data.frame(data))

Call:
lm(formula = suva ~ heather, data = as.data.frame(data))

Coefficients:
(Intercept)      heather  
      14.65       -13.60  

lm(heather ~ suva, data = as.data.frame(data))

Call:
lm(formula = heather ~ suva, data = as.data.frame(data))

Coefficients:
(Intercept)         suva  
    0.32524     -0.01276  

The rest of the code, constructing the data, is below. (Note that row 16 of the posted data is missing its 'Std dev' value; the remaining values in that row are shifted here to fill the gap, which is likely why the coefficients above differ slightly from the 14.69 and -13.79 in the Excel output.)

data <- c(
12.880545,   0.061156645, 0.15   , 0.525,   0,
7.098873327, 0.026878039, 0.2275,  0   ,0,
8.660688381, 0.04037841 , 0.425 ,  0.25 ,   0,
7.734546932, 0.021618446, 0.225 , 0.3875,  0,
16.70696048, 0.103626684, 0.15  ,  0.075,   0,
9.763315183, 0.013387158, 0.25  ,  0.075,   0,
12.91735434, 0.008076468, 0.22  ,  0.22 ,   0,
19.94153851, 0.150798057, 0.0375,  0.35 ,   0.225,
17.25115559, 0.052229596, 0.0625,  0.2625,  0.225,
15.38596941, 0.05429447 , 0.1125,  0.45 ,   0.025,
15.53714185, 0.05933884 , 0.1625,  0.525,   0.0625,
14.11551229, 0.064579437, 0.1875,  0.35 ,   0.1375,
14.88575569, 0.0189853  , 0.3375,  0.3, 0,
12.32229733, 0.043085602, 0.0875,  0.1375,  0,
17.23861185, 0.071705699, 0.15  ,  0.1375,  0,
11.50832463,     0.1125 , 0.0875,  0.075, 0,
14.4810484,  0.078476821, 0.0375,  0.125,   0.0625,
9.110262652, 0.077306938, 0.145 ,  0.35 ,   0.0125,
10.8571733,  0.02681341 , 0.0375,  0.525,   0,
9.589339421, 0.01892435 , 0.2275,  0  , 0,
7.260373588, 0.014538237, 0.425 ,  0.25 ,   0,
11.11099161, 0.022802578, 0.225 ,  0.3875 , 0,
10.81488848, 0.047587818, 0.15  ,  0.075  , 0,
8.224131957, 0.031126904, 0.25  ,  0.075  , 0,
8.818607863, 0.002855409, 0.22  ,  0.22   , 0,
11.53999863, 0.031465613, 0.0375,  0.35   , 0.225,
14.92784964, 0.069998663, 0.0625,  0.2625 , 0.225,
9.666480932, 0.02387741 , 0.1125,  0.45   , 0.025,
12.51000758, 0.016960259, 0.1625,  0.525  , 0.0625,
13.32611463, 0.033670382, 0.1875,  0.35   , 0.1375,
16.76535191, 0.029613698, 0.3375,  0.3 ,0,
11.24615281, 0.008440059, 0.0875,  0.1375,  0,
10.60564875, 0.003930792, 0.15  ,  0.1375,  0,
11.82909125, 0.036017582, 0.1125,  0.0875 , 0.075,
18.2337185,  0.143451512, 0.0375,  0.125  , 0.0625,
10.6226222,  0.020561242, 0.145 ,  0.35   , 0.0125
)
data <- matrix(data, nrow = 36, byrow = TRUE)  # 36 rows of 5 values each
colnames(data) <- c("suva", "Std dev", "heather", "sedge", "sphagnum")

Why, then, is $R^2$ still the same?

There is a certain symmetry in the situation. In simple linear regression, the slope coefficient is the correlation coefficient scaled by the ratio of the standard deviations of the $y$ and $x$ data:

$$\hat\beta_{y \sim x} = r_{xy} \frac{s_y}{s_x}$$
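As a quick numerical check of this identity (a minimal sketch using only base R and the 'data' matrix constructed above):

x <- data[, "suva"]
y <- data[, "heather"]
cor(x, y) * sd(y) / sd(x)  # correlation scaled by the sd ratio...
coef(lm(y ~ x))[["x"]]     # ...equals the lm() slope, about -0.0128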

The standard deviation of the model's fitted values is then:

$$s_{mod} = \hat\beta_{y \sim x} s_x = r_{xy} s_y$$
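This too can be verified directly; the absolute value is needed here because the correlation is negative:

fit <- lm(heather ~ suva, data = as.data.frame(data))
sd(fitted(fit))         # s_mod, the sd of the model's fitted values
abs(cor(x, y)) * sd(y)  # |r_xy| * s_y gives the same number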

The ratio of the model variance to the variance of the data is then:

$$R^2 = \left( \frac{s_{mod}}{s_y} \right)^2= r_{xy}^2$$
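So both directions of the regression report the same $R^2$: it is simply the squared correlation, which is also the product of the two slopes. A quick confirmation in R, again using the 'data' matrix from above:

cor(x, y)^2  # the squared correlation
summary(lm(suva ~ heather, data = as.data.frame(data)))$r.squared
summary(lm(heather ~ suva, data = as.data.frame(data)))$r.squared
# the product of the two slopes gives the same value:
coef(lm(suva ~ heather, data = as.data.frame(data)))[["heather"]] *
  coef(lm(heather ~ suva, data = as.data.frame(data)))[["suva"]]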
