From what I can tell, we cannot run an ordinary least squares regression in R when using weighted data and the `survey` package. Here, we have to use `svyglm()`, which instead runs a generalized linear model (which may be the same thing? I am fuzzy on what is different).

`svyglm` will give you a linear model if you use `family = gaussian()`, which seems to be the default judging by the `survey` vignette (in version 3.32-1); see the example where they fit the `regmodel`.
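To make this concrete, here is a minimal sketch of what the `svyglm()` call might look like. The data frame and weights column below are made up for illustration; the point estimates should match those from a weighted `lm`, though the standard errors will differ because `svyglm` uses a design-based variance estimator.

```r
# Sketch: fit the same weighted linear model with lm() and svyglm().
# The data are made up; the weights are treated as sampling weights.
set.seed(1)
org <- data.frame(x = (-4):5, y = rnorm(10), weights = 1:10)

wls <- lm(y ~ x, data = org, weights = weights)

if (requireNamespace("survey", quietly = TRUE)) {
  library(survey)
  des <- svydesign(ids = ~1, weights = ~weights, data = org)
  regmodel <- svyglm(y ~ x, design = des, family = gaussian())
  # Point estimates agree with the weighted lm fit
  print(cbind(lm = coef(wls), svyglm = coef(regmodel)))
}
```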
It seems that the package just makes sure that you use the correct weights when it calls `glm`. Thus, if your outcome is continuous and you assume it is iid normally distributed, then you should use `family = gaussian()`. The result is a weighted linear model. This answer addresses the question

> Why can we not run OLS in the `survey` package, while it seems that this is possible to do with weighted data in Stata?

by stating that you indeed can do that with the `survey` package. As for the question

> What is the difference in interpretation between the deviance of a generalized linear model and an r-squared value?

there is a straightforward formula to get the $R^2$ with `family = gaussian()`, as some people have mentioned in the comments. Adding weights does not change anything either, as I show below.
> set.seed(42293888)
> x <- (-4):5
> y <- 2 + x + rnorm(length(x))
> org <- data.frame(x = x, y = y, weights = 1:10)
>
> # show data and fit model. Notice the R-squared
> head(org)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
3 -2 -0.3615302 3
4 -1 0.7091697 4
5 0 0.6485203 5
6 1 3.8495979 6
> summary(lm(y ~ x, org, weights = weights))
Call:
lm(formula = y ~ x, data = org, weights = weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-3.1693 -0.4463 0.2017 0.9100 2.9667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7368 0.3514 4.942 0.00113 **
x 0.9016 0.1111 8.113 3.95e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.019 on 8 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8781
F-statistic: 65.83 on 1 and 8 DF, p-value: 3.946e-05
>
> # make redundant data set with redundant rows
> idx <- unlist(mapply(rep, x = 1:nrow(org), times = org$weights))
> org_redundant <- org[idx, ]
> head(org_redundant)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
2.1 -3 -0.5675720 2
3 -2 -0.3615302 3
3.1 -2 -0.3615302 3
3.2 -2 -0.3615302 3
>
> # fit model and notice the same R-squared
> summary(lm(y ~ x, org_redundant))
Call:
lm(formula = y ~ x, data = org_redundant)
Residuals:
Min 1Q Median 3Q Max
-1.19789 -0.29506 -0.05435 0.33131 2.36610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.73680 0.13653 12.72 <2e-16 ***
x 0.90163 0.04318 20.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7843 on 53 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8896
F-statistic: 436.1 on 1 and 53 DF, p-value: < 2.2e-16
>
> # glm gives you the same with family = gaussian()
> # just compute the R^2 from the deviances. See
> # https://stats.stackexchange.com/a/46358/81865
> fit <- glm(y ~ x, family = gaussian(), org_redundant)
> fit$coefficients
(Intercept) x
1.7368017 0.9016347
> 1 - fit$deviance / fit$null.deviance
[1] 0.8916387
The deviance is just the sum of squared errors when you use `family = gaussian()`.
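A quick base-R check of this claim (toy data; any continuous outcome works):

```r
# With family = gaussian(), glm's deviance equals the residual sum of squares,
# since the deviance residuals are then just the response residuals y - mu.
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
fit <- glm(y ~ x, family = gaussian())
all.equal(deviance(fit), sum(residuals(fit)^2))  # TRUE
```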
Caveats
I assume from your question that you want a linear model. Further, I have never used the `survey` package; I only scanned through it quickly and made assumptions about what it does, which I state in my answer.
I prefer to call GEE an estimation method, comparable to ML or REML, since it combines quasi-likelihood estimation with robust variance estimation to estimate generalized linear marginal models for longitudinal data. Some texts and papers also speak of "GEE models", e.g. Hedeker, D., & Gibbons, R. D. (2006). Longitudinal Data Analysis. Wiley-Interscience. I guess that is to separate it from subject-specific (fixed- and random-effects) models, since GEE is mainly used for marginal (population-average) models.
We have no idea about the distribution function of the outcome, but we know its mean ($\mu$) and variance ($V$). So we cannot do ML, but we can turn to the quasi-likelihood,
$$Q(\mu, y) = \int_y^{\mu} (y - t)^{\top} V^{-1} \, dt,$$
and the quasi-likelihood estimating equations (the quasi-score function) are
$$\sum_i \frac{\partial \mu_i^{\top}}{\partial \beta} V_i^{-1} (y_i - \mu_i) = 0.$$
Thus the estimating equations are derived without specifying the joint distribution of the outcomes; they reduce to the ordinary score equations when the marginal distributions belong to an exponential family. The approach based on maximum likelihood (ML) estimation specifies the joint multivariate normal distribution of the outcome variables, while GEE, based on the quasi-likelihood, specifies only the marginal distributions.
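As a sanity check, in the Gaussian special case with identity link, $\mu_i = x_i^{\top}\beta$ and $V_i = \sigma^2$, the quasi-score reduces to the familiar least-squares normal equations:

```latex
\sum_i \frac{\partial \mu_i^{\top}}{\partial \beta} V_i^{-1} (y_i - \mu_i)
  = \frac{1}{\sigma^2} \sum_i x_i \left( y_i - x_i^{\top} \beta \right) = 0
  \quad\Longleftrightarrow\quad
  X^{\top} (y - X \beta) = 0 .
```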
I have seen GEE applied in statistical genetics, but I am afraid that work, too, falls under the framework of generalized linear models.
Best Answer
Your question: yes, they can be. Technically, you can use a normal R-squared measurement as your goodness-of-fit measure. It might not be a very good fit measure, but you can certainly use it. Further, you have to ask yourself whether the increase in precision is worth the loss of readability of your findings. For example, moving from R-squared to an adjusted R-squared is likely to be a meaningful increase in precision at the sacrifice of readability. I personally like McKelvey & Zavoina's measure and other similar approaches (e.g. Xu's R-squared for mixed models). That does not mean they are the best or only approaches.
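For illustration, a minimal base-R sketch of McKelvey & Zavoina's pseudo-$R^2$ for a probit model. The data are simulated; for a logit model the latent residual variance would be $\pi^2/3$ instead of 1.

```r
# McKelvey & Zavoina pseudo-R^2:
#   var(fitted latent index) / (var(fitted latent index) + latent residual variance).
# For probit the latent residual variance is 1; for logit it is pi^2 / 3.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial(link = "probit"))
eta <- predict(fit, type = "link")  # fitted latent index x'beta
mz_r2 <- var(eta) / (var(eta) + 1)
mz_r2
```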