One person assigns a value to each of four objects, and the values are ordered: once three objects have been assigned values, the value of the fourth is predetermined.
This means that each observation corresponds to a permutation of {1, 2, 3, 4}.
There are 4! = 24 possible permutations of this set, and each permutation can be given an ID. This single ID column then represents all four readings for an observation, so it can replace the four columns of the dependent variable, and we can regress it using, say, a multinomial logistic model. The number of classes is at most 24; in practice it depends on which permutations actually occur in your data and on the number of observations, so you can assign IDs accordingly. Once the model predicts an ID, we immediately know the permutation: e.g., if ID 12 denotes {2, 1, 4, 3} and the predicted class is 12, we recover the column of four readings at once.
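A minimal sketch of this encoding in R (the ratings data frame, its columns r1-r4, and the predictors x1, x2 are hypothetical placeholders; multinom from the nnet package handles the multi-class regression):

# Enumerate all 24 permutations of 1:4 (keep rows where all four values differ)
perms <- expand.grid(a = 1:4, b = 1:4, c = 1:4, d = 1:4)
perms <- perms[apply(perms, 1, function(x) length(unique(x)) == 4), ]
perms$id <- seq_len(nrow(perms))  # one id per permutation

# Map each observation's four readings (columns r1..r4) to its permutation id
ratings$id <- match(
  do.call(paste, ratings[, c("r1", "r2", "r3", "r4")]),
  do.call(paste, perms[, c("a", "b", "c", "d")])
)

# Regress the single id column with a multinomial logistic model
library(nnet)
fit <- multinom(factor(id) ~ x1 + x2, data = ratings)

# A predicted id maps straight back to the four readings, e.g. for id 12:
perms[perms$id == 12, c("a", "b", "c", "d")]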
Remember that the difference between significant and non-significant is not (always) statistically significant.
Now, more to the point of your question, model 1 is called pooled regression, and model 2 unpooled regression. As you noted, in pooled regression, you assume that the groups aren't relevant, which means that the variance between groups is set to zero.
In the unpooled regression, with a separate intercept per group, you effectively set the between-group variance to infinity.
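For concreteness, using the leg, head, and site variables from your data, the two extremes would look something like this in R (a sketch):

pooled <- lm(leg ~ head)                   # model 1: groups ignored entirely
unpooled <- lm(leg ~ head + factor(site))  # model 2: a separate intercept per site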
In general, I'd favor an intermediate solution, which is a hierarchical model or partially pooled regression (also called a shrinkage estimator). You can fit this model in R with the lme4 package.
Finally, take a look at this paper by Gelman, in which he argues that hierarchical models help with the multiple-comparisons problem (in your case: are the coefficients different across groups, and how do we correct a p-value for multiple comparisons?).
For instance, in your case,
library(lme4)
summary(lmer(leg ~ head + (1 | site)))  # varying-intercept model
If you want to fit a varying-intercept, varying-slope model (the third model), just run
summary(lmer(leg ~ head + (1 | site) + (0 + head | site)))  # varying-intercept, varying-slope model
Then you can look at the group variance and check that it is different from zero (so the pooled regression isn't the better model) and far from infinity (ruling out the unpooled regression).
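For example, to inspect the estimated group-level variance (reusing the varying-intercept fit from above):

fit <- lmer(leg ~ head + (1 | site))
VarCorr(fit)                 # estimated variance/sd of the site intercepts
as.data.frame(VarCorr(fit))  # the same estimates as a data frame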
Update:
After the comments (see below), I decided to expand my answer.
The purpose of a hierarchical model, especially in cases like this, is to model the variation across groups (in this case, Sites). So, instead of running an ANOVA to test whether one model differs from another, I'd look at the predictions of my model and check whether the by-group predictions of the hierarchical model are better than those of the pooled regression (classical regression).
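As a rough sketch of that comparison (in-sample here; out-of-sample or cross-validated predictions would be a fairer test):

pooled  <- lm(leg ~ head)
partial <- lmer(leg ~ head + (1 | site))
# root-mean-squared prediction error by site, for each model
rmse_by_site <- function(pred) tapply((leg - pred)^2, site, function(e) sqrt(mean(e)))
cbind(pooled = rmse_by_site(fitted(pooled)), partial = rmse_by_site(fitted(partial)))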
Now, I ran my suggestions above and found that
ranef(lmer(leg ~ head + (1 | site) + (0 + head | site)))
would return zero estimates for the varying slope (the varying effect of head by site).
Then I ran
ranef(lmer(leg ~ head + (head | site)))
and I got non-zero estimates for the varying effect of head. I don't know yet why this happened, since it's the first time I've come across this. I'm really sorry for the confusion, but, in my defense, I just followed the specification outlined in the help page of the lmer function (see the example with the sleepstudy data). I'll try to understand what's happening and report back here when (if) I do.
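One known difference between the two specifications (I'm not claiming it fully explains the result above): (1 | site) + (0 + head | site) constrains the random intercept and random slope to be uncorrelated, whereas (head | site) also estimates their correlation. The variance components show the difference:

VarCorr(lmer(leg ~ head + (1 | site) + (0 + head | site)))  # no intercept-slope correlation
VarCorr(lmer(leg ~ head + (head | site)))                   # includes a Corr term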
Best Answer
While this does depend on the theoretical nature of the model, you might want to try using an F-test, where the F statistic is

F = (variation between the sample means) / (variation within the samples).

This test is used to compare models in order to determine which one best explains the variation in the dependent variable. You might consider incorporating this test into a one-way ANOVA: Understanding Analysis of Variance (ANOVA) and the F-test
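In R, for instance, a one-way ANOVA F-test would look like this (score and questionnaire are hypothetical names for your outcome and grouping variable):

fit <- aov(score ~ questionnaire)
summary(fit)  # prints the F statistic and its p-value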
That being said, you mention that you have used three different questionnaires. Be cautious if the numbers of observations for the three regression models are not equal, since the F-test could then be unreliable: e.g., a model with 100 observations could show a lower "fit" than one with 200, yet if the number of observations for the first model were increased, it could in fact turn out to have the best fit.
You could also compute the power of a test for your three samples, i.e., identify the minimum number of observations needed for your results to be reliable. If the samples for the three models are shown to be large enough, then tests such as the F-test will be more reliable.
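Base R's power.anova.test can do this kind of calculation; for example, to find the per-group sample size needed for 80% power (the variance values below are placeholders to be replaced with estimates from your data):

power.anova.test(groups = 3, between.var = 1, within.var = 3,
                 sig.level = 0.05, power = 0.80)  # solves for n per group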