Solved – Best method to analyse whole population data

Tags: population, regression, spss, variance

I have data for an entire population (N=27). I would like to find out which of many possible variables have the greatest effect on one particular dependent variable, and how much of its variance they can explain. The candidate independent variables may also be correlated with each other (some of them certainly are). They are all measured on a continuous (scale) level.

What do you think would be the best method to use and what should I pay attention to?

I'm limited to SPSS and have rather sketchy statistical knowledge. I have been trying to get somewhere with linear regression, but I fail to produce anything interpretable, and it crossed my mind that I might need something entirely different, since this is not a sample.

Thanks in advance!

EDIT: This might not be important for the question, but I have all the variables for 5 separate years. Later on I am planning to examine how the effect of a variable changes over time.

EDIT nr. 2: After the first replies, it seems best to describe the database I have: the 27 cases are 27 European countries, and the dependent variable is the percentage of their population that participated in demonstrations in a given year. I also have lots of possible independent variables such as GDP, unemployment, happiness, etc. I have all these values for 5 separate years. Basically, I'm trying to find something I can write about in my thesis, along the lines of "GDP is the biggest factor and it's twice as big as happiness, blabla …in these countries". The reason I'm saying it's the entire population is that I am not planning to draw general conclusions. I wouldn't be able to do that anyway, as the countries aren't even representative of Europe.

Best Answer

Sounds like random forests (built from regression trees) are the perfect tool for you. You can grow a forest of regression trees and then check the variable importance.

I don't know much about SPSS, but if you are willing to use R (come on... you know you want to!), the caret package can do this with the train() function (passing importance=TRUE through to the underlying random forest). The varImp() function then shows you the importance of each variable. If you want to see what number of variables gives you the best results, you can use rfe() with rfFuncs in its rfeControl() function. A rough sketch is below.
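For example, a minimal caret sketch might look like the following (the response column name demonstrations and the sizes to try are assumptions; adjust them to your data):

require(caret)
df <- read.csv(file.choose())  ### assuming your data is in a csv
predictors <- setdiff(names(df), "demonstrations")  # "demonstrations" is a hypothetical response column name
rf.caret <- train(x = df[, predictors], y = df$demonstrations,
                  method = "rf", importance = TRUE)  # importance=TRUE is passed on to randomForest
print(varImp(rf.caret))  # variable importance from the caret fit

## recursive feature elimination: which number of predictors works best?
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfe.fit <- rfe(x = df[, predictors], y = df$demonstrations,
               sizes = 1:5, rfeControl = ctrl)  # sizes to test are just an example
print(rfe.fit)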

Caret can be a little difficult to wrap your head around, so if you just want to include all your variables, you can use this simpler (but less flexible) code from the randomForest package (which caret uses under the hood):

require(randomForest)
df<-read.csv(file.choose()) ### assuming your data is in a csv
rf.fit<-randomForest(x=df[,1:??],y=df[,??+1],ntree=500,importance=TRUE) # assuming the ?? independent variables are in columns 1 to ?? and the response is in column ??+1
print(rf.fit$importance) #importance of variables
print(rf.fit$rsq)  # pseudo r-squared of the model

I would do this year by year, rather than include the year as a variable. Time will no doubt play a role in the regression, but I think it makes more sense to build a separate model for each year and then look for changes in variable importance over time (see the sketch below), though 5 years isn't a lot to build a powerful time series with. Others might disagree, and I wouldn't mind hearing their opinions.
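A rough sketch of that year-by-year comparison, assuming the data frame has a year column and a response column called demonstrations (both names are assumptions):

require(randomForest)
## fit one forest per year and collect the variable importances
imp.by.year <- lapply(split(df, df$year), function(d) {
  preds <- setdiff(names(d), c("year", "demonstrations"))
  fit <- randomForest(x = d[, preds], y = d$demonstrations,
                      ntree = 500, importance = TRUE)
  importance(fit)  # %IncMSE and IncNodePurity for each predictor
})
print(imp.by.year)  # compare the importance rankings across the five years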

If you don't want to use R, I would update your question to flag it for SPSS users who might know how to implement random forests with variable importance there.
