Solved – How to optimise computational efficiency when fitting a complex model to a large data set repeatedly

computational-statistics, markov-chain-montecarlo, mixed model, r

I am having performance issues using the MCMCglmm package in R to run a mixed effects model. The code looks like this:

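# iter and burn hold the iteration and burn-in counts, defined elsewhere
MC1 <- MCMCglmm(bull ~ 1, random = ~school, data = dt, family = "categorical",
                prior = list(R = list(V = 1, fix = 1), G = list(G1 = list(V = 1, nu = 0))),
                slice = TRUE, nitt = iter, burnin = burn, verbose = FALSE)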

There are around 20,000 observations in the data and they are clustered in around 200 schools. I have dropped all unused variables from the dataframe and removed all other objects from memory, prior to running. The problem I have is that it takes a very long time to run, unless I reduce the iterations to an unacceptably small number. With 50,000 iterations, it takes 5 hours and I have many different models to run. So I would like to know if there are ways to speed up the code execution, or other packages I could use. I am using MCMCglmm because I want confidence intervals for the random effects.

On the other hand, I was hoping to get a new PC later this year, but with a little luck I may be able to bring that forward, so I have been wondering how best to spend a limited amount of money on new hardware: more RAM, a faster CPU, etc. From watching the task manager I don't believe RAM is the issue (it never gets above 50% of physical memory used), but the CPU usage doesn't get much above 50% either, which strikes me as odd. My current setup is an Intel Core i5 2.66GHz, 4GB RAM, 7200rpm HDD. Is it reasonable to just get the fastest CPU possible, at the expense of additional RAM? I also wondered about the effect of level 3 CPU cache size on statistical computing problems like this.

Update: Having asked on meta SO, I have been advised to rephrase the question and post it on Superuser. In order to do so I need to give more details about what is going on "under the hood" in MCMCglmm. Am I right in thinking that the bulk of the computation time is spent doing optimisation, that is, finding the maximum of some complicated function? Are matrix inversion and/or other linear algebra operations also common operations that could be causing bottlenecks? Any other information I could give to the Superuser community would be most gratefully received.

Best Answer

Why not run it on Amazon's EC2 cloud-computing service or a similar service? MCMCpack is, if I remember correctly, mostly implemented in C, so it isn't going to get much faster unless you decrease your model complexity, number of iterations, etc. With EC2, or similar cloud-computing services, you can have multiple instances at whatever specs you desire, and run all of your models at once.
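
The "run all of your models at once" idea can also be tried on a single multi-core machine before renting cloud instances. The following is a minimal sketch using R's bundled parallel package. The function fit_model and the list specs are hypothetical placeholders, not part of the original question; in a real run the body of fit_model would be the MCMCglmm() call, one model per worker.

```r
library(parallel)

# Hypothetical stand-in for one slow model fit. In practice the body would be
# the MCMCglmm() call from the question, with spec choosing the formula or data.
fit_model <- function(spec) {
  set.seed(spec$seed)
  draws <- rnorm(spec$n_draws)  # placeholder work, not a real sampler
  list(name = spec$name, estimate = mean(draws))
}

# One entry per model to fit (names, seeds and sizes are made up).
specs <- list(
  list(name = "model_a", seed = 1, n_draws = 1000),
  list(name = "model_b", seed = 2, n_draws = 1000),
  list(name = "model_c", seed = 3, n_draws = 1000)
)

# Leave one core free; fall back to a single worker if the count is unknown.
cores <- detectCores()
n_workers <- if (is.na(cores)) 1L else max(1L, min(length(specs), cores - 1L))

cl <- makeCluster(n_workers)  # socket cluster, which also works on Windows
# For a real run each worker needs the package and the data, for example:
# clusterEvalQ(cl, library(MCMCglmm)); clusterExport(cl, "dt")
results <- parLapply(cl, specs, fit_model)
stopCluster(cl)

names(results) <- vapply(results, function(r) r$name, character(1))
print(names(results))
```

This does not make any single chain faster; it only lets several models run side by side. Each worker holds its own copy of the data, so memory use grows with the number of workers.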

Related Solutions

Solved – Time variable in Longitudinal data set mixed model question

You have quite a few potential variables to include in your model, and I think there is a real possibility of underpowering the analysis. In English: I think you want to fit the simplest model you can. If you wish to include Quarter, year, or period, you'll either need to have these specified as factors (you may have done this already) or alternatively fit a set of dummy variables - using as.factor is much easier. :)
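
To illustrate the factor-versus-dummy-variable point, here is a small sketch with made-up Quarter values (not the asker's data). Once the column is a factor, R builds the dummy columns itself when it constructs the design matrix:

```r
# Made-up example values; Quarter is stored as a number at first.
d <- data.frame(Quarter = c(1, 2, 3, 4, 1, 2))

# Convert to a factor so each quarter gets its own level.
d$Quarter <- as.factor(d$Quarter)

# model.matrix() shows the dummy coding that a model formula would use:
# an intercept plus one indicator column for each non-reference quarter.
X <- model.matrix(~ Quarter, data = d)
print(colnames(X))
```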

I would try this simple model first:

lmer(Sales ~ Policy + Quarter + (1|Time), data=data)

I think that the Quarter factor is best for trying to capture trend - it's a smaller subset of year, and I wouldn't include other factors like Region or Team yet as that will complicate the model. You're looking for a main effect for Policy. I have included Time as a random effect as I think that is the best way of capturing the idea that the Sales vary randomly over time, and we wish to generalise the policy effect over time, so Time should not be fitted as a fixed effect.

If you wish, you could start adding in more of your extra variables one by one and then compare the fitted models using anova(), assuming you're storing the lmer results. But I wouldn't start with a complicated model.
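
As a sketch of that one-variable-at-a-time comparison, the code below fits two nested models and compares them with a likelihood-ratio test via anova(). The data are simulated purely for illustration (the variable names follow the answer, the numbers are invented), and the models are fitted with REML = FALSE because the two fits differ in their fixed effects.

```r
library(lme4)

# Simulated data for illustration only: 24 time points, 10 sales records each.
set.seed(42)
n_time <- 24
d <- data.frame(Time = factor(rep(seq_len(n_time), each = 10)))
time_id <- as.integer(d$Time)
d$Quarter <- factor(((time_id - 1) %% 4) + 1)
d$Policy <- factor(ifelse(time_id > 12, "after", "before"))
time_effect <- rnorm(n_time, sd = 2)  # random shift for each time point
d$Sales <- 100 + 5 * (d$Policy == "after") + time_effect[time_id] +
  rnorm(nrow(d), sd = 3)

# Simpler model first, then add Quarter.
m_small <- lmer(Sales ~ Policy + (1 | Time), data = d, REML = FALSE)
m_full  <- lmer(Sales ~ Policy + Quarter + (1 | Time), data = d, REML = FALSE)

# Likelihood-ratio comparison of the two nested fits.
cmp <- anova(m_small, m_full)
print(cmp)
```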

Update: the reason I suggest Time as a random effect is that you have time points before and after the policy implementation that aren't in your data. Also, with the simple model I have suggested, there are repeated measures at each time point, from region, sales team, etc, that aren't in the model, so I think that using Time in this model as a random effect is the best way of representing all that underlying complexity.

R Mixed Model – Fitting Multilevel Models to Complex Survey Data in R

As far as I know, you can't really do this in R at the moment if you actually need a mixed model (e.g., if you care about the variance components).

The weights argument to lme4::lmer() won't do what you want, because lmer() interprets the weights as precision weights not as sampling weights. In contrast to ordinary linear and generalised linear models you don't even get correct point estimates with code that treats the sampling weights as precision weights for a mixed model.

If you don't need to estimate variance components and you just want the multilevel features of the model to get correct standard errors you can use survey::svyglm().
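
For the svyglm() route, a minimal sketch follows. It uses the api example data that ships with the survey package, not the asker's survey, so the variable names (dnum, pw, fpc, api00, ell, meals) belong to that example data set only.

```r
library(survey)

# Example data bundled with the survey package: a cluster sample of schools.
data(api)

# Declare the design: clusters are school districts (dnum), pw holds the
# sampling weights, and fpc is the finite population correction.
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Design-based regression: the standard errors account for the clustering,
# but no variance components are estimated.
fit <- svyglm(api00 ~ ell + meals, design = dclus1)
print(summary(fit))
```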
