How to optimise computational efficiency when fitting a complex model to a large data set repeatedly

computational-statistics, markov-chain-montecarlo, mixed model, r

I am having performance issues using the MCMCglmm package in R to run a mixed effects model. The code looks like this:

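# Intercept-only model with a random intercept for school
MC1 <- MCMCglmm(bull ~ 1, random = ~school, data = dt, family = "categorical",
                prior = list(R = list(V = 1, fix = 1), G = list(G1 = list(V = 1, nu = 0))),
                slice = TRUE, nitt = iter, burnin = burn, verbose = FALSE)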

There are around 20,000 observations in the data and they are clustered in around 200 schools. I have dropped all unused variables from the dataframe and removed all other objects from memory, prior to running. The problem I have is that it takes a very long time to run, unless I reduce the iterations to an unacceptably small number. With 50,000 iterations, it takes 5 hours and I have many different models to run. So I would like to know if there are ways to speed up the code execution, or other packages I could use. I am using MCMCglmm because I want confidence intervals for the random effects.

On the other hand, I was hoping to get a new PC later this year, and with a little luck I may be able to bring that forward, so I have been wondering how best to spend a limited amount of money on new hardware: more RAM, a faster CPU, etc. From watching the task manager I don't believe RAM is the issue (it never gets above 50% of physical memory used), but the CPU usage doesn't get much above 50% either, which strikes me as odd. My current setup is an Intel Core i5 2.66GHz, 4GB RAM, 7200rpm HDD. Is it reasonable to just get the fastest CPU possible, at the expense of additional RAM? I also wondered about the effect of level 3 CPU cache size on statistical computing problems like this.

Update: Having asked on meta SO, I have been advised to rephrase the question and post on Superuser. In order to do so I need to give more details about what is going on "under the hood" in MCMCglmm. Am I right in thinking that the bulk of the computation time is spent doing optimisation, that is, finding the maximum of some complicated function? Are matrix inversion and/or other linear algebra operations also common operations that could be causing bottlenecks? Any other information I could give to the Superuser community would be most gratefully received.

Best Answer

Why not run it on Amazon's EC2 cloud-computing service or a similar service? MCMCglmm is, if I remember correctly, mostly implemented in C, so it isn't going to get much faster unless you decrease your model complexity, iterations, etc. With EC2, or similar cloud-computing services, you can have multiple instances at whatever specs you desire, and run all of your models at once.
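The same "run all of your models at once" idea can be tried on a single multi-core machine before paying for cloud instances, since a single R process normally uses one core. Below is a minimal sketch using base R's parallel package. The names fit_one and specs are hypothetical, and fit_one is only a placeholder that draws random numbers; it stands in for a real MCMCglmm() call and is not the package's API.

```r
library(parallel)

# Hypothetical stand-in for one model fit. In a real run the body would be a
# call such as MCMCglmm(bull ~ 1, random = ~school, data = dt, ...).
fit_one <- function(spec) {
  set.seed(spec$seed)
  draws <- rnorm(spec$nitt)  # placeholder for the MCMC sampling work
  list(name = spec$name, mean = mean(draws))
}

# One entry per model to fit (assumed specifications, for illustration only).
specs <- list(
  list(name = "model_a", nitt = 1000, seed = 1),
  list(name = "model_b", nitt = 1000, seed = 2),
  list(name = "model_c", nitt = 1000, seed = 3)
)

# Leave one core free where possible; fall back to a single worker.
cores <- detectCores()
if (is.na(cores)) cores <- 1L
n_workers <- max(1L, min(length(specs), cores - 1L))

cl <- makeCluster(n_workers)
# With real fits, the workers would also need the package and the data, e.g.
# clusterEvalQ(cl, library(MCMCglmm)) and clusterExport(cl, "dt").
fits <- parLapply(cl, specs, fit_one)
stopCluster(cl)

# Collect the model names in the order the specifications were given.
fit_names <- vapply(fits, function(f) f$name, character(1))
```

Each worker is a separate R process, so this shortens the total wall-clock time for a batch of models rather than the time for any single fit, and each worker needs enough memory for its own copy of the data.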
