I have been running a large parallel job on a cluster at my university. Occasionally the file output will be missing a line of text. My support people suggest that this might be due to a "race condition", where the "master" is interrupted during an fprintf call. My job is "embarrassingly simple", where a "master" runs the same "task" across a set of "labs". The jobs are deployed and recovered by the master in a serial fashion, so I see no obvious reason for the race condition. For those who would like more details: the job is run using the Parallel Computing Toolbox (PCT) with 28 cores on a single computing node. The repeating task takes about 1.5 minutes of wall time on a single core, so with 28 labs the master has to handle input and output for one lab about every 3 seconds (90 s / 28).
I have found that the race condition goes away when I use the "W" option (passed to fopen when the output file is opened), which invokes a 4k buffer for the fprintf output. That means that I have to wait for about 50 tasks to finish before I see the first output. I would prefer to see the output occur more frequently, for troubleshooting and quality control.
I got a suggestion from one of the cluster support people to start the parpool with one fewer core than is available. Up until now, I have been running 28 labs on the 28 available cores. That means that there are actually 29 processes running on the computing node (28 labs plus the master). In other words, one of the cores is handling a coexisting master and lab.
This issue has sparked my question: How does PCT allocate labs across available cores? Does it actually try to avoid starting labs on the core where the master resides? To be clear, the master is the first instance of MATLAB, and it is where the labs are initiated via a call to the parpool function.
I have searched for an answer to this question, and found nothing yet on the web. I am hoping that there is someone out there who, given experience, knows the answer to this question. I thought to do some experimenting myself to find an answer, but it is not obvious to me how to determine where, among the available cores, the labs and master reside.
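One possible way to run such an experiment, assuming the computing node runs Linux, is to inspect the processes from outside MATLAB through the /proc filesystem. The sketch below is written in Python and is only an illustration: the helper names `cpu_placement` and `find_processes` are hypothetical, and the process-name pattern "matlab" is an assumption that should be checked against what `ps` reports on the node. It reports, for each matching process, the CPU it last ran on and the list of CPUs it is allowed to run on.

```python
import os
from pathlib import Path


def cpu_placement(pid):
    """Return the name, last-used CPU and allowed CPUs of a Linux process."""
    proc = Path("/proc") / str(pid)
    name = (proc / "comm").read_text().strip()
    # The command name in 'stat' is wrapped in parentheses and may contain
    # spaces, so split only the part after the last ')'.
    stat = (proc / "stat").read_text()
    fields = stat[stat.rindex(")") + 1:].split()
    last_cpu = int(fields[36])  # field 39 of /proc/<pid>/stat: CPU last run on
    allowed = ""
    for line in (proc / "status").read_text().splitlines():
        if line.startswith("Cpus_allowed_list:"):
            allowed = line.split(":", 1)[1].strip()
    return {"pid": pid, "name": name, "last_cpu": last_cpu, "allowed": allowed}


def find_processes(pattern):
    """Collect placement info for every process whose name contains pattern."""
    found = []
    for entry in Path("/proc").iterdir():
        if not entry.name.isdigit():
            continue
        try:
            info = cpu_placement(int(entry.name))
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile or is not readable
        if pattern.lower() in info["name"].lower():
            found.append(info)
    return found


if __name__ == "__main__":
    # "matlab" is an assumed process-name pattern; adjust it to match `ps`.
    for info in find_processes("matlab"):
        print(info)
```

The "last_cpu" value is only a snapshot. Unless a process is pinned to specific cores, the operating system scheduler is free to move it between any of the CPUs in its allowed list, so repeated runs may show the master and the labs on different cores each time.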
Best,
Mark