There are several reasons why different numbers of workers can behave differently. In some situations, even if each PARFOR loop iteration takes quite a long time, it can be quicker to run fewer workers than you have cores available; at other times, it can be quicker to run more workers than you have cores. Which case you are in depends on the kind of resource contention your code encounters.
If your algorithm is memory bound, i.e. the main contention is for access to RAM, then you often find that fewer workers perform better. Adding together two large matrices is a typical example: the amount of computation is trivial compared to the time it takes to get the data into the CPU.
If your algorithm is compute bound, i.e. there is not much memory access compared to the computational complexity, then more workers (up to the number of physical cores) work better.
In some cases, if your algorithm is bound by some sort of latency elsewhere (for example, waiting on file or network I/O), running more workers than you have cores can work best.
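Since the best worker count depends on which of these cases applies, the simplest check is to time your own loop with a few pool sizes. The following is only a rough sketch, assuming Parallel Computing Toolbox and the default cluster profile; `workerCounts`, `nIter`, and the loop body are illustrative placeholders, not anything from your code.

```matlab
% Sketch: time the same parfor loop with several pool sizes.
% workerCounts, nIter and the loop body are assumptions for illustration.
c = parcluster;                          % default cluster profile
maxW = c.NumWorkers;                     % largest pool this profile allows
workerCounts = unique([1, max(1, floor(maxW/2)), maxW]);
nIter = 16;                              % number of parfor iterations
elapsed = zeros(size(workerCounts));

for k = 1:numel(workerCounts)
    delete(gcp('nocreate'));             % close any pool that is already open
    parpool(c, workerCounts(k));         % open a pool of the requested size

    tic;
    out = zeros(1, nIter);
    parfor i = 1:nIter
        A = rand(500);                   % placeholder work; use your own loop body
        out(i) = sum(A(:) + A(:));       % little arithmetic per element read
    end
    elapsed(k) = toc;                    % includes any first-run parfor overhead
end
delete(gcp('nocreate'));

for k = 1:numel(workerCounts)
    fprintf('%d workers: %.2f s\n', workerCounts(k), elapsed(k));
end
```

To try more workers than cores, you would first need to raise `c.NumWorkers` above its default before calling `parpool`. Timings from a single run are noisy, so it is worth repeating each measurement a few times before drawing conclusions.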