MATLAB: Number of threads for calculations in MEX

multi-threaded, parallel, threads

I have created a multi-threaded version of the C-Mex function of FilterM (you do not need to download this to answer the question). The speed-up scales well with the number of processors for large problems. For small problems, however, a smart method is required to choose the optimal number of threads, because starting a thread takes a considerable amount of time. For example, starting a second thread to filter the columns of a [1000 x 2] matrix is 50% slower than performing the job in the single main thread. In my example, the columns of the input matrix are distributed across the threads.

The length of the signal, the number of channels, the order of the filter, the FIR/IIR type, the number of available cores and the current system load all affect the best choice.

Matlab uses magic limits for some multi-threaded functions:

SUM starts one thread per core for a [1 x n] vector and n >= 89000 (in consequence there is a slow-down on a single core CPU)
FILTER starts one thread per core for matrices of >= 16 columns

A better strategy is to start nCore-1 threads and calculate one chunk of data in the main thread.
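The "nCore-1 threads plus the main thread" idea could be sketched in C with POSIX threads roughly as follows. This is only an illustration under assumptions: the names `range_t`, `scale_range` and `scale_with_main_thread` are hypothetical, and scaling each column by a factor stands in for the real filter kernel. The data is assumed to be stored column by column, as MATLAB stores matrices.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 63  /* arbitrary cap for this sketch */

/* Hypothetical description of one contiguous block of columns. */
typedef struct {
    double *x;         /* column-major matrix data */
    size_t  rows;
    size_t  first_col; /* first column of this part */
    size_t  end_col;   /* one past the last column of this part */
    double  factor;    /* stand-in for the filter parameters */
} range_t;

/* Stand-in for the real per-column work: scale the columns in place. */
static void *scale_range(void *arg)
{
    range_t *r = arg;
    for (size_t c = r->first_col; c < r->end_col; c++)
        for (size_t i = 0; i < r->rows; i++)
            r->x[c * r->rows + i] *= r->factor;
    return NULL;
}

/* Split the columns into n_cores parts, hand all but one part to worker
 * threads and process the remaining part in the calling thread.
 * Returns the number of worker threads that were started. */
int scale_with_main_thread(double *x, size_t rows, size_t cols,
                           double factor, int n_cores)
{
    pthread_t tid[MAX_WORKERS];
    range_t   range[MAX_WORKERS + 1];
    int       n_parts, n_workers = 0;

    assert(n_cores >= 1);
    n_parts = n_cores > MAX_WORKERS + 1 ? MAX_WORKERS + 1 : n_cores;
    if ((size_t)n_parts > cols)
        n_parts = (int)cols;      /* never more parts than columns */
    if (n_parts < 1)
        n_parts = 1;

    for (int p = 0; p < n_parts; p++) {
        range[p].x         = x;
        range[p].rows      = rows;
        range[p].first_col = cols * (size_t)p / (size_t)n_parts;
        range[p].end_col   = cols * (size_t)(p + 1) / (size_t)n_parts;
        range[p].factor    = factor;
    }

    /* Parts 1..n_parts-1 go to worker threads. */
    for (int p = 1; p < n_parts; p++) {
        if (pthread_create(&tid[n_workers], NULL, scale_range, &range[p]) == 0)
            n_workers++;
        else
            scale_range(&range[p]);  /* thread could not start: do it here */
    }

    scale_range(&range[0]);          /* part 0 runs in the calling thread */

    for (int i = 0; i < n_workers; i++)
        pthread_join(tid[i], NULL);
    return n_workers;
}
```

With gcc or clang this would typically be compiled with `-pthread`. A Windows MEX build would need the native thread API instead.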

I know how to solve a multi-parameter optimization problem, but even then I would get optimal parameters for my own processor only. Solving this on each individual client computer (and for the current processor load…) is clearly overkill.

What are standard and smart methods to choose the number of threads for a specific problem?

Best Answer

Although multi-threading is an essential topic in times of multi-core processors, the obvious lack of answers is telling:

There is no sufficiently general strategy for deciding on the number of threads.

Too many factors influence how efficiently the work is distributed across threads:

Hyper-threading can increase or decrease the processing time.
The (dynamically changing) number of idle cores is unknown.
Turbo Boost can slow down processing when more cores are active. (By the way, "turbo boost" is a funny name for a feature that slows down the processor when all cores are busy. It is amusing that the processor manufacturers take the opposite view: that it runs faster when some cores are sleeping.)
Whether the caches are exhausted cannot be estimated.
The time to start a new thread differs widely between processors.

Therefore the following might be a fair solution:

1. Let N be the number of real (or virtual) processors.
2. Split the work into M independent chunks.
3. While there are unprocessed chunks and the number of started threads < N:
4. Start a new thread, which keeps fetching new chunks autonomously until all data are processed.
5. Back to 3.
6. Close the threads.
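The steps above could be sketched in C with POSIX threads and C11 atomics as follows. This is a minimal sketch under assumptions, not the FilterM implementation: `job_t`, `run_chunks`, `scale_column` and `scale_columns` are hypothetical names, one chunk is one matrix column, and scaling a column stands in for filtering it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 64  /* arbitrary upper bound for this sketch */

/* Hypothetical job description shared by all threads. */
typedef struct {
    atomic_size_t next_chunk;                  /* next unprocessed chunk */
    size_t        n_chunks;                    /* M: number of chunks */
    void        (*process)(size_t chunk, void *data);
    void         *data;
} job_t;

/* Each thread fetches chunks on its own until none are left. */
static void *worker(void *arg)
{
    job_t *job = arg;
    for (;;) {
        size_t c = atomic_fetch_add(&job->next_chunk, 1);
        if (c >= job->n_chunks)
            break;
        job->process(c, job->data);
    }
    return NULL;
}

/* Start at most n_max threads, but stop starting new ones as soon as
 * every chunk has been claimed. Returns the number of threads started. */
int run_chunks(job_t *job, int n_max)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    assert(n_max >= 1);
    if (n_max > MAX_THREADS)
        n_max = MAX_THREADS;
    atomic_init(&job->next_chunk, 0);

    while (started < n_max &&
           atomic_load(&job->next_chunk) < job->n_chunks) {
        if (pthread_create(&tid[started], NULL, worker, job) != 0)
            break;
        started++;
    }
    if (started == 0)
        worker(job);  /* no thread could be started: work in the caller */

    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started;
}

/* Illustrative chunk function: scale one column of a column-major matrix. */
typedef struct { double *x; size_t rows; double factor; } scale_t;

static void scale_column(size_t col, void *data)
{
    scale_t *s = data;
    for (size_t i = 0; i < s->rows; i++)
        s->x[col * s->rows + i] *= s->factor;
}

int scale_columns(double *x, size_t rows, size_t cols,
                  double factor, int n_max)
{
    scale_t s = { x, rows, factor };
    job_t job;
    job.n_chunks = cols;
    job.process  = scale_column;
    job.data     = &s;
    return run_chunks(&job, n_max);
}
```

For a small input the first thread may claim every chunk before the start loop gets around to creating a second one, so the returned count can be lower than `n_max`. With gcc or clang this would typically be compiled with `-pthread`.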

Starting a thread that turns out not to be needed still wastes time, but only on a core that is not yet working on the problem. A small problem will be solved before all threads are started.

Of course, this is not an optimal strategy either, but I think it is more efficient than any fixed relation between the number of independent chunks and cores.

Related Solutions

MATLAB: Parallel Matrix Multiplication on a Distributed Computing System

There are two levels of parallelism present in MATLAB:

Implicit multi-threaded parallelism for certain built-in MATLAB commands, such as matrix-matrix multiplication or matrix factorization.
Explicit parallelism provided by the Parallel Computing Toolbox.

Let's focus on the implicit multi-threaded parallelism first.

If you have a multi-core or multi-processor machine then the implicit multi-threaded parallelism is on by default in the client MATLAB. When I use the term client MATLAB, I mean the interactive MATLAB that you are running on your Windows/Linux/Mac desktop.

The number of threads used is set automatically by MATLAB at run time. You can type

>> maxNumCompThreads

to find out how many threads MATLAB is using for computation.

Keep in mind that MATLAB ignores hyperthreading. For example, if you have a hyperthreaded processor, your operating system might report 8 cores, but MATLAB will only see the 4 physical cores and report 4 as the result of maxNumCompThreads.

If you need to, you can disable implicit MATLAB multi-threading using one of the following:

Start MATLAB with the -singleCompThread startup option
Type maxNumCompThreads(1) in your program

Note that maxNumCompThreads is currently deprecated and could be discontinued in a future release of MATLAB.

http://www.mathworks.com/help/releases/R2011b/techdoc/ref/maxnumcompthreads.html
http://www.mathworks.com/help/releases/R2011b/techdoc/ref/matlabwindows.html

In relation to the earlier answer, if you start MATLAB on a machine that has 192 cores, MATLAB will report maxNumCompThreads of 192. On such a large machine the client MATLAB will have implicit parallelism of 192 threads. However, you as a user will not be able to control on which cores the threads are run. That will be handled by the operating system.

Now let's discuss the explicit parallelism provided by the Parallel Computing Toolbox and MDCS.

When you type matlabpool open, MATLAB starts instances of headless MATLAB workers. These workers run either on your local machine (local scheduler) or on an MDCS cluster. By default, these workers are single-threaded.

You can test that using the following code snippet:

matlabpool open
spmd
   maxNumCompThreads
end
matlabpool close

However, if you believe that your application would benefit from using a hybrid of explicit and implicit parallelism, for example your tasks are performing many matrix multiplications or matrix factorizations, you can re-enable the implicit parallelism by placing

maxNumCompThreads(N)

at the top of the function you are trying to run on the workers, or inside an spmd block:

spmd
   maxNumCompThreads(N);
end

In my example N is the number of threads that you want to use and should be a reasonable value, for example 2, 4, 8.

You should only re-enable implicit parallelism on workers in situations where it is really warranted, for example, if you have a single MATLAB worker on a multi-core node.

If you have multiple MATLAB workers running on a single node, I would not recommend enabling multi-threaded support as this will most likely result in performance degradation. See Cleve’s article for benchmarks related to mixing multi-threading and workers: http://www.mathworks.com/company/newsletters/news_notes/june07/clevescorner.html

At the moment it is still possible to enable multi-threading on the workers using the maxNumCompThreads command. However, this functionality is deprecated and could be removed in a future release of MATLAB.

MATLAB: Multithreaded FILTER

I ran this on my 8-core machine using R2009a:

function myFilterTest
x  = rand(1e6, 8);
x1 = x(:,1);
x2 = x(:,2); 
x3 = x(:,3);
x4 = x(:,4); 
x5 = x(:,5);
x6 = x(:,6); 
x7 = x(:,7);
x8 = x(:,8); 
[B, A] = butter(3, 0.2, 'low');
tic;
for i=1:100
 y = filter(B, A, x);   % Matrix
 % clear('y');          % Avoid smart JIT interferences => same effects!
end
toc
tic;
for i=1:100
 y1 = filter(B, A, x1);     % Eight vectors
 y2 = filter(B, A, x2);
 y3 = filter(B, A, x3);
 y4 = filter(B, A, x4);
 y5 = filter(B, A, x5);
 y6 = filter(B, A, x6);
 y7 = filter(B, A, x7);
 y8 = filter(B, A, x8);
   % clear('y1', 'y2');       % No qualitative changes
end
toc

clear all;

And got this:

Elapsed time is 16.865596 seconds. 
Elapsed time is 16.117599 seconds.

Only one core was active during each test.

I ran this on my 8-core machine using R2011a and got:

Elapsed time is 12.542615 seconds.
Elapsed time is 16.268821 seconds.

All eight cores were active for the first test (on the matrix) and only a single core for the second test (on individual vectors).

I added this to the bottom of the test:

y_par = zeros(size(x));
matlabpool(8);tic;  
parfor j = 1:8
 for i=1:100 
     y_par(:,j) = filter(B, A, x(:,j));
 end   
 % clear('y_par');       % No qualitative changes
end
toc; matlabpool close;

And got this when using R2011a:

Elapsed time is 13.305009 seconds.
Elapsed time is 16.398203 seconds.
Starting matlabpool using the 'local' configuration ... connected to 8 labs.
Elapsed time is 3.542021 seconds.
Sending a stop signal to all the labs ... stopped.
