Matlab's built-in sum should already do this job very efficiently, e.g. sum(Matrix3D, 3).
The C code looks okay, but maybe it is simply a memory problem. A [1280 x 1280 x 700] array of type double needs 9.18 GB. Creating a second one might exhaust your RAM, so the slow disk caching (swapping) is used. You would then see increased disk access.
Some hints:
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
  // Do not use mxGetData for a double array, because this is a job
  // for mxGetPr:
  double *Matrix_3D = mxGetPr(prhs[0]);
  // Use mwSize and do not speculate that it equals int:
  const mwSize *Dim3Dmatrix = mxGetDimensions(prhs[0]);
  mwSize i, j, n;
  double *mat2Dout;
  // mxCreateDoubleMatrix initializes the output to zero:
  plhs[0] = mxCreateDoubleMatrix(Dim3Dmatrix[0], Dim3Dmatrix[1], mxREAL);
  // No need to cast the output of mxGetPr to double *, because it
  // is one already:
  mat2Dout = mxGetPr(plhs[0]);
  // Use one linear index for the 1st and 2nd dimension.
  // Access neighboring elements of input and output to use the
  // processor cache efficiently:
  n = Dim3Dmatrix[0] * Dim3Dmatrix[1];
  for (j = 0; j < Dim3Dmatrix[2]; j++) {
    for (i = 0; i < n; i++) {
      mat2Dout[i] += *Matrix_3D++;
    }
  }
}
Accessing the elements of the input in large steps is not efficient, because the CPU reads a whole cache line (usually 64 bytes) at once. Therefore the modified method is faster: for a (500, 500, 700) array it needs 0.7 sec instead of 5.4 sec for the original version. By the way, sum is multi-threaded in addition and needs 0.55 sec. (Measured under Matlab R2016b, Core2Duo.)
Your array is 6.5 times larger and the original code needs 8 times more run time. That does not sound like disk caching. So maybe it is a CPU cache problem only, or your processor is even slower than mine.