Matlab's built-in sum should already do this job very efficiently, e.g. sum(Matrix3D, 3).
The C code looks okay, but maybe it is simply a memory problem. A [1280 x 1280 x 700] array of type double needs 9.18 GB. Creating a second one might exhaust your RAM, so the slow disk caching (swapping) is used. You would then see increased disk access.
Some hints:
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
  // Do not use mxGetData for a double array, because this is a job
  // for mxGetPr:
  double *Matrix_3D = mxGetPr(prhs[0]);
  // Use mwSize and do not speculate that it equals int:
  const mwSize *Dim3Dmatrix = mxGetDimensions(prhs[0]);
  mwSize i, j, n;
  double *mat2Dout;
  // mxCreateDoubleMatrix initializes the output to zero:
  plhs[0] = mxCreateDoubleMatrix(Dim3Dmatrix[0], Dim3Dmatrix[1], mxREAL);
  // No need to cast the output of mxGetPr to double *, because it
  // is one already:
  mat2Dout = mxGetPr(plhs[0]);
  // Use one linear index for the 1st and 2nd dimension.
  // Access neighboring elements of input and output to use the
  // processor cache efficiently:
  n = Dim3Dmatrix[0] * Dim3Dmatrix[1];
  for (j = 0; j < Dim3Dmatrix[2]; j++) {
    for (i = 0; i < n; i++) {
      mat2Dout[i] += *Matrix_3D++;
    }
  }
}
Accessing the elements of the input in large steps is not efficient, because the CPU reads a whole cache line (usually 64 bytes) at once. Therefore the modified method is faster: for a (500, 500, 700) array it needs 0.7 sec instead of 5.4 sec for the original version. By the way, sum is multi-threaded in addition and needs 0.55 sec. (Measured under Matlab R2016b, Core2Duo.)
Your array is 6.5 times larger and the original code needs 8 times more run time. That does not sound like disk caching. So maybe it is a CPU cache problem only, or your processor is even slower than mine.