MATLAB: MATLAB Coder: Matrix-Scalar-Multiplication slower in generated code

execution speed, matlab coder, matrix manipulation

We are generating DLLs and MEX files from MATLAB code and noticed that Matrix*Scalar operations take 3-10 times longer in the generated code than in the original MATLAB code. Can anyone explain this slowdown?
I wrote two toy functions, one for Matrix*Scalar and one for Matrix*Vector. For the latter, execution time was the same in the original and the generated code. The matrix size was [1000×1000] in both cases.
Interestingly, the MEX file calls the BLAS library for Matrix*Vector but not for Matrix*Scalar. Could this be the reason?
Toy function for Matrix*Scalar:
function [MatrixOut] = MatrixScalar_Function(MatrixIn,ScalarIn)
    MatrixOut = MatrixIn;
    for index = 1:1000
        MatrixOut = MatrixOut*ScalarIn;
    end
end
Generated C-Code for Matrix*Scalar:
/*
 * MatrixScalar_Function.cpp
 *
 * Code generation for function 'MatrixScalar_Function'
 *
 */
/* Include files */
#include "rt_nonfinite.h"
#include "MatrixScalar_Function.h"
#include "MatrixScalar_Function_data.h"
/* Function Definitions */
void MatrixScalar_Function(const emlrtStack *sp, const real_T MatrixIn[1000000],
                           real_T ScalarIn, real_T MatrixOut[1000000])
{
  int32_T b_index;
  int32_T i0;
  memcpy(&MatrixOut[0], &MatrixIn[0], 1000000U * sizeof(real_T));
  b_index = 0;
  while (b_index < 1000) {
    for (i0 = 0; i0 < 1000000; i0++) {
      MatrixOut[i0] *= ScalarIn;
    }
    b_index++;
    if (*emlrtBreakCheckR2012bFlagVar != 0) {
      emlrtBreakCheckR2012b(sp);
    }
  }
}
/* End of code generation (MatrixScalar_Function.cpp) */

Best Answer

I'm only seeing about a 5% difference in timing when comparing the BLAS dscal function call to an explicit loop on my R2017b Win64 setup. That is certainly not the 3-10 times difference that you are seeing. You might try replacing that loop with a dscal call and see what you get in your case. E.g., replace this
for (i0 = 0; i0 < 1000000; i0++) {
MatrixOut[i0] *= ScalarIn;
}
with something like this
#include "blas.h"
:
int64_T n, incx; /* or maybe int32_T in your case */
:
incx = 1;
n = 1000000;
dscal( &n, &ScalarIn, MatrixOut, &incx );
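To make the arguments of that call concrete, here is a minimal plain-C sketch of what a dscal-style routine computes: it multiplies n elements of x by a scalar, stepping through memory with stride incx. The function name scale_strided is hypothetical and is only a stand-in for illustration; it is not the BLAS routine and takes its arguments by value rather than by pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dscal-style operation (not the BLAS routine):
 * computes x[i * incx] *= alpha for i = 0 .. n-1. */
static void scale_strided(ptrdiff_t n, double alpha, double *x, ptrdiff_t incx)
{
    ptrdiff_t i;

    /* Like the reference dscal, do nothing for a non-positive count or stride. */
    if (n <= 0 || incx <= 0) {
        return;
    }
    assert(x != NULL);

    for (i = 0; i < n; i++) {
        x[i * incx] *= alpha;
    }
}
```

With incx = 1 and n = 1000000 this does the same work as the explicit loop in the generated code, which fits with the two timing nearly the same when run on a single thread.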
I do, however, see a big difference in timing when compared to the m-code. My guess is that the BLAS dscal routine is not multi-threaded, which is why its timing is nearly the same as the manual loop, whereas MATLAB uses a multi-threaded scalar multiply routine in the background for the m-code.
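If the multi-threading guess is right, one way to test it would be to parallelize the generated loop yourself. The following is only a sketch under that assumption: the function name scale_parallel is hypothetical, and using OpenMP here is my suggestion, not something MATLAB Coder emitted in the code above. The pragma takes effect only when the compiler is run with OpenMP enabled (for example -fopenmp on GCC); otherwise the loop runs serially and gives the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scale x[0 .. n-1] in place by alpha.
 * With OpenMP enabled, the iterations are divided among threads;
 * without it, the pragma is skipped and this is an ordinary serial loop. */
static void scale_parallel(double *x, long n, double alpha)
{
    long i;

    assert(x != NULL || n <= 0);

#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < n; i++) {
        x[i] *= alpha; /* iterations are independent, so splitting them is safe */
    }
}
```

Since a scalar multiply does very little arithmetic per element loaded, any gain from extra threads may be limited by memory bandwidth, so it is worth timing this against the serial loop rather than assuming a speedup.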