MATLAB: Efficient method for finding index of closest value in very large array for a very large amount of examples

arrayfindspeed

I have two very large one dimensional arrays, 'aRef' which is around 11,000,000 elements and 'aTest' which is around 10,000,000 elements. I need to find the index of the closest element in 'aRef' for all elements in 'aTest'. 'aRef' is sorted and 'aTest' can be sorted if that will help performance.

– Method 1: Returns an out-of-memory error because the arrays are far too large

diff = abs(bsxfun(@minus,aRef,aTest'));
[~, I] = min(diff);

– Method 2: Takes around 0.03 seconds per iteration (but varies greatly) and therefore around 300000 seconds in total

for k = 1:n
  diff = abs(aRef- aTest(k));
  [~, I(k)] = min(diff);
end

– Method 3: Takes around 0.013 seconds per iteration and therefore 130000 seconds in total

 for k = 1:n
   i_lower  = find(aRef <= aTest(k),1,'last');
   i_higher = find(aRef >= aTest(k),1,'first');
 end

Is there a more efficient method for this that won't exhaust the memory or take so long to run?

Thanks for your help.

Best Answer

Note: Using diff as a variable name is not a good idea as it shadows the very useful diff function. Also, for method 2, your code does not show the preallocation of I. If you don't preallocate I, it will seriously slow down the code.
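As a minimal sketch of the preallocation point, here is method 2 with I allocated before the loop and the temporary renamed so it no longer shadows diff. The small aRef and aTest values are made up for illustration, and the variable name d is my own choice:

```matlab
% Hypothetical small inputs, only for illustration
aRef  = [1 4 10 20];      % sorted reference values
aTest = [0 2 3 8 100];    % values to look up

n = numel(aTest);
I = zeros(1, n);          % preallocate so I is not regrown on every iteration
for k = 1:n
    d = abs(aRef - aTest(k));   % 'd' instead of 'diff' to avoid shadowing the function
    [~, I(k)] = min(d);         % index of the closest element in aRef
end
```

This only removes the cost of growing I; each iteration still scans all of aRef, so the loop remains far too slow for millions of elements.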

Anyway, for two vectors of around 10,000 elements, the following is around 200 times faster than your method 1 on my machine.

edges = [-Inf, mean([aRef(2:end); aRef(1:end-1)]), +Inf];
I = discretize(aTest, edges);

Basically, it constructs an edge vector halfway between consecutive elements of your aRef, and uses MATLAB's histogram functions to get the index of the bin each element of aTest falls in. discretize is new in R2015a. On R2014b, you can use the third return value of histcounts. On even older versions, use the 2nd return value of histc (although histc behaves slightly differently with regard to the last bin).

%2014b
[~, ~, I] = histcounts(aTest, edges); %probably slower than discretize
%before 2014b
[~, I] = histc(aTest, edges); %I has the same size as aTest; only the count output (1st return value) has an extra element for the last edge
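To make the midpoint-edges idea concrete, here is a small worked sketch. The input values are made up, both vectors are assumed to be row vectors, and aRef is assumed sorted with no repeated values. The brute-force comparison at the end is only there as a sanity check on a tiny example and is not something to run on the full-size arrays:

```matlab
% Hypothetical small inputs, only for illustration
aRef  = [1 4 10 20];      % sorted reference values (row vector)
aTest = [0 2 3 8 100];    % values to look up (row vector)

% Bin edges halfway between consecutive reference values,
% padded with -Inf/+Inf so out-of-range values fall in the first/last bin
edges = [-Inf, (aRef(1:end-1) + aRef(2:end)) / 2, Inf];   % [-Inf 2.5 7 15 Inf]

% Bin k covers the values whose nearest reference element is aRef(k)
I = discretize(aTest, edges);   % [1 1 2 3 4]

% Brute-force check, feasible only because the example is tiny
D = abs(bsxfun(@minus, aRef(:), aTest(:).'));
[~, Icheck] = min(D, [], 1);
```

A value lying exactly on a midpoint is assigned to the upper neighbour here, because each bin includes its left edge, whereas min returns the first (lower) index on a tie. If that distinction matters for your data, handle ties explicitly.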

Related Solutions

MATLAB: MEX problem with mxGetData

When you use the curly braces { } in MATLAB you are building a cell array, so you need to use mxGetCell in your mex routine. E.g.,

prhs[1] = {2, [5:12]}
prhs[1] = {3, [ ], [1:5]}
prhs[1] = {{4,6},[1:5], {1,8,9}, [3:6]}

In all of these cases you can get at the data as follows:

mxArray *cell;
double *pr;
mwSize i, j, n, ncell;
    :
if( mxIsCell(prhs[1]) ) {
    ncell = mxGetNumberOfElements(prhs[1]);
    for( i=0; i<ncell; i++ ) {
        cell = mxGetCell(prhs[1],i);
        if( mxIsEmpty(cell) ) {
            // Code to handle empty case
        } else {
            if( mxIsDouble(cell) ) {
                n = mxGetNumberOfElements(cell);
                pr = mxGetPr(cell);
                for( j=0; j<n; j++ ) {
                    // Code to manipulate pr[j] here
                }
            } else {
                // Code to handle non-double case
            }
        }
    }
}

MATLAB: MATLAB crashes when running mex file with recursive function

You've got your C mex indexing wrong. Remember, C is 0-based indexing, not 1-based indexing. So in these lines:

    l  = (mwSize) mxGetScalar(prhs[0]);
    m     = (mwSize) mxGetScalar(prhs[3]);
    p     = (mwSize) mxGetScalar(prhs[4]);
    t_arr    = mxGetPr(prhs[2]);
    t_array = mxGetPr(prhs[5]);

The available indexes for prhs are 0,1,2,3,4 since you are calling the B mex routine with five inputs. That last line above uses prhs[5] which means you are reading beyond the valid indexing for prhs, so you get a memory access error and a seg fault. What you probably need to use is this:

    l  = (mwSize) mxGetScalar(prhs[0]);
    m     = (mwSize) mxGetScalar(prhs[2]);
    p     = (mwSize) mxGetScalar(prhs[3]);
    t_arr    = mxGetPr(prhs[1]);
    t_array = mxGetPr(prhs[4]);
