[GIS] Memory error in python when reclassifying array

numpy python

I have a .tif that I read into an array (call it tifArray), and I would like to classify the array based on set of conditions:

  • Where 1200 <= tifArray <= 4000, outputArray = 1
  • Where tifArray < 1200, outputArray = 2
  • Where tifArray > 4000, outputArray = 3

I've tried both creating a new array (preferred) and replacing the values in place (see below), but either way I get a MemoryError. I've tried this on machines with 10 GB and 8 GB of free RAM. I've also tried both the np.where function and plain boolean indexing (below). I have no clue why I'm getting a memory error, and I don't think I should be running into that problem.

Some information about tifArray:

tifArray.shape = (55500, 55500)
tifArray.dtype = uint16

And here is one method I tried:

threshold_low = 1200
threshold_high = 4000
tifArray[(tifArray >= threshold_low) & (tifArray <= threshold_high)] = 1
tifArray[(tifArray < threshold_low)] = 2 
tifArray[(tifArray > threshold_high)] = 3   

The error:

File "./classify_segmented_fromAmplitude.py", line 90, in process_tile
    tifArray[(tifArray >= threshold_low) & (tifArray <= threshold_high)] = 1
MemoryError

When I comment out the first condition:

tifArray[(tifArray >= threshold_low) & (tifArray <= threshold_high)] = 1

and run only the other two lines, I get no MemoryError. So the problem is evidently with the combined condition, but I don't know how to get around it. Any ideas?

Best Answer

This comes down to chaining boolean operations efficiently. Each comparison (tifArray > threshold) yields a temporary boolean array with the same dimensions as tifArray, and that array consumes memory too. NumPy stores each bool in one byte, so such a mask needs half the memory of a uint16 array of the same shape.

An array of dtype uint16 and shape (55500, 55500) takes up about 6 GB of memory. So on a 10 GB machine you can get away with one comparison at a time (6 GB for tifArray plus 3 GB for the intermediate boolean array).
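The figures above can be checked without allocating anything large. This is only a quick sketch of the arithmetic; the variable names are made up for illustration.

```python
import numpy as np

shape = (55500, 55500)
n_elements = shape[0] * shape[1]

# Bytes per element come from the dtype; no full-size array is created here.
tif_bytes = n_elements * np.dtype(np.uint16).itemsize   # 2 bytes per value
mask_bytes = n_elements * np.dtype(np.bool_).itemsize   # 1 byte per value

print(f"uint16 array: {tif_bytes / 1e9:.2f} GB")
print(f"bool mask:    {mask_bytes / 1e9:.2f} GB")
```

For an array that already exists, tifArray.nbytes reports the same number directly.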

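The expression (tifArray >= threshold_low) & (tifArray <= threshold_high) yields three temporary arrays: one for each comparison and one for the result of &. Together they are 1.5 times the size of tifArray, roughly 9 GB on top of the array itself, which is more than your machines can handle.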

In addition, the first line writes a lot of 1 entries before the lower threshold is tested. Since 1 is below 1200, the next line sets all of them to 2. So even if your code ran, the result would be wrong.
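The ordering problem is easy to see on a tiny array. The sample values below are invented for illustration; the three assignments are the ones from the question.

```python
import numpy as np

low, high = 1200, 4000
sample = np.array([500, 1200, 2500, 4000, 5000], dtype=np.uint16)

wrong = sample.copy()
wrong[(wrong >= low) & (wrong <= high)] = 1  # in-range values become 1
wrong[wrong < low] = 2                       # ...but 1 < 1200, so they become 2
wrong[wrong > high] = 3

print(wrong)  # no 1 survives: [2 2 2 2 3]
```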

One way to solve all of this is to test the lower threshold first, then the upper threshold, and finally set everything above 3 to 1, since the only remaining values that are not 2 or 3 lie between the lower and upper thresholds. This relies on the lower threshold being greater than 3, which holds here.

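thrs_low = 1200
thrs_high = 4000

# Each line builds only one temporary boolean mask at a time.
tifArray[tifArray < thrs_low] = 2   # below the lower threshold
tifArray[tifArray > thrs_high] = 3  # above the upper threshold
tifArray[tifArray > 3] = 1          # whatever is left lies in [thrs_low, thrs_high]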
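If even one full-size mask is too much, the same three steps can be applied to a block of rows at a time, so each temporary mask is only as large as the block. This is a rough sketch rather than a tested solution for the full raster: reclassify_in_blocks and block_rows are hypothetical names, and like the code above it assumes the lower threshold is greater than 3.

```python
import numpy as np

def reclassify_in_blocks(arr, low, high, block_rows=1024):
    """Reclassify arr in place, one block of rows at a time (hypothetical helper)."""
    for start in range(0, arr.shape[0], block_rows):
        block = arr[start:start + block_rows]  # basic slice: a view, not a copy
        block[block < low] = 2
        block[block > high] = 3
        block[block > 3] = 1                   # remaining values are in [low, high]
    return arr

# Small invented example, processed one row per block
demo = np.array([[500, 1200, 2500],
                 [4000, 5000, 0]], dtype=np.uint16)
reclassify_in_blocks(demo, 1200, 4000, block_rows=1)
print(demo)
```

Because each block is a view into the original array, the assignments modify tifArray itself and no second full-size array is needed.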