I have a .tif that I read into an array (call it tifArray), and I would like to classify the array based on a set of conditions:
- Where 1200 <= tifArray <= 4000, outputArray = 1
- Where tifArray < 1200, outputArray = 2
- Where tifArray > 4000, outputArray = 3
I've tried both creating a new array (preferred) and replacing the values in place (see below), but regardless I get a MemoryError. I've tried on machines that have 10 and 8 GB of free RAM. I've also tried using the np.where function and plain indexing (below). I have no clue why I'm getting a memory error, and I don't think I should be running into that problem.
Some information about tifArray:
tifArray.shape = (55500, 55500)
tifArray.dtype = uint16
And here is one method I tried:
threshold_low = 1200
threshold_high = 4000
tifArray[(tifArray >= threshold_low) & (tifArray <= threshold_high)] = 1
tifArray[(tifArray < threshold_low)] = 2
tifArray[(tifArray > threshold_high)] = 3
The error:
File "./classify_segmented_fromAmplitude.py", line 90, in process_tile
tifArray[(tifArray >= threshold_low) & (tifArray <= threshold_high)] = 1
MemoryError
When I comment out the first condition:
tifArray[(tifArray >= threshold_low) & (tifArray <= threshold_high)] = 1
and just run the following two lines, I get no MemoryError. So obviously the problem is with the multiple conditions, but I don't know how to get around it. Any ideas?
Best Answer
This comes down to efficient chaining of boolean operations. Each comparison (tifArray > threshold) yields a temporary boolean array with the same dimensions as tifArray, which also consumes memory (since NumPy stores each bool in one byte, like int8, it needs half the memory of a uint16 array).
An array of type uint16 and shape (55500, 55500) takes up ~6 GB of memory. So on a 10 GB machine you can get away with one comparison at a time (6 GB for tifArray + 3 GB for the intermediate bool array).
The comparison (tifArray >= threshold_low) & (tifArray <= threshold_high) yields three temporary arrays (one per comparison plus one for the & result), which together are 1.5 times the size of tifArray: more than your machines can handle.
In addition, this creates a lot of 1 entries before you test the lower threshold, so the following line (tifArray < threshold_low) effectively sets all of them to 2. Even if your code ran, the result would be wrong.
One way to solve all this is to first test the lower threshold, then the upper threshold, and then set everything above 3 to 1, since the only remaining values that are not 2 or 3 are between the lower and upper threshold.
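The ordering described above could be sketched roughly as follows. This is a minimal illustration on a tiny stand-in array, not a tested solution for the full 55500 x 55500 raster. The function name classify_in_place and the demo array are hypothetical, and the sketch assumes threshold_low is greater than 3, so that no original in-range value collides with the labels 2 and 3.

```python
import numpy as np

def classify_in_place(arr, threshold_low, threshold_high):
    """Overwrite arr with labels 1, 2, 3 using one boolean mask per step.

    Assumption: threshold_low > 3, so values inside [low, high] can be
    told apart from the labels 2 and 3 in the final step.
    """
    arr[arr < threshold_low] = 2    # below the lower threshold
    arr[arr > threshold_high] = 3   # above the upper threshold
    arr[arr > 3] = 1                # everything left lies in [low, high]
    return arr

# Small stand-in for tifArray, same dtype as in the question.
demo = np.array([[0, 1199, 1200],
                 [2500, 4000, 4001]], dtype=np.uint16)
classify_in_place(demo, 1200, 4000)
print(demo)
```

Each statement builds only a single temporary boolean mask, so the peak extra memory should be about half the size of the input array rather than 1.5 times its size.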