[GIS] Iterating an Earth Engine calculation (reducer) over every country in the world: a simple example

big-data, google-earth-engine

I'm an EE newbie, and would like to create a detailed 'resource' for the following challenge.

The problem: What is the best way to iterate a simple calculation (e.g. a Reducer, or a simple user-defined function) over a very large feature collection? In this context, 'best' means (1) the computation works (does not throw a 'memory limits exceeded' or a 'computation timed out' error), and (2) it runs fast and in an 'EE way' (i.e. it does not consume server-side resources unnecessarily).

A list of related-but-not-useful questions found to date is appended.

A reproducible example:
The problem can be demonstrated using pre-loaded EE assets. Let's try to iterate over a vector dataset (an EE 'feature collection') of the world's national boundaries in order to sum the area of forest loss recorded by Hansen et al. (2017 update) for each country in the world:

var hansen = ee.Image("UMD/hansen/global_forest_change_2017_v1_5");
var countries = ee.FeatureCollection("USDOS/LSIB/2013");
//this is a detailed map of country boundaries for the whole world
var simpleCountries = ee.FeatureCollection("USDOS/LSIB_SIMPLE/2017");
//the same thing, but simplified. Small islands and other complicated vector features do not appear
var fourCountries = countries.filter(ee.Filter.inList('name',['SLOVAKIA','BURKINA FASO','BHUTAN','CONGO (Brazzaville)']));
//this is a test collection of four countries with relatively simple geometry, including Congo (to check against the 'Congo' result [in the ee tutorial][5])

var lossImage = hansen.select(['loss']); //all Hansen 'loss' pixels, 2000-2017
//these are 30m pixels showing 'forest loss', which we will try to count
var lossIn2012 = hansen.select(['lossyear']).eq(12);
//to make things easy, we will work with just one year's loss data at a time
var areaImage = lossIn2012.multiply(ee.Image.pixelArea());
//create a layer where values are pixel area


//the following call (with just 4 small countries) succeeds in less than a minute
//the area reported for Congo (in 2012) accords with the area in the tutorial. Nice!
var smallReduction=areaImage.reduceRegions({
  collection:fourCountries,
  reducer:'sum'
})
print('smallReduction', smallReduction)

//this call fails ("computation timed out")
var hugeReduction=areaImage.reduceRegions({
  collection:countries,
  reducer:'sum'
})
print('hugeReduction', hugeReduction)

//following [the debugging guide][5], we can use map with 
//reduceRegion to break the reduction into smaller tasks.
//this call also fails ("computation timed out")
var mappedReduction=countries.map(function(feature){
  return feature.set(areaImage.reduceRegion({
    reducer:'sum',
    geometry:feature.geometry(),
    maxPixels:1e15 //default is 1e9, which is exceeded by an early country in the list (Brazil?), throwing an error
  }))
})
print('mappedReduction', mappedReduction)

//following the debugging guide [again][5], we can export the result... this doesn't get us
//more memory, but it will stop the computation from timing out
//(I killed this task after 22 hours of runtime...)
Export.table.toDrive({
  collection:mappedReduction,
  description:'mappedReduction',
  fileFormat:'CSV'
})

What gives? Yes, this is a whole lot of pixels, but we're here (maybe learning JavaScript) on the promise of 'planetary scale analysis'. So how can we run planetary-scale analyses in a way that fits within the resources the wonderful folks at Google are able to make available? I (newbie) think answering the following sub-questions might help:

  • Can the calculation task (ee.Reducer.sum()) be broken up or slowed down to allow the computation to succeed? [edit] I think this merits its own question, so I have opened one here. If the script in this question fails because some features are simply too big, the problem can be overcome by using a larger-scale feature collection (but do see that question: EE appears to choke on features of a reasonable size for 'planetary analysis').
  • If we use the bestEffort or tileScale parameters in the call to reduceRegion, how can we make their effect (and any resulting trade-off in accuracy) visible? I've played with both (no luck), but can't see what's going on, or where they fail. A sketch of one way to compare them follows this list.
  • Can Export be told to create a separate output file for each feature, so we can see the progress of a multi-hour operation (and see whether it fails on a specific feature)? A per-country export sketch also follows this list.
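
On the second sub-question, this is roughly the comparison I have in mind. It is only a sketch: the choice of 'BRAZIL' as the test feature, the 30 m scale, and the maxPixels/tileScale values are illustrative assumptions, not recommended settings.

//compare a strict reduction (small tiles via tileScale) with a bestEffort one,
//so that any difference in the reported area is visible side by side
var testCountry = ee.Feature(countries.filter(ee.Filter.eq('name', 'BRAZIL')).first());

var strictSum = areaImage.reduceRegion({
  reducer: ee.Reducer.sum(),
  geometry: testCountry.geometry(),
  scale: 30,        //Hansen native resolution
  maxPixels: 1e13,
  tileScale: 16     //smaller tiles: less memory per tile, more per-tile overhead
});

var bestEffortSum = areaImage.reduceRegion({
  reducer: ee.Reducer.sum(),
  geometry: testCountry.geometry(),
  scale: 30,
  bestEffort: true  //EE may coarsen the scale so the region fits under maxPixels
});

print('strict sum (tileScale 16)', strictSum);
print('bestEffort sum', bestEffortSum);

Printing the two dictionaries side by side at least shows whether bestEffort changed the answer, even if it doesn't reveal what scale was actually used.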
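
And on the third sub-question, a rough sketch of a one-task-per-country workaround (the 'name' property and the 30 m scale are taken from the code above; the batching pattern itself is just my guess at a workaround, not an established recipe):

//pull the distinct country names to the client, then queue one small export per country;
//each task succeeds or fails on its own, so progress (and any problem feature)
//is visible in the Tasks tab
countries.aggregate_array('name').distinct().evaluate(function(names) {
  names.forEach(function(name) {
    var oneCountry = countries.filter(ee.Filter.eq('name', name));
    var oneResult = areaImage.reduceRegions({
      collection: oneCountry,
      reducer: ee.Reducer.sum(),
      scale: 30
    });
    Export.table.toDrive({
      collection: oneResult,
      description: 'loss2012_' + name.replace(/\W+/g, '_'), //task names can't contain spaces or punctuation
      fileFormat: 'CSV'
    });
  });
});

This queues a few hundred export tasks that each still have to be started, so it's clunky, but it would at least make progress (and per-feature failures) observable.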

Related-but-not-useful questions have been asked:

  • here (OP asked about memory limits when mapping over a large feature collection and was directed to seek help in a closed google group)
  • here (OP asked basically the same question; it has been ignored)
  • here (OP wanted to use reduce region on a feature collection
    where memory / time limits were a risk – neither issue arose, and no
    workarounds are provided)
  • here (OP wanted to filter and sort an image collection for an arbitrary (and large) number of vector geometries, and export the
    result. The discussion appears to show how to set up such a script,
    but doesn't address memory or computation time limits)
  • here (OP mapped a reducer over a feature collection, but did not discuss memory or time limits)
  • here (OP is using the python API and iterating with a for() loop; memory or time limits don't come up).
  • and here (OP is iterating over years in the same example dataset used in this question; memory and time limits don't come up)

Best Answer

The primary problem you're encountering is that the Large Scale International Boundary table contains several very complex polygons, and those complex polygons take a long time to reproject (e.g. the Canada polygon contains 193k vertices). That's compounded by the fact that the reprojection is repeated for every tile that's processed, so even though the reduceRegion(s) doesn't fail, it can take a (very) long time to complete because of the extra overhead added to every tile.
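
If you want to see where that complexity lives, here is a quick sketch (using the variables from the question; the 'n_vertices' property name is just illustrative):

//approximate per-feature vertex counts: coordinates() returns nested lists of
//[lon, lat] pairs, so flattening and halving the length gives a rough count
var withVertexCounts = countries.map(function(feature) {
  var nVertices = feature.geometry().coordinates().flatten().length().divide(2);
  return feature.set('n_vertices', nVertices);
});
//sorting the full detailed collection can itself be slow; try simpleCountries first
print('most complex features', withVertexCounts.sort('n_vertices', false).limit(10));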

For extremely long aggregations like this, the mapped reduceRegion is better because each feature that's processed gets cached, so if you end up having to restart, the intermediate values aren't lost.
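
In code, that pattern might look something like this (a sketch only, reusing the variables from the question; the explicit scale, the maxPixels value, and the output property name are my choices, not requirements):

//map reduceRegion over the collection and export rather than print: per-feature
//results are cached, so a restarted export does not redo countries that already finished
var lossByCountry = countries.map(function(feature) {
  var stats = areaImage.reduceRegion({
    reducer: ee.Reducer.sum(),
    geometry: feature.geometry(),
    scale: 30,       //Hansen native resolution
    maxPixels: 1e13
  });
  return feature.set('loss_m2_2012', stats.get('lossyear')); //areaImage keeps the 'lossyear' band name
});

Export.table.toDrive({
  collection: lossByCountry,
  description: 'lossByCountry2012',
  fileFormat: 'CSV',
  selectors: ['name', 'loss_m2_2012'] //skip the (huge) geometries in the output
});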