GDAL – Should JP2 or Cloud Optimized GeoTIFFs Be Used for Accessing Small Raster Windows on Amazon S3

amazon-s3 gdal jpeg-2000 rasterio

I am building a system where we have a lot of large rasters (Sentinel-2 bands) stored in S3. Much of this data is stored in our own buckets, so we can store it in whatever format we find most usable.

I need to access small windows (often less than 500×500 pixels) from those rasters efficiently and frequently. Since this happens a lot, I need to be able to download those small windows fast and without transferring more data than I need.

For this I am using GDAL's /vsis3/ virtual file system, which solves this problem nicely, as explained in the answer here.
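For illustration, a windowed read through /vsis3/ might look like the sketch below. It assumes rasterio 1.0 or later and AWS credentials that GDAL can pick up from the environment; the helper names are made up for this example. rasterio also accepts s3:// URLs directly, so the path conversion is only there to make the /vsis3/ form explicit.

```python
def s3_url_to_vsis3(url):
    """Turn 's3://bucket/key' into the '/vsis3/bucket/key' path GDAL expects."""
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError("expected an s3:// URL, got %r" % (url,))
    return "/vsis3/" + url[len(prefix):]


def read_window(url, col_off, row_off, width, height, band=1):
    """Read one band of a pixel window as a 2-D array.

    GDAL should request only the blocks that overlap the window,
    plus whatever header data the format needs.
    """
    # Imported lazily so the path helper above works without rasterio installed.
    import rasterio
    from rasterio.windows import Window

    window = Window(col_off, row_off, width, height)
    with rasterio.open(s3_url_to_vsis3(url)) as src:
        return src.read(band, window=window)
```

A call such as read_window("s3://my-bucket/some/band.tif", 2000, 3000, 100, 100) would then fetch a 100×100 window; the bucket and key here are placeholders.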

I am using GDAL 2.2.4 from conda-forge, which seems to use OpenJPEG 2.3.0.

The problem

I see two choices for how I can store my data.

  1. As Cloud Optimized GeoTIFFs
  2. As JPEG 2000

The nice properties of option 2 are that the file sizes are a lot smaller, and that all data we have not processed ourselves can be pulled directly from Amazon's own S3 bucket.

However, the problem with option 2 is that it seems to download much more data to access the window than option 1 does.

I can see that the JP2 files for Sentinel-2 images have a tile size of 1024 by default. So I created a Cloud Optimized GeoTIFF with a tile size of 256. This performs much better (in terms of how much data I have to download), so I expected the tile size to be the reason. However, I then made a Cloud Optimized GeoTIFF with a tile size of 1024, and again it performs much better than the .jp2 file.
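To make the tile-size reasoning concrete, here is a rough sketch of how many tiles a window touches and how much uncompressed data those tiles represent. It assumes square tiles and 2 bytes per pixel (Sentinel-2 bands are stored as 16-bit integers); the actual transfer is smaller because tiles are compressed, so this only shows the relative effect of tile size.

```python
def tiles_touched(col_off, row_off, width, height, tile_size):
    """Count the square tiles that a pixel window intersects."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return (last_col - first_col + 1) * (last_row - first_row + 1)


def decoded_bytes(n_tiles, tile_size, bytes_per_pixel=2):
    """Uncompressed size of the touched tiles (assumes 16-bit pixels)."""
    return n_tiles * tile_size * tile_size * bytes_per_pixel


# A 100x100 window at pixel offset (1100, 1100), for both tile sizes.
for tile_size in (256, 1024):
    n = tiles_touched(1100, 1100, 100, 100, tile_size)
    print(tile_size, n, decoded_bytes(n, tile_size))
```

For this particular window both layouts touch a single tile, but the 1024 tile holds 16 times as many pixels as the 256 tile. That is why tile size was the natural first suspect.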

Here is a very crude visualization of the data transfer for each file
[Image: data transfer for each file]

This shows the data transfer required to fetch a 100×100 pixel window from a single-band raster for each file type.

Now here is the question

Why do I have to download so much more data when I try to access the same window from a JP2 file than from a GeoTIFF file? It does not seem to be the tile size, so what is the extra data I am downloading, and can I somehow avoid it?

I am just using the OpenJPEG driver. Can this be the problem, and will a proprietary driver solve the problem I am describing?

Or do I simply have to bite the bullet and use Cloud Optimized GeoTIFFs to access the windows faster, at the cost of some extra file size?

Best Answer

Why do I have to download so much more data when I try to access the same window from a JP2 file than from a GeoTIFF file? It does not seem to be the tile size, so what is the extra data I am downloading, and can I somehow avoid it?

As I understand it, the JPEG 2000 file layout is progressive, from lowest resolution to highest, and its compression means that to read a single tile at full resolution you need to read through the progression for that tile. That takes several file reads, which on S3 means several HTTP requests, so more data is transferred and/or the read takes longer.

Cloud Optimized GeoTIFFs store all metadata at the start of the file, so it can be fetched in a single read, and getting a tile involves only one additional read/request. The CloudOptimizedGeoTIFF wiki page has more details. There is no dependency between overview levels, so only the closest matching one needs to be read.
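As a rough sketch of why a tile costs only one extra request: once the header has been read, the TIFF TileOffsets and TileByteCounts tags give the exact byte span of every tile, so a reader can issue a single HTTP Range request for it. The function names below are made up for illustration, and a single-band image with square tiles in row-major order is assumed.

```python
def tile_index(col, row, tile_size, image_width):
    """Index of the tile containing pixel (col, row), in row-major tile order."""
    tiles_across = -(-image_width // tile_size)  # ceiling division
    return (row // tile_size) * tiles_across + (col // tile_size)


def range_header(tile_offsets, tile_byte_counts, index):
    """HTTP Range header value covering exactly one compressed tile."""
    start = tile_offsets[index]
    end = start + tile_byte_counts[index] - 1  # HTTP byte ranges are inclusive
    return "bytes=%d-%d" % (start, end)


# Made-up header values for a 1024-pixel-wide image with 256-pixel tiles.
offsets = [1000 + 500 * i for i in range(16)]
counts = [500] * 16
idx = tile_index(300, 600, 256, 1024)
print(idx, range_header(offsets, counts, idx))
```

GDAL does this bookkeeping internally. The sketch only shows that the header alone is enough to locate a tile.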

I am just using the OpenJPEG driver. Can this be the problem, and will a proprietary driver solve the problem I am describing?

Kakadu is at least as bad in my experience, if not worse. One of the core GDAL developers who has developed the Cloud Optimised GeoTIFF approach has been working on OpenJPEG, so I suspect OpenJPEG will be the best optimised.

Or do I simply have to bite the bullet and use Cloud Optimized GeoTIFFs to access the windows faster, at the cost of some extra file size?

Several compression methods are supported in TIFF files. If you need lossless compression, try the ZSTD support added in GDAL 2.3 (availability depends on how GDAL and libtiff were built). Otherwise, JPEG compression with YCbCr performs very well on 8-bit RGB imagery, and file sizes will likely be similar to JP2. See the GeoTIFF format page for more details on available options. Tiles are all compressed independently. Remember to compress your overviews too!
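A minimal sketch of the usual two-step recipe, written as a Python helper that only assembles the gdaladdo and gdal_translate command lines. The function name and default values are assumptions for illustration. Note that gdaladdo adds overviews to the source file in place, so the source should be a GeoTIFF working copy. With COPY_SRC_OVERVIEWS=YES the overviews should be written to the output with the same creation options, which is one way to get them compressed as well.

```python
def cog_commands(src, dst, compress="DEFLATE", blocksize=256,
                 levels=(2, 4, 8, 16)):
    """Build the command lines for a tiled, compressed GeoTIFF with overviews."""
    # Step 1: add overviews to the source file (modifies src in place).
    build_overviews = ["gdaladdo", "-r", "average", src]
    build_overviews += [str(level) for level in levels]

    # Step 2: rewrite as a tiled, compressed file, carrying the overviews over.
    translate = [
        "gdal_translate",
        "-co", "TILED=YES",
        "-co", "BLOCKXSIZE=%d" % blocksize,
        "-co", "BLOCKYSIZE=%d" % blocksize,
        "-co", "COMPRESS=%s" % compress,
        "-co", "COPY_SRC_OVERVIEWS=YES",
        src, dst,
    ]
    return [build_overviews, translate]


for cmd in cog_commands("in.tif", "out.tif"):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) if the GDAL command-line tools are on the PATH. Passing compress="ZSTD" only works with a GDAL build that includes ZSTD support.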
