[GIS] Understanding the s2 library (geometry on the sphere, cells and Hilbert curve) for finding points nearby

geohashgeometrygooglespherical-geometry

I am having trouble understanding how the S2 library works so things in this question may make no sense, but I have a goal I would like to accomplish. Lets say I have many lat/lng point I want to store in a database. The database would be a simple key/value type database (e.g get("MyCoolKey")->returns->"MyCoolValue" ). Later based of a lat/lng point I want to lookup in the database to get all the nearest points (say within 100km).

If I was doing something similar with geohashes I would convert my lat/lng point into a geohash:

48.669, -4.329 -> gbsuv

Then I would store this geohash in the database. I would do the same for many other points. Then to find nearby points I would query for all points in that geohash and all the neighbor:

gbsvh   gbsvj   gbsvn
gbsuu   gbsuv   gbsuy
gbsus   gbsut   gbsuw

I have the following functions I can use to accomplish this goal of mine using s2: https://github.com/mapbox/node-s2/blob/master/API.md .

From what I understand I need to convert the lat/lng points into a cellid? Then I will be able to somehow get nearby points?

Using these available functions how would I then:

  1. Take a lat/lng point and "convert" it
  2. Then based of some lat/lng point find other points in the database

Best Answer

The S2 Geometry library is complex, but for your needs you can use a partial implementation, as some Javascript port.

You can use NodeJS installing the s2-geometry package (e. g. by npm install s2-geometry), or this CDN link for HTML pages.

Short answer

Use the S2.latLngToKey(lat,lon,level) to obtain key=YourCoolValue of your point

    var vals = [
      S2.latLngToKey(48.669, -4.329, 20), // reference
      S2.latLngToKey(48.668, -4.330, 20), // near
      S2.latLngToKey(49, -4.3, 20)        // far
    ]

The key is like a base4 Geohash or a tile-quadkey. It is a string with the face of the S2 Cube and face_pos, the hierarchical position (base4) of the cell in the hierarchical (Hilbert) grid. The result of this example is

[ '2/10002200003102120322',
  '2/10002200003131222211',
  '2/10002130111010302012' ]

Where you can see same big prefix for two near points, 2/100022000031 and only little prefix when the points are not near, 2/10002.

About the distance that the commom prefix represents, check the S2 cell Statistics: level 13 seems adequate for check 1km, that is the 13-digits

[ '2/1000220000310',
  '2/1000220000313',
  '2/1000213011101' ]

Tests

I done my tests using the Javascript S2 geometry library, and starting with a sample of well-known places, obtained by this Wikidata-query:

Group QID       Name                     latitude  longitude

A1    Q178114   Washington Monument      38.88948  -77.035244
A2    Q17300119 Josephine Shaw Fountain  40.754    -73.9841
A3    Q1060845  Delacorte Theater        40.7801   -73.968767
A3    Q160409   Central Park             40.7825   -73.966111
A3    Q19473784 Alexander Hamilton       40.781028 -73.964556
B     Q208760   Merlion                  1.2870222  103.854689
B     Q6819812  Merlion Park             1.28683    103.855

Note: to add other objects or check more details use QID. For instance the first sample have QID=178114 so you can check details with http://wikidata.org/entity/ URL: http://wikidata.org/entity/Q178114

Complete Javascript code, with sample data and illustrating options for encoding:

'use strict';
var S2 = require('s2-geometry').S2; // this line for NodeJS

var geosamples= [
  {"g":"A1","item":"Q178114",  "iso3166":"US","lat":"38.889475","lon":"-77.035244444","name":"Washington Monument"},
  {"g":"A2","item":"Q17300119","iso3166":"US","lat":"40.754","lon":"-73.9841","name":"Josephine Shaw Lowell Memorial Fountain"},
  {"g":"A3","item":"Q1060845", "iso3166":"US","lat":"40.7801","lon":"-73.968766666","name":"Delacorte Theater"},
  {"g":"A3","item":"Q160409",  "iso3166":"US","lat":"40.7825","lon":"-73.966111111","name":"Central Park"},
  {"g":"A3","item":"Q19473784","iso3166":"US","lat":"40.781027777","lon":"-73.964555555","name":"Alexander Hamilton"},
  {"g":"B", "item":"Q208760",  "iso3166":"SG","lat":"1.287022222","lon":"103.854688888","name":"Merlion"},
  {"g":"B", "item":"Q6819812", "iso3166":"SG","lat":"1.28683","lon":"103.855","name":"Merlion Park"}
];

function show(level=13) {
  console.log("--- Level ",level)
  let face,face_pos;
  let j=1
  for (const i of geosamples) {
        var key = S2.latLngToKey(i.lat, i.lon, level); // base4 hiearchy
        let id = BigInt( S2.keyToId(key) ); // Cell ID is a 64 bits integer
        let idHex = id.toString(16)  // compact human-readable complete code
        console.log(i.g+"\t"+j,i.name.slice(0,12)+":\t" + idHex+"\t" key);
        j++
  }
}
show();
show(20);

Results:

g   Sample          idHex              Face/Face_pos 

--- Level  13
A1  1 Washington M: 89b7b7a400000000    4/1031233123310
A2  2 Josephine Sh: 89c259ac00000000    4/1032010230311
A3  3 Delacorte Th: 89c2589400000000    4/1032010230102
A3  4 Central Park: 89c2589c00000000    4/1032010230103
A3  5 Alexander Ha: 89c2589400000000    4/1032010230102
B   6 Merlion:      31da190c00000000    1/2032310030201
B   7 Merlion Park: 31da190c00000000    1/2032310030201
--- Level  20
A1  1 Washington M: 89b7b7a1bdf00000    4/10312331233100313233
A2  2 Josephine Sh: 89c259aa96100000    4/10320102303111102300
A3  3 Delacorte Th: 89c25891ac900000    4/10320102301020311210
A3  4 Central Park: 89c2589a08900000    4/10320102301031001010
A3  5 Alexander Ha: 89c2589747d00000    4/10320102301023220332
B   6 Merlion:      31da19085c300000    1/20323100302010023201
B   7 Merlion Park: 31da190860700000    1/20323100302010030003

The column idHex is the standard hexadecimal (base16) representation of the S2 Geometry Cell identifier (cell ID). The column Face/Key is an alternative representation of the same cell ID.

Each cell of the S2 Geometry grid have an unique number (64 bits unsigned integer) used to identify ifself, that is a structure of bits expressing the position in the Hilbert curve (Face_pos) of a cube face.

You can use as geocode (said "MyCoolValue" in the question) any one, the idHex or the key=Face/Face_pos.

Cell ID and Cell Key have the same information, but ID mix the face and face_pos, and key not. Face is explicit and each digit of face_pos is a level in the grid structure.


Complete answer

The main advantage of S2 Geometry over Geohash is uniformity, the (near) constant shape and area of the S2 Geometry cells. A grid of equal-area is very important in statistics and another applications, see this Open Geospatial Consortium standard about the theme.

There are some (minor) advantages of Hilbert curve (S2) over Z-order curve (Geohash), but no one is perfect... In both, is possible two neighbors in the grid that have identifiers (keys) very different.

For application where you need 100% reliable result, use also the functions like GetEdgeNeighbors() of the s2-geometry package.

Suggestion for base16 encoding

About convert "Face/Face_pos" to hexadecimal, is possible for example convert the base4 10002200 to hexa 40a0, but base4 with more one digit will result in entirely different (or invalid) hexadecimal; for example 100022003 results in 102801... The "more one digit" ocurrs when you compare for example level 18 keys with level 19 keys.

To avoid this problem, an extra digit must be added by a extend base16 algorithm.

function base4_to_base16(str) {
  const tr = {"0":"00","1":"01","2":"10","3":"11"}
  let strBin=''
  for(let i of str.split('')) strBin += tr[i]
  return BigInt('0b'+strBin).toString(16)
} 

function base4_to_base16h(str) {
  const tr = {"0":"J","1":"K","2":"L","3":"M"}
  let len = str.length
  if (len % 2 == 0)
    return base4_to_base16(str);
  let h = base4_to_base16( str.slice(0,-1) )  // cut last
  return h+tr[ str.slice(-1) ]
} 

var key = S2.latLngToKey(i.lat, i.lon, level);
[face,face_pos] = key.split('/');
keyHex = face + base4_to_base16h(face_pos)

Comparative results

g   Sample           Face/Face_pos       = keyHex       idHex

--- Level  18
A1  1 Washington M: 4/103123312331003132 = 44dbdbd0de   89b7b7a1bd000000
A2  2 Josephine Sh: 4/103201023031111023 = 44e12cd54b   89c259aa97000000
A3  3 Delacorte Th: 4/103201023010203112 = 44e12c48d6   89c25891ad000000
A3  4 Central Park: 4/103201023010310010 = 44e12c4d04   89c2589a09000000
A3  5 Alexander Ha: 4/103201023010232203 = 44e12c4ba3   89c2589747000000
B   6 Merlion   :   1/203231003020100232 = 18ed0c842e   31da19085d000000
B   7 Merlion Park: 1/203231003020100300 = 18ed0c8430   31da190861000000
--- Level  19
A1  1 Washington M: 4/1031233123310031323 = 44dbdbd0deM 89b7b7a1bdc00000
A2  2 Josephine Sh: 4/1032010230311110230 = 44e12cd54bJ 89c259aa96400000
A3  3 Delacorte Th: 4/1032010230102031121 = 44e12c48d6K 89c25891acc00000
A3  4 Central Park: 4/1032010230103100101 = 44e12c4d04K 89c2589a08c00000
A3  5 Alexander Ha: 4/1032010230102322033 = 44e12c4ba3M 89c2589747c00000
B   6 Merlion   :   1/2032310030201002320 = 18ed0c842eJ 31da19085c400000
B   7 Merlion Park: 1/2032310030201003000 = 18ed0c8430J 31da190860400000
--- Level  22
A1  1 Washington M: 4/1031233123310031323331 = 44dbdbd0defd 89b7b7a1bdfb0000
A2  2 Josephine Sh: 4/1032010230311110230000 = 44e12cd54b00 89c259aa96010000
A3  3 Delacorte Th: 4/1032010230102031121011 = 44e12c48d645 89c25891ac8b0000
A3  4 Central Park: 4/1032010230103100101000 = 44e12c4d0440 89c2589a08810000
A3  5 Alexander Ha: 4/1032010230102322033211 = 44e12c4ba3e5 89c2589747cb0000
B   6 Merlion   :   1/2032310030201002320132 = 18ed0c842e1e 31da19085c3d0000
B   7 Merlion Park: 1/2032310030201003000332 = 18ed0c84303e 31da1908607d0000

The idHex can be reduced with the trailing 0's removed (e.g. 89b7b7a1bd000000 reduced to 89b7b7a1bd). The information into keyHex and idHex are same, differing only in one bit.

The keyHex is better: represents the same cell with the same number of digits, but with a evident relationship with "Face/Face_pos" and always preserving prefix, with no distortions. The hexadecimal (base16) representation is aligned with the base4.

Notice the folowing idHex in the table, to see the distortions:

  • sample A1 (no distorcion). Level 18 is 89b7b7a1bd and level 19 is 89b7b7a1bdc, so concatenating one digit c.
  • sample A2 (distorcion). Level 18 is 89c259aa97 and level 19 is 89c259aa964, changed final 7 to 6 and concatenate 4.