[GIS] Understanding the s2 library (geometry on the sphere, cells and Hilbert curve) for finding points nearby

geohashgeometrygooglespherical-geometry

I am having trouble understanding how the S2 library works so things in this question may make no sense, but I have a goal I would like to accomplish. Lets say I have many lat/lng point I want to store in a database. The database would be a simple key/value type database (e.g get("MyCoolKey")->returns->"MyCoolValue" ). Later based of a lat/lng point I want to lookup in the database to get all the nearest points (say within 100km).

If I was doing something similar with geohashes I would convert my lat/lng point into a geohash:

48.669, -4.329 -> gbsuv

Then I would store this geohash in the database. I would do the same for many other points. Then to find nearby points I would query for all points in that geohash and all the neighbor:

gbsvh   gbsvj   gbsvn
gbsuu   gbsuv   gbsuy
gbsus   gbsut   gbsuw

I have the following functions I can use to accomplish this goal of mine using s2: https://github.com/mapbox/node-s2/blob/master/API.md .

From what I understand I need to convert the lat/lng points into a cellid? Then I will be able to somehow get nearby points?

Using these available functions how would I then:

Take a lat/lng point and "convert" it
Then based of some lat/lng point find other points in the database

Best Answer

The S2 Geometry library is complex, but for your needs you can use a partial implementation, as some Javascript port.

You can use NodeJS installing the s2-geometry package (e. g. by npm install s2-geometry), or this CDN link for HTML pages.

Short answer

Use the S2.latLngToKey(lat,lon,level) to obtain key=YourCoolValue of your point

    var vals = [
      S2.latLngToKey(48.669, -4.329, 20), // reference
      S2.latLngToKey(48.668, -4.330, 20), // near
      S2.latLngToKey(49, -4.3, 20)        // far
    ]

The key is like a base4 Geohash or a tile-quadkey. It is a string with the face of the S2 Cube and face_pos, the hierarchical position (base4) of the cell in the hierarchical (Hilbert) grid. The result of this example is

[ '2/10002200003102120322',
  '2/10002200003131222211',
  '2/10002130111010302012' ]

Where you can see same big prefix for two near points, 2/100022000031 and only little prefix when the points are not near, 2/10002.

About the distance that the commom prefix represents, check the S2 cell Statistics: level 13 seems adequate for check 1km, that is the 13-digits

[ '2/1000220000310',
  '2/1000220000313',
  '2/1000213011101' ]

Tests

I done my tests using the Javascript S2 geometry library, and starting with a sample of well-known places, obtained by this Wikidata-query:

Group QID       Name                     latitude  longitude

A1    Q178114   Washington Monument      38.88948  -77.035244
A2    Q17300119 Josephine Shaw Fountain  40.754    -73.9841
A3    Q1060845  Delacorte Theater        40.7801   -73.968767
A3    Q160409   Central Park             40.7825   -73.966111
A3    Q19473784 Alexander Hamilton       40.781028 -73.964556
B     Q208760   Merlion                  1.2870222  103.854689
B     Q6819812  Merlion Park             1.28683    103.855

Note: to add other objects or check more details use QID. For instance the first sample have QID=178114 so you can check details with http://wikidata.org/entity/ URL: http://wikidata.org/entity/Q178114

Complete Javascript code, with sample data and illustrating options for encoding:

'use strict';
var S2 = require('s2-geometry').S2; // this line for NodeJS

var geosamples= [
  {"g":"A1","item":"Q178114",  "iso3166":"US","lat":"38.889475","lon":"-77.035244444","name":"Washington Monument"},
  {"g":"A2","item":"Q17300119","iso3166":"US","lat":"40.754","lon":"-73.9841","name":"Josephine Shaw Lowell Memorial Fountain"},
  {"g":"A3","item":"Q1060845", "iso3166":"US","lat":"40.7801","lon":"-73.968766666","name":"Delacorte Theater"},
  {"g":"A3","item":"Q160409",  "iso3166":"US","lat":"40.7825","lon":"-73.966111111","name":"Central Park"},
  {"g":"A3","item":"Q19473784","iso3166":"US","lat":"40.781027777","lon":"-73.964555555","name":"Alexander Hamilton"},
  {"g":"B", "item":"Q208760",  "iso3166":"SG","lat":"1.287022222","lon":"103.854688888","name":"Merlion"},
  {"g":"B", "item":"Q6819812", "iso3166":"SG","lat":"1.28683","lon":"103.855","name":"Merlion Park"}
];

function show(level=13) {
  console.log("--- Level ",level)
  let face,face_pos;
  let j=1
  for (const i of geosamples) {
        var key = S2.latLngToKey(i.lat, i.lon, level); // base4 hiearchy
        let id = BigInt( S2.keyToId(key) ); // Cell ID is a 64 bits integer
        let idHex = id.toString(16)  // compact human-readable complete code
        console.log(i.g+"\t"+j,i.name.slice(0,12)+":\t" + idHex+"\t" key);
        j++
  }
}
show();
show(20);

Results:

g   Sample          idHex              Face/Face_pos 

--- Level  13
A1  1 Washington M: 89b7b7a400000000    4/1031233123310
A2  2 Josephine Sh: 89c259ac00000000    4/1032010230311
A3  3 Delacorte Th: 89c2589400000000    4/1032010230102
A3  4 Central Park: 89c2589c00000000    4/1032010230103
A3  5 Alexander Ha: 89c2589400000000    4/1032010230102
B   6 Merlion:      31da190c00000000    1/2032310030201
B   7 Merlion Park: 31da190c00000000    1/2032310030201
--- Level  20
A1  1 Washington M: 89b7b7a1bdf00000    4/10312331233100313233
A2  2 Josephine Sh: 89c259aa96100000    4/10320102303111102300
A3  3 Delacorte Th: 89c25891ac900000    4/10320102301020311210
A3  4 Central Park: 89c2589a08900000    4/10320102301031001010
A3  5 Alexander Ha: 89c2589747d00000    4/10320102301023220332
B   6 Merlion:      31da19085c300000    1/20323100302010023201
B   7 Merlion Park: 31da190860700000    1/20323100302010030003

The column idHex is the standard hexadecimal (base16) representation of the S2 Geometry Cell identifier (cell ID). The column Face/Key is an alternative representation of the same cell ID.

Each cell of the S2 Geometry grid have an unique number (64 bits unsigned integer) used to identify ifself, that is a structure of bits expressing the position in the Hilbert curve (Face_pos) of a cube face.

You can use as geocode (said "MyCoolValue" in the question) any one, the idHex or the key=Face/Face_pos.

Cell ID and Cell Key have the same information, but ID mix the face and face_pos, and key not. Face is explicit and each digit of face_pos is a level in the grid structure.

Complete answer

The main advantage of S2 Geometry over Geohash is uniformity, the (near) constant shape and area of the S2 Geometry cells. A grid of equal-area is very important in statistics and another applications, see this Open Geospatial Consortium standard about the theme.

There are some (minor) advantages of Hilbert curve (S2) over Z-order curve (Geohash), but no one is perfect... In both, is possible two neighbors in the grid that have identifiers (keys) very different.

For application where you need 100% reliable result, use also the functions like GetEdgeNeighbors() of the s2-geometry package.

Suggestion for base16 encoding

About convert "Face/Face_pos" to hexadecimal, is possible for example convert the base4 10002200 to hexa 40a0, but base4 with more one digit will result in entirely different (or invalid) hexadecimal; for example 100022003 results in 102801... The "more one digit" ocurrs when you compare for example level 18 keys with level 19 keys.

To avoid this problem, an extra digit must be added by a extend base16 algorithm.

function base4_to_base16(str) {
  const tr = {"0":"00","1":"01","2":"10","3":"11"}
  let strBin=''
  for(let i of str.split('')) strBin += tr[i]
  return BigInt('0b'+strBin).toString(16)
} 

function base4_to_base16h(str) {
  const tr = {"0":"J","1":"K","2":"L","3":"M"}
  let len = str.length
  if (len % 2 == 0)
    return base4_to_base16(str);
  let h = base4_to_base16( str.slice(0,-1) )  // cut last
  return h+tr[ str.slice(-1) ]
} 

var key = S2.latLngToKey(i.lat, i.lon, level);
[face,face_pos] = key.split('/');
keyHex = face + base4_to_base16h(face_pos)

Comparative results

g   Sample           Face/Face_pos       = keyHex       idHex

--- Level  18
A1  1 Washington M: 4/103123312331003132 = 44dbdbd0de   89b7b7a1bd000000
A2  2 Josephine Sh: 4/103201023031111023 = 44e12cd54b   89c259aa97000000
A3  3 Delacorte Th: 4/103201023010203112 = 44e12c48d6   89c25891ad000000
A3  4 Central Park: 4/103201023010310010 = 44e12c4d04   89c2589a09000000
A3  5 Alexander Ha: 4/103201023010232203 = 44e12c4ba3   89c2589747000000
B   6 Merlion   :   1/203231003020100232 = 18ed0c842e   31da19085d000000
B   7 Merlion Park: 1/203231003020100300 = 18ed0c8430   31da190861000000
--- Level  19
A1  1 Washington M: 4/1031233123310031323 = 44dbdbd0deM 89b7b7a1bdc00000
A2  2 Josephine Sh: 4/1032010230311110230 = 44e12cd54bJ 89c259aa96400000
A3  3 Delacorte Th: 4/1032010230102031121 = 44e12c48d6K 89c25891acc00000
A3  4 Central Park: 4/1032010230103100101 = 44e12c4d04K 89c2589a08c00000
A3  5 Alexander Ha: 4/1032010230102322033 = 44e12c4ba3M 89c2589747c00000
B   6 Merlion   :   1/2032310030201002320 = 18ed0c842eJ 31da19085c400000
B   7 Merlion Park: 1/2032310030201003000 = 18ed0c8430J 31da190860400000
--- Level  22
A1  1 Washington M: 4/1031233123310031323331 = 44dbdbd0defd 89b7b7a1bdfb0000
A2  2 Josephine Sh: 4/1032010230311110230000 = 44e12cd54b00 89c259aa96010000
A3  3 Delacorte Th: 4/1032010230102031121011 = 44e12c48d645 89c25891ac8b0000
A3  4 Central Park: 4/1032010230103100101000 = 44e12c4d0440 89c2589a08810000
A3  5 Alexander Ha: 4/1032010230102322033211 = 44e12c4ba3e5 89c2589747cb0000
B   6 Merlion   :   1/2032310030201002320132 = 18ed0c842e1e 31da19085c3d0000
B   7 Merlion Park: 1/2032310030201003000332 = 18ed0c84303e 31da1908607d0000

The idHex can be reduced with the trailing 0's removed (e.g. 89b7b7a1bd000000 reduced to 89b7b7a1bd). The information into keyHex and idHex are same, differing only in one bit.

The keyHex is better: represents the same cell with the same number of digits, but with a evident relationship with "Face/Face_pos" and always preserving prefix, with no distortions. The hexadecimal (base16) representation is aligned with the base4.

Notice the folowing idHex in the table, to see the distortions:

sample A1 (no distorcion). Level 18 is 89b7b7a1bd and level 19 is 89b7b7a1bdc, so concatenating one digit c.
sample A2 (distorcion). Level 18 is 89c259aa97 and level 19 is 89c259aa964, changed final 7 to 6 and concatenate 4.

Related Solutions

Geohash for Proximity Searches – GIS Principles and Techniques

Absolutely you can. And it can be quite fast. (The intensive computation bits can ALSO be distributed)

There are several ways, but one way that I've been working with is in using an ordered list of integer-based geohashes, and finding all the nearest neighbour geohash ranges for a specific geohash resolution (the resolution approximates your distance criteria), and then querying those geohash ranges to get a list of nearby points. I use redis and nodejs (ie. javascript) for this. Redis is super fast and can retrieve ordered ranges very quickly, but it can't do a lot of the indexing query manipulation stuff that SQL databases can do.

The method is outlined here: https://github.com/yinqiwen/ardb/wiki/Spatial-Index

But the gist of it is (to paraphrase the link):

You store all your geohashed points in the best resolution you want (max usually 64bit integer if that's accessible, or in the case of javascript, 52bits) in an ordered set (ie. zset in redis). Most geohash libraries these days have geohash integer functions built in, and you'll need to use these instead of the more common base32 geohashes.
Based on the radius you want to search within, you need to then find a bit depth/resolution that will match your search area and this must be less than or equal to your stored geohash bit depth. The linked site has a table that correlates the bit depth of a geohash to its bounding box area in meters.
Then you rehash your original coordinate at this lower resolution.
At that lower resolution also find the 8 neighbour (n, ne, e, se, s, sw, w, nw) geohash areas. The reason why you have to do the neighbour method, is because two coordinates nearly right beside each other could have completely different geohashes, so you need to do some averaging of the area covered by the search.
Once you get all the neighbour geohashes at this lower resolution, add to the list your coordinate's geohash from step 3.
Then you need to build a range of geohash values to search within which cover these 9 areas. The values from step 5 are your lower range limit, and if you add 1 to each of them, you'll get your upper range limit. So you should have an array of 9 ranges, each with a lower limit and and upper geohash limit (18 geohashes in total). These geohashes are still in that lower resolution from step 2.
Then you convert all 18 of these geohashes to whatever bit depth/resolution you have stored all your geohashes in your database in. Generally you do this by bitshifting it to the desired bit depth.
Now you can do a range query for points within these 9 ranges and you'll get all points approximately within the distance of your original point. There will be no overlap so you don't need to do any intersections, just pure range queries, very fast. (ie. in redis: ZRANGEBYSCORE zsetname lowerLimit upperLimit, over the 9 ranges produced in this step)

You can further optimize (speed wise) this by:

Taking those 9 ranges from step 6 and finding where they lead into each other. Usually you can reduce 9 separate ranges into about 4 or 5 depending on where your coordinate is. This can reduce your query time by half.
Once you have your final ranges, you should hold them for reuse. The calculation of these ranges can take most of the processing time, so if your original coordinate doesn't change much but you need to make the same distance query over again, you should keep that ready instead of calculating it everytime.
If you're using redis, try to combine the queries into a MULTI/EXEC so it pipelines them for a bit better performance.
The BEST part: You can distribute steps 2-7 on clients instead of having that computation done all in one place. This greatly reduces CPU load in situations where millions of requests would be coming in.

You can further improve accuracy by using a circle distance/haversine type function on the returned results if you care much about precision.

Here's a similar technique using ordinary base32 geohashes and a SQL query instead of redis: https://github.com/davetroy/geohash-js

I don't mean to plug my own thing, but I've written a module for nodejs&redis that makes this really easy to implement. Have a look at the code if you'd like: https://github.com/arjunmehta/node-georedis

[GIS] Specifying a rectangle as two latitude/longitude pairs, one for center and one for size

From the code it appears that they are using the size to construct half of the rectangle, and then possibly mirroring this half rectangle about the center point, thereby creating the full rectangle.

func RectFromCenterSize(center, size LatLng) Rect {
        half := LatLng{size.Lat / 2, size.Lng / 2}
        return RectFromLatLng(center).expanded(half)