[GIS] Optimisation of MySQL JOIN based on a Within match

attribute-joinsMySQLwithin

I have a table of points and boundaries, and am trying to add the boundary ID which a point is within. However, using a LEFT JOIN ON Within(location, boundary) it is taking about 3.5 hours match 450,000 points against 350 boundaries. Is there a way to optimise this join?

In more detail:

I have two tables in MySQL 5.6, one of which contains points, each stored as a point and the other containing boundaries, each stored as a geometry:

-- Table of locations, around 0.5 million points
CREATE TABLE locations (
  id INT(11) NOT NULL PRIMARY KEY,
  longitude float(11,6) DEFAULT NULL,
  latitude float(10,6) DEFAULT NULL,
  lonLat point NOT NULL DEFAULT '',
  boundaryId INT(11) DEFAULT NULL
) ENGINE=MyISAM;

-- Populate the lonLat field
UPDATE locations SET lonLat = POINTFROMTEXT(CONCAT('point(',longitude,' ',latitude,')')) WHERE longitude IS NOT NULL AND latitude IS NOT NULL;

-- Add spatial index on lonLat
ALTER TABLE locations ADD SPATIAL INDEX(lonLat);

-- Table of around 350 exact boundaries, some overlapping
CREATE TABLE IF NOT EXISTS boundaries (
  id INT(11) NOT NULL PRIMARY KEY,
  llgeom geometry NOT NULL
);

-- Add spatial index on boundary llgeom:
ALTER TABLE `boundaries` ADD SPATIAL(`llgeom`);

I have a query which updates the location table with the boundary ID that the point for that row is within:

UPDATE locations
LEFT JOIN boundaries ON Within(lonLat, llgeom)
SET boundaryId = boundaries.id;

Note that both lonLat and llgeom both have spatial indexes on them already.

With around 450,000 points and 350 geometries, running on MySQL 5.6, this takes around 3.5 hours. Doing a test limited to just 14 rows takes about 2.1 seconds.

If I run an EXPLAIN, this shows that no indexing is being used:

mysql> explain UPDATE locations LEFT JOIN boundaries ON Within(lonLat, llgeom) SET boundaryId = boundaries.id;
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | locations | ALL | NULL | NULL | NULL | NULL | 451010 | NULL |
| 1 | SIMPLE | boundaries | ALL | NULL | NULL | NULL | NULL | 353 | Using where; Using join buffer (Block Nested Loop) |

This shows that type is ALL in both cases, which is "the worst join type and usually indicates the lack of appropriate indexes on the table."

Is there some improvement I can make which will give much better performance, using indexes?

NB Using the ST_Within function (which gives true boundaries, rather than simplified bounding-box matching) for those same 14 rows takes much longer, 83 seconds:

UPDATE locations
LEFT JOIN boundaries ON ST_Within(lonLat, llgeom)
SET boundaryId = boundaries.id;

However, I have a routine called reallyWithin which has the same result but takes about 2.3 seconds. But whichever of the three functions is used (the bbox Within, the procedure reallyWithin, or the official ST_Within), this works out too slow for 450,000 points.

Best Answer

I had similar problem.
I solved with a procedure.
Try:

BEGIN
  DECLARE b, loc_id INT;
  DECLARE loc_point point;
  DECLARE cur_1 CURSOR FOR SELECT lonLat, id FROM locations;
  DECLARE CONTINUE HANDLER FOR NOT FOUND
  SET b = 1;
  OPEN cur_1;
  REPEAT
    FETCH cur_1 INTO loc_point, loc_id;
    BEGIN
        DECLARE a TEXT;
        DECLARE cur_2 CURSOR FOR SELECT id FROM boundaries WHERE Within(loc_point, boundaries.llgeom);
        OPEN cur_2;
        FETCH cur_2 INTO a;
        UPDATE locations SET boundaryId = a WHERE id = loc_id;
        CLOSE cur_2;
    END;
    UNTIL b = 1
  END REPEAT;
  CLOSE cur_1;
END
Related Question