Matrices – Efficient Ways to Check if Points are Within Polygons for Large Datasets

Tags: matrices, polygons

Perhaps this question should be elsewhere, but as it's mainly a mathematical problem I'm posting it here.

I have a dataset which looks like this:

[
  [ 242841.86914496083,
    1090.0027001134586,
    11711.344635666988,
    142639.20305005432 ],
  [ 212841.86914496083,
    128007.0027001134586,
    11711.344635666988,
    142639.20305005432 ],
  ...
]

Each array is called an "event" and in this case has 4 "parameters". I want to plot these events on multiple scatter charts. For example, on the first scatter chart I'll plot the first and second parameter of each event, so the (x, y) point for the first event is (242841.86914496083, 1090.0027001134586), for the second event it is (212841.86914496083, 128007.0027001134586), and so on. On the second chart I'll plot the third and fourth parameter of each event, so the (x, y) point for the first event is (11711.344635666988, 142639.20305005432), for the second event it is (11711.344635666988, 142639.20305005432), and so on.
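Concretely, extracting the (x, y) pairs for a given chart is just an index selection per event. A minimal sketch (`pointsForChart` is my own name, not an existing function):

```javascript
// Two sample events from the question, each with 4 parameters.
const events = [
  [242841.86914496083, 1090.0027001134586, 11711.344635666988, 142639.20305005432],
  [212841.86914496083, 128007.0027001134586, 11711.344635666988, 142639.20305005432],
];

// Pick two parameter indices to use as (x, y) for one scatter chart.
// Chart 1 uses parameters 0 and 1; chart 2 uses parameters 2 and 3.
function pointsForChart(events, xIndex, yIndex) {
  return events.map(e => [e[xIndex], e[yIndex]]);
}

const chart1 = pointsForChart(events, 0, 1);
console.log(chart1[0]); // first event's point on chart 1
```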

In the example I'm giving, the range of each parameter is 0 to 262144.

The scatter charts are 200px × 200px but can be any pixel size up to 2000px; they are always square.
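Since the charts are square, mapping a parameter value onto a pixel coordinate is a single linear scale. A minimal sketch, assuming the 0–262144 range above (`toPixel` is a made-up name, and screen-y inversion is ignored):

```javascript
// Upper end of the data range, taken from the question.
const RANGE = 262144;

// Map a data value in [0, RANGE] onto a pixel index in [0, size - 1]
// for a square chart of side `size` (e.g. 200 or 2000).
function toPixel(value, size) {
  return Math.round((value / RANGE) * (size - 1));
}

console.log(toPixel(131072, 200)); // mid-range value lands near the chart centre
```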

When plotting, I want to work out which points are within a polygon. Here is a sample polygon:

[
    [200000, 100000],
    [230000, 180000],
    [200000, 200000],
    [200000, 230000],
    [110000, 220000],
    [100000, 180000]
]

The points within the polygon are to be plotted in a different color from the points outside it. The polygon is applied to scatter charts 1 and 2, but it's the first and second parameter of each event that is used to figure out whether the event is within the polygon and what color it should be on both charts. E.g. say the point for event 2, (212841.86914496083, 128007.0027001134586), is within the polygon; this means that event will appear in red on both charts.

I have a simple algorithm which runs through each point and checks if it's within the polygon. I use https://www.npmjs.com/package/point-in-polygon.
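For reference, the test that package performs is essentially the standard ray-casting (even-odd) check, which can be sketched directly. This is a hedged sketch, not the package's actual source; the sample polygon above is reused and the test points are my own choices:

```javascript
// Ray-casting test: shoot a horizontal ray from the point and count how many
// polygon edges it crosses. An odd count means the point is inside.
function pointInPolygon(point, polygon) {
  const [x, y] = point;
  let inside = false;
  for (let i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
    const [xi, yi] = polygon[i];
    const [xj, yj] = polygon[j];
    const crosses = (yi > y) !== (yj > y) &&
      x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// The sample polygon from the question.
const polygon = [
  [200000, 100000], [230000, 180000], [200000, 200000],
  [200000, 230000], [110000, 220000], [100000, 180000],
];

console.log(pointInPolygon([150000, 170000], polygon)); // → true
console.log(pointInPolygon([50000, 50000], polygon));   // → false
```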

Keep in mind, there are often multiple polygons (n), and these can be applied to different combinations of parameters. So I need to work out if an event is in all n polygons, or if it's in n-1 polygons, etc.

To sum up, what I need to know is:

  • percentage of total events within each polygon
  • number of events within a polygon
  • number of events within n, n-1 etc polygons
  • x value mean of the events within each polygon
  • y value mean of the events within each polygon
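All of these statistics can be collected in a single pass over the events, which at least avoids re-scanning the data once per statistic. A sketch under assumed names: `inPolygon` stands in for whatever point-in-polygon test is used, and `gatherStats` is my own function:

```javascript
// One pass over the events, accumulating per-polygon counts and coordinate
// sums, plus a histogram of how many polygons each event falls inside.
function gatherStats(events, polygons, inPolygon) {
  const perPolygon = polygons.map(() => ({ count: 0, sumX: 0, sumY: 0 }));
  const byHitCount = new Array(polygons.length + 1).fill(0);

  for (const e of events) {
    let hits = 0;
    polygons.forEach((poly, i) => {
      if (inPolygon(e, poly)) {
        hits += 1;
        perPolygon[i].count += 1;
        perPolygon[i].sumX += e[0];
        perPolygon[i].sumY += e[1];
      }
    });
    byHitCount[hits] += 1; // events inside exactly `hits` polygons
  }

  return {
    perPolygon: perPolygon.map(s => ({
      count: s.count,
      percentage: (100 * s.count) / events.length,
      meanX: s.count ? s.sumX / s.count : null,
      meanY: s.count ? s.sumY / s.count : null,
    })),
    byHitCount,
  };
}

// Toy usage with axis-aligned boxes [xmin, xmax, ymin, ymax] standing in
// for polygons, just to exercise the bookkeeping:
const boxTest = (e, box) =>
  e[0] >= box[0] && e[0] <= box[1] && e[1] >= box[2] && e[1] <= box[3];
const result = gatherStats([[1, 1], [5, 5]], [[0, 2, 0, 2]], boxTest);
console.log(result.perPolygon[0].count); // → 1
```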

My algorithm performs well for up to 10,000 events, but beyond that it hits issues. Some of my users have files that contain 1 million events, and my algorithm takes minutes.

How can I work out what I need more efficiently? Binning the data is an option, but then I lose accuracy in the results. E.g. for a 200 by 200 pixel image I could create 200 bins for each axis. The problem is I then lose accuracy on the x and y mean values. Also, users can change to a 201 × 201 pixel image, and now I need to work out 201 bins, etc.

After file upload, these files are on the server and can be pre-processed before the graphs are generated. Is there anything that can be done mathematically to improve performance for a large number of events?

EDIT:
To be clear – figuring out whether a point is in a polygon is not the most performance-intensive part. The most performance-intensive part is looping through 1 million events in order to plot them on a, say, 400px by 400px image. The solution probably lies somewhere in binning the data – but if I make 400 bins and then the user, on the fly, changes the image size to 410px, I now need to make 410 bins from the ORIGINAL dataset. Is there some way around this?

EDIT
So the data is known in advance. It sits on my server and can be manipulated in any way. The size of the image and the polygons are not known in advance – users draw the polygons, and the image must update on the fly. With, say, 1 million events this is not possible – the browser usually crashes. I can bin the data, but if I create 200 bins (the standard image is 200 × 200) and the user then makes the image bigger, e.g. 210 × 210, I have to cycle through 1 million events again to create 210 bins, which is obviously a big performance issue.

Best Answer

Here is an algorithm that I would generically expect to give you a huge performance boost. I can't quantify the performance boost without making additional assumptions (such as the probability distribution of the data points or the size/shape of the polygons). But in realistic cases I would expect it to be much faster.

I'm going to make the following assumptions:

  1. The polygons are not known in advance.
  2. The dataset is known in advance.
  3. The image resolution is not known in advance. (You mentioned the user changing it suddenly.)

The current problem is that if there are $p$ polygons and $n$ data points, then the naive approach (testing every data point against every polygon independently) scales like $O(pn)$, so it's terrible when $p$ and $n$ are both large. With the assumptions above, there isn't an obvious way to reduce the $p$ factor, but you can do something about the $n$ factor.

My suggestion basically amounts to binning information about the data, but in a resolution-independent way.

  1. Before the user runtime, put the data points into a k-d tree. This is a binary search tree that partitions the data into smaller and smaller geometric regions in a very generic way. (Here, $k$ is the dimensionality of the space the points are in, so $k=2$.)

  2. For every node in the tree, store two things:

    • The number of points belonging to that branch of the tree.
    • The bounding box of all of these points.
  3. At runtime, iterate over all polygons and over the nodes in the tree. For the node iteration, descend through the tree depth-first. At each node, check whether the bounding box of the node is completely inside the polygon, completely outside the polygon, or neither. (See step 4 for a suggested implementation.) Use the following to respond to each case:

    • Completely inside the polygon: In this case, count all of the points on this branch of the tree as being inside the polygon and stop descending further into this branch.
    • Completely outside the polygon: Stop descending further into this branch.
    • Neither: Keep descending and repeat this check.
  4. Suggested algorithm for checking containment of bounding box and polygon: For every edge $e$ of polygon $p$, check whether $e$ intersects any of the four sides of the bounding box. If any of them do, immediately put this node in the "neither" category (there is partial overlap between $p$ and the bounding box).

    If there are no intersections, then the bounding box must be completely inside or completely outside of the polygon. To decide which is which, you can check whether any point in the bounding box is inside or outside the polygon (which you already know how to do for a point).

    Three things to note about this item:

    • I am not an expert at computational geometry and this part can probably be improved.
    • I am assuming that you have a function that can check efficiently whether two line segments intersect. This is a problem that has been discussed elsewhere (for example, here).
    • When you get to a leaf node of the tree, then the node corresponds to a point rather than a rectangle, which is much simpler to check. (You might want to insert special handling for this; I'm not sure if it is worth it.)
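The whole scheme above can be sketched in code. This is a hedged sketch under the stated assumptions: all function names are mine, the segment-intersection test ignores collinear/degenerate cases, and the naive `Math.min(...)` spreads should become a single loop for millions of points:

```javascript
// Steps 1-2: build a 2-d tree; every node stores the number of points in its
// subtree and their bounding box. (Sorting at every level is a one-off
// pre-processing cost, since the data is known in advance.)
function buildTree(points, depth = 0) {
  if (points.length === 0) return null;
  const axis = depth % 2; // alternate between x and y
  const sorted = [...points].sort((a, b) => a[axis] - b[axis]);
  const mid = Math.floor(sorted.length / 2);
  return {
    point: sorted[mid],
    left: buildTree(sorted.slice(0, mid), depth + 1),
    right: buildTree(sorted.slice(mid + 1), depth + 1),
    count: sorted.length,
    bbox: {
      minX: Math.min(...sorted.map(p => p[0])),
      maxX: Math.max(...sorted.map(p => p[0])),
      minY: Math.min(...sorted.map(p => p[1])),
      maxY: Math.max(...sorted.map(p => p[1])),
    },
  };
}

// Proper segment intersection via orientation signs (collinear cases ignored).
function orient(a, b, c) {
  return Math.sign((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]));
}
function segmentsIntersect(a, b, c, d) {
  return orient(a, b, c) !== orient(a, b, d) && orient(c, d, a) !== orient(c, d, b);
}

// Standard ray-casting point-in-polygon test.
function pointInPolygon(pt, poly) {
  const [x, y] = pt;
  let inside = false;
  for (let i = 0, j = poly.length - 1; i < poly.length; j = i++) {
    const [xi, yi] = poly[i];
    const [xj, yj] = poly[j];
    if ((yi > y) !== (yj > y) && x < ((xj - xi) * (y - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

// Step 4: classify a node's bounding box as 'inside', 'outside' or 'partial'.
function classify(bbox, poly) {
  const corners = [
    [bbox.minX, bbox.minY], [bbox.maxX, bbox.minY],
    [bbox.maxX, bbox.maxY], [bbox.minX, bbox.maxY],
  ];
  for (let i = 0; i < poly.length; i++) {
    const e1 = poly[i];
    const e2 = poly[(i + 1) % poly.length];
    for (let j = 0; j < 4; j++) {
      if (segmentsIntersect(e1, e2, corners[j], corners[(j + 1) % 4])) {
        return 'partial'; // a polygon edge crosses the box
      }
    }
  }
  // No edge crossings. A polygon entirely inside the box also produces no
  // crossings, so check one polygon vertex against the box first.
  const [px, py] = poly[0];
  if (px >= bbox.minX && px <= bbox.maxX && py >= bbox.minY && py <= bbox.maxY) {
    return 'partial';
  }
  // Otherwise the box is wholly on one side; one corner decides which.
  return pointInPolygon(corners[0], poly) ? 'inside' : 'outside';
}

// Step 3: count points inside `poly`, pruning whole branches where possible.
function countInPolygon(node, poly) {
  if (node === null) return 0;
  const c = classify(node.bbox, poly);
  if (c === 'inside') return node.count; // whole branch inside: stop descending
  if (c === 'outside') return 0;         // whole branch outside: stop descending
  return (pointInPolygon(node.point, poly) ? 1 : 0)
    + countInPolygon(node.left, poly)
    + countInPolygon(node.right, poly);
}
```

To also get the x/y means the question asks for, store per-node sums of x and y alongside `count` and accumulate them in exactly the same way: a branch classified as fully inside contributes its stored sums without being descended.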

The idea here is to prune off whole branches of the tree as soon as you possibly can. I expect that this will give a huge speedup immediately, but it's hard to prove this. If you don't see the speedup immediately and want to start investigating/profiling, then here are a few things to keep in mind.

  • Depth matters. Anything you can do to classify nodes further up the tree will help.
  • If the polygons have a lot of edges, then the algorithm I proposed in (4) might be slow, and there might be a better one.
  • The absolute worst-case scenario here is that you check all of the nodes in the tree. There are about twice as many nodes as points, and the check that I am proposing is more expensive than the check for a single point, so the behavior would actually get worse. But I cannot imagine this ever happening in a realistic case. (You would have to draw a polygon that looks like a really long and winding snake that passes very close to every point. Otherwise, there would be nodes that are classified as entirely inside or entirely outside of the polygon.)