ArcGIS Grid Aggregation – How to Superimpose Grid on Points and Aggregate Points Within Each Grid Cell

arcgis-10.0r

I have a vector of presence/absence values (1,0) and a matrix of x,y coordinate data (projected) specifying the spatial position of each value (about 25,000 in total). I'd like to aggregate the point data into 'cells' and calculate percentage values (% of presence) for each cell. However, I want the cells to be arbitrarily positioned with respect to the point data (i.e., not determined by census tract or any other factor), and with a fixed uniform size (0.5 square kilometers each).

So essentially, I'd like to superimpose a fixed-size grid over my point distribution and then calculate the percentage of 1s (presence) in each cell of the grid. I then want to repeat the proceedure many times, each time randomly altering the position of the grid with respect to the points, and recalculating the percentages of each cell. Ultimately, I will perform localized spatial tests on the percentage data.

I realize there are probably multiple ways to approach this using many tools. I have some familiarity with R and the 'spdep' package and basic famililarity with ArcGIS. I'd be grateful if someone could at least point me in the right direction for R or ArcGIS functions that may be useful.

Best Answer

There are indeed multiple ways within both ArcGIS and R. I'll sketch the two classes of solution, vector and raster.

Vector solution

A grid is determined by an origin O (which is a point in the plane, because gridding makes sense primarily in projected coordinate systems) and two basis vectors e and f: the vertices of the grid cells consist of points of the form o + i * e + j * f for integers i and j. Corresponding to any basis (e, f) is a pair of covectors (eta, phi) which are dual to the basis in the sense that

eta( e ) = 1, eta( f ) = 0

phi( e ) = 0, phi( f ) = 1.

(Their coefficients are found by inverting the matrix of coefficients of e and f.) The utility of using covectors derives from the easily deducible fact that they extract the coefficients of e and f from any point. To see this, consider any point in the grid coordinate system, which necessarily will be of the form

P = O + x e + y f, whence

eta( P ) = eta( O + x e + y f ) = eta( O ) + x eta( e ) + y eta( f )= eta( O ) + x * 1 + y * 0 = eta( O) + x

and similarly

phi( P ) = ... = phi( O ) + y.

Therefore, if we compute x0 = eta( O ) and y0 = phi( O ) once and for all, we can extract the coordinates x and y as

x = eta( P ) - x0,

y = phi( P ) - y0.

The floors of x and y give integers corresponding to the grid cell in which P lies. Therefore, you can do the following:

For each of your points P, compute i = Floor(eta( P ) - x0) and j = Floor(phi( P ) - y0). These are field calculations that can be performed in any database system (including a spreadsheet). They use only multiplication, addition, and the floor (or greatest integer) function.
Sum the 0/1 attribute using the concatenation of i and j as the key.
Also, in the same way as (2), count the number of points having each unique combination (i, j).
Divide the result of (2) by that of (3) (another field calculation).

To iterate, change the origin and, optionally, eta and phi. If you want to use square grids of cellsize c, then

Randomly generate u0 between 0 and c and, independently of that, generate v0 between 0 and c. Set O = (u0, v0).
Randomly generate an angle alpha between 0 and 90 degrees.
Set eta = (cos( alpha ) / c, sin( alpha ) / c). Let's call these coordinates (u, v).
Set phi to be perpendicular to eta and of the same length; a good choice (one of two) is

phi = (-v, u).

Here is an example. Suppose the cellsize is c = 30. Let the grid origin be randomly set to O = (13.2, 29.1) and suppose you don't want to rotate the grid, so you fix alpha = 0. Then

eta = (1/30, 0). so phi = (0, 1/30).

Eta contours

Phi contours

The upper map shows contours of eta (specifically, contours of z = eta(x,y) = x/30). The lower map shows contours of phi (specifically, contours of z = phi(x,y) = y/30). This illustrates how one of the covectors determines the columns and the other covector determines the rows of a grid.

From these you compute

x0 = eta( O ) = 1/30 * 13.2 + 0 * 29.1 = 13.2/30;

y0 = phi( O ) = 0 * 13.2 + 1/30 * 29.1 = 29.1/30.

Consider a point P = (400000, 3543220) (which could be in UTM coordinates for instance). Its cell is indexed by

i = Floor(1/30 * 400000 + 0 * 3543220 - 13.2/30) = 13332 and

j = Floor(0 * 400000 + 1/30 * 3543220 - 29.1/30) = 118106.

You could create a single key out of this pair by concatenating them with a delimiter, such as ",", producing the string "13333,118106".

Grid

The point P is white. The grid is obtained by overlaying the first two maps transparently: together, the contours of the two covectors describe the cells of the grid. Changing the grid merely requires changing its origin O and/or the basis covectors eta and phi, then recalculating the rows and columns of each point P.

Raster solution

The idea is the same as the vector solution, but you exploit the raster capabilities of the GIS. This makes rotation difficult, but you can arbitrarily change the origin in ArcGIS simply by changing the "Analysis Environment" for Spatial Analyst.

There are problems, though. What you would like to do in each iteration is

Convert the point data to a grid (having 0/1 values in the cells).
Perform an "aggregate" operation to summarize the values by groups of cells.

In (1), you need to use a cellsize so small that no two points end up in the same cell. This restriction is severe if there are clusters of points, but it is forced on us by ArcGIS limitations. If you use a cellsize that is an integral fraction of the aggregate size--such as 3 m when 30 m is needed for the aggregation--then everything will work out beautifully.

Comments

When the points are densely spaced throughout the grid extent but do not get very close to each other, the raster solution can be faster than the vector one. However, in general the vector solution will be more flexible (you can rotate the grids and there's no problem with coincident points), even though it may be a little slower in execution. The calculations are readily scripted in ArcGIS or in R. (It may be preferable to stay within ArcGIS if you're already set up to do that.)

Related Solutions

[GIS] How to determine in which grid cell each coordinate of an x,y series is located

To me, the simplest approach is to probably convert your XY datapoints to the polar coordinate system that defines your circular 'arena'.

Be sure to convert your XY coordinates such that the center of your circle is the origin of your Cartesian grid before converting to polar coordinates. Almost all math texts would provide these straightforward conversion equations, but this Wikipedia entry would definitely suffice.

Each of your equal area sectors are bounded by specific values for R and theta in the polar coordinate system.

Once you convert your Cartesian coordinates, simply round the calculated polar coordinates for each of your points to the nearest R and theta values that define your arena sectors.

This solution may be 'inelegant', but it strikes me as being about as straightforward as you could get.

Raster – How to Rasterize Point Data into a Grid Ensuring One Point Per Cell

as mentioned, find your smallest distance between points using dist (against a matrix of xy) and use this as the cell diagonal, to derive the resolution. This bit is important (see later)

# create some random data
df <- data.frame(x=sample(0:100,25),y=sample(0:100,25), z1 = sample(1:10,25,replace = T), z2 = sample(100:1000,25,replace = T))

# use 'dist' to create a matrix of Euclidean distances between points. 
# Must be done on an xy matrix/data.frame otherwise z will be used as height
dis <- dist(df[,c(1,2)])

# find your minimum distance between points, this is your cell diagonal
diag <- min(dis)

# convert the cell diagonal to resolution (trig)
res <- sqrt((diag^2)/2)

# convert data frame of points to points, easier to rasetrize as non-regular
pts <- df
coordinates(pts) <- ~x+y

# create a template raster:
# the extent covers the point extent
# the resolution is set by the min distance as derived from the dist matrix
r <- raster(ext=extent(pts),res=res)

# create a blank stack and loop through your attributes
st <- stack()
for(z in names(pts)){
  rasOut<-rasterize(pts, r, z)
 st <- stack(st,rasOut) 
}

The interesting thing about using the min distance between any 2 points as the cell diagonal, and not the resolution, is the effect it has on cell size. The cell diagonal is always 41.4% bigger than the side resolution so you shrink the cell size quite significantly (by exactly half in terms of area), which is quite a lot.

However if you use the min distance as the resolution, you may, on rare occasion, end up with 2 points in one cell. Consider this toy e.g.;

# data frame, just a few points 
df <- data.frame(x=c(-2.2,1,4,8,10),y=c(1,5,4,7,2), z = sample(1:10,5,replace = T))

# distance matrix
dis <- dist(df[,c(1,2)])

# set the cell resolution as the min distance between 2 points
res <- min(dis)

# convert to spatial points
pts <- df
coordinates(pts) <- ~x+y

# create blank raster
r <- raster(ext=extent(pts),res=res)

# rasterize
rasOut <- rasterize(pts, r, pts$z)

# plot and see the 2 points in 1 cell
plot(rasOut)
plot(pts,add=T)

Just as a quirk of the extent + resolution, the cell size is now not big enough to guarantee all points are in their own cell, due to the fact min distance has been used as the resolution.

I have to admit when i ran the 2nd example a number of times with a greater number of random points etc, i don't think i ever saw this theoretical outcome reproduced but the point is, it can occur. So it is better to use the 1st approach.

N.B. 1) I'm guessing your points cannot be coincident? otherwise you've got to decide on how to work these data points. 2) You might want to extend your extent so some points dont lie right on the edges but that's up to you