GeoPandas – Groupby Function Omitting Columns in GeoPandas

geopandas pandas python

I am using this answer to calculate some basic statistics of some points that fall within the bounds of a polygon (a vector grid), such that:

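import geopandas as gpd
from geopandas import sjoin

gridfile = 'grid.shp'
pointfile = 'points.shp'

# Load the points and the vector grid
point = gpd.GeoDataFrame.from_file(pointfile)
poly  = gpd.GeoDataFrame.from_file(gridfile)

# Attach to each point the index of the grid cell it falls in
pointInPolys = sjoin(point, poly, how='left')

# Mean of X, Y and Z per grid cell (a list of columns needs double brackets)
grouped = pointInPolys.groupby('index_right')[['X','Y','Z']].agg(['mean'])

# Flatten the two-level column names, e.g. ('X', 'mean') -> 'X_mean'
grouped.columns = ["_".join(x) for x in grouped.columns.ravel()]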

The input point data has X, Y and Z columns. However, it is only returning statistics (mean) for X and Y and the stats for the Z column are not being returned:

            X_mean     Y_mean
index_right                      
1221        -64.781242  32.439396
1902        -64.781206  32.439096
2412        -64.781169  32.438777

The data is definitely available in the prior step by checking:

pointInPolys.keys()

Index(['X', 'Y', 'Z', 'geometry', 'index_right', 'DN'], dtype='object')

Is there a reason why the Z column stats are not being calculated?

Best Answer

There is probably some non-float data in your Z column, such as the strings "NULL", "NAN" or "". That gives the column a non-numeric dtype, so the "mean" aggregator cannot be applied to it.

I created a gist with a minimal working example (using CSV data) showing that geopandas works just fine with real np.nan nulls but drops the column if it contains "NaN" strings. The mean aggregation is not applied to columns of non-numeric type (i.e. object columns); this comes from the underlying pandas groupby. Recent pandas versions (2.0 and later) no longer drop such columns silently and raise a TypeError instead. See the gist here: https://gist.github.com/jjclavijo/8b8b44fd944c9698a0c4f4a58637748b
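Before casting anything, it can help to confirm which values are the culprits. The following is a minimal sketch using plain pandas rather than geopandas; the DataFrame `df` and its values are made up for illustration and only stand in for `pointInPolys`.

```python
import pandas as pd

# Hypothetical stand-in for the joined table; the values are invented.
df = pd.DataFrame({
    "index_right": [0, 0, 1, 1],
    "X": [1.0, 2.0, 3.0, 4.0],
    "Z": ["10.5", "NaN", "NULL", ""],  # strings, so the column is not numeric
})

# A non-numeric dtype on a column that should hold numbers is the warning sign.
z_is_numeric = pd.api.types.is_numeric_dtype(df["Z"])
print(df.dtypes)
print("Z is numeric:", z_is_numeric)

# With errors="coerce", anything that cannot be parsed as a number becomes NaN.
# Selecting the original entries at those positions lists the placeholders.
parsed = pd.to_numeric(df["Z"], errors="coerce")
bad_values = df.loc[parsed.isna(), "Z"].unique().tolist()
print(bad_values)
```

Running the same two checks on `pointInPolys['Z']` should show whether the column holds placeholder strings and which ones.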

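To solve this, after reviewing your data, cast the column to float. Note that astype(float) parses numeric strings and "NaN" into floats, but raises a ValueError on strings such as "NULL" or "". For those, pd.to_numeric(pointInPolys.Z, errors='coerce') converts every value that cannot be parsed to np.nan.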

Before the cast, Z is missing from the result:

grouped = pointInPolys.groupby('index_right')[['X','Y','Z']].agg(['mean'])
grouped

                     X         Y
                  mean      mean
index_right
0            -5.923750 -4.268750
1            32.738333  2.204000
2            32.669667 -5.528667

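After the cast, the mean of Z is computed as well:

# Assign the column directly; .loc[:, 'Z'] = ... may keep the old object dtype in newer pandas
pointInPolys['Z'] = pointInPolys['Z'].astype(float)
grouped = pointInPolys.groupby('index_right')[['X','Y','Z']].agg(['mean'])
grouped

                     X         Y          Z
                  mean      mean       mean
index_right
0            -5.923750 -4.268750  609.49575
1            32.738333  2.204000  645.05100
2            32.669667 -5.528667  483.71250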
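When the column holds placeholders that float() cannot parse (for example "NULL"), a coercing conversion is a more forgiving alternative to astype(float). This is a minimal sketch with plain pandas; `df` and its values are hypothetical stand-ins for `pointInPolys`, and the choice of pd.to_numeric is a suggestion rather than something taken from the gist.

```python
import pandas as pd

# Hypothetical stand-in for the joined table; the values are invented.
df = pd.DataFrame({
    "index_right": [0, 0, 1, 1],
    "X": [1.0, 2.0, 3.0, 4.0],
    "Z": ["10.0", "20.0", "NULL", "30.0"],
})

# Convert Z to float; entries that cannot be parsed ("NULL") become NaN.
df["Z"] = pd.to_numeric(df["Z"], errors="coerce")

# mean skips NaN, so the group containing "NULL" averages its remaining values.
grouped = df.groupby("index_right")[["X", "Z"]].agg(["mean"])

# Flatten the two-level column names, e.g. ('Z', 'mean') -> 'Z_mean'.
grouped.columns = ["_".join(col) for col in grouped.columns]
print(grouped)
```

Because coercion silently turns bad entries into NaN, it is worth counting them first (for example with `df["Z"].isna().sum()` after the conversion) so that missing data does not go unnoticed.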

Related Solutions

[GIS] GeoPandas GeoDataFrame plot statistics – how

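The GeoPandas plotting methods are there for convenience, and they override the standard pandas plotting methods. The easiest way to get access to basic pandas plotting is probably to convert the GeoDataFrame to a plain pandas DataFrame first, something like this:

import pandas as pd
# pandas.tools.plotting (and its plot_frame function) was removed from later pandas releases;
# a plain DataFrame exposes the standard pandas plotting methods instead.
pd.DataFrame(mygeodataframe).plot(kind='scatter', ...)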

Feel free to create an issue about this on the GitHub repository (or better yet, help implement a solution).

[GIS] Geopandas performance appears quite slow

You probably use an index in your database, but your Python code does not use one (the rtree module might help: http://geoffboeing.com/2016/10/r-tree-spatial-index-python/). This can be a big issue depending on your geometries. Do many points fall into your buffers? Try timing each step to see where the time is spent. I guess it will be in the distance < 402 part.

The second thing is that geopandas is quite new, and I am not sure how its functions are implemented. Usually such a library is a wrapper around C code, as pure Python would otherwise be really slow. PostGIS is older, so it has had more time for refactoring, and it runs entirely in C. Also, the way databases work (memory pages at row level) is optimized for speed when searching for rows (objects).
