Specifying dtype of columns when reading in data with GeoPandas

geopandas python

I have an ESRI geodatabase with the following attribute table structure (just a small toy example; my geodatabase consists of several million features):

esri_gdf
       UID  Value                 geometry
0  P1_2021   1.01  POINT (1.00000 2.00000)
1  P2_2024   2.52  POINT (2.00000 1.00000)
2  P3_2035   3.24  POINT (3.00000 5.00000)

The first column of the attribute table (UID) contains strings (dtype object) and the second column (Value) is of dtype float64.

esri_gdf.info(verbose=True, memory_usage='deep')

<class 'geopandas.geodataframe.GeoDataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 3 columns):
     #   Column    Non-Null Count  Dtype   
    ---  ------    --------------  -----   
     0   UID       3 non-null      object  
     1   Value     3 non-null      float64 
     2   geometry  3 non-null      geometry
    dtypes: float64(1), geometry(1), object(1)
    memory usage: 368.0 bytes

I would like to convert the column UID to categorical and the column Value to float32 in order to use less memory (please remember: I have several million features in my data!). I can convert the dtype of a column after reading, like this:

import geopandas as gpd

# read in file
esri_gdb_path = r'/MyProject/Data.gdb/somedata'
esri_gdf = gpd.read_file(esri_gdb_path)

# change dtype of column UID from string (object) to category
esri_gdf['UID'] = esri_gdf['UID'].astype('category')

# change dtype of column Value from float64 to float32
esri_gdf['Value'] = esri_gdf['Value'].astype('float32')

Is there a way to directly change the dtype of the columns when reading in the data in GeoPandas?

Specifying a dtype option as in pandas, by passing a dictionary of dtypes, seems to have no effect; the dtypes stay the same.

import geopandas as gpd

# read in file, passing the desired dtypes (this has no effect)
esri_gdb_path = r'/MyProject/Data.gdb/somedata'
dtype_dict = {'UID':'category', 'Value':'float32', 'geometry':'geometry'}
esri_gdf = gpd.read_file(esri_gdb_path, dtype=dtype_dict)

Best Answer

I believe the answer to this is "No": there is currently no way to specify the dtype of the columns when reading data with GeoPandas. From looking at the source code, it seems clear there is no hook for it.

It's worth understanding why this is not possible. GeoPandas tries to align with pandas behaviour in many ways, so why not here? Because the file-reading operations are not implemented in GeoPandas itself but in underlying Python libraries such as Fiona. Those libraries are responsible for creating and iterating over the data structures, but they don't provide dtype options because they're not specialised for pandas or even NumPy.
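Since the readers expose no dtype hook, the conversion has to happen right after the read. Below is a minimal sketch of that step. The helper name `apply_dtypes` is hypothetical, and a plain pandas DataFrame stands in for the frame that `gpd.read_file(...)` would return. Leaving the geometry column out of the mapping is an assumption made here so that it is never cast.

```python
import pandas as pd


def apply_dtypes(frame, dtypes):
    """Cast only the columns named in `dtypes`; all other columns are left untouched.

    `apply_dtypes` is a hypothetical helper, not part of pandas or GeoPandas.
    """
    # Ignore mapping entries for columns that are not in the frame.
    present = {col: dt for col, dt in dtypes.items() if col in frame.columns}
    return frame.astype(present)


# Stand-in for the attribute table from the question (assumption: with
# GeoPandas this frame would come from gpd.read_file(...)).
frame = pd.DataFrame(
    {"UID": ["P1_2021", "P2_2024", "P3_2035"], "Value": [1.01, 2.52, 3.24]}
)

converted = apply_dtypes(frame, {"UID": "category", "Value": "float32"})
print(converted.dtypes)
```

Note that this does not lower peak memory: the full-precision frame still exists until the conversion has finished.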

If you have difficulties loading the full data file in GeoPandas, one workaround, if your data are in tabular format, is to (a) load the data using plain pandas and then (b) join the geometries on (using GeoPandas) after you've done the initial preprocessing.
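A rough sketch of that two-step workaround follows. It assumes the attribute table has been exported to CSV with `x` and `y` coordinate columns; the inline CSV text and those column names are hypothetical and are not taken from the question's geodatabase. No CRS is set, because the source does not state one.

```python
import io

import geopandas as gpd
import pandas as pd

# Hypothetical CSV export of the attribute table with x/y coordinate columns.
csv_text = """UID,Value,x,y
P1_2021,1.01,1.0,2.0
P2_2024,2.52,2.0,1.0
P3_2035,3.24,3.0,5.0
"""

# (a) pandas applies the requested dtypes while parsing the file.
table = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"UID": "category", "Value": "float32"},
)

# (b) Build the point geometries afterwards and attach them with GeoPandas.
gdf = gpd.GeoDataFrame(
    table.drop(columns=["x", "y"]),
    geometry=gpd.points_from_xy(table["x"], table["y"]),
)
print(gdf.dtypes)
```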

Related Solutions

[GIS] Reading raw data into geopandas

You can pass the json directly to the GeoDataFrame constructor:

import geopandas as gpd
import requests
data = requests.get("https://data.cityofnewyork.us/api/geospatial/arq3-7z49?method=export&format=GeoJSON")
gdf = gpd.GeoDataFrame(data.json())
gdf.head()

Outputs:

                                            features               type
0  {'type': 'Feature', 'geometry': {'type': 'Poin...  FeatureCollection
1  {'type': 'Feature', 'geometry': {'type': 'Poin...  FeatureCollection
2  {'type': 'Feature', 'geometry': {'type': 'Poin...  FeatureCollection
3  {'type': 'Feature', 'geometry': {'type': 'Poin...  FeatureCollection
4  {'type': 'Feature', 'geometry': {'type': 'Poin...  FeatureCollection
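In the output above, each GeoJSON feature is still a raw dictionary in the `features` column, so there is no geometry column yet. `GeoDataFrame.from_features` can parse the features into one row each. A minimal sketch, using a small inline FeatureCollection (hypothetical data) in place of the downloaded `data.json()`:

```python
import geopandas as gpd

# Hypothetical inline FeatureCollection standing in for data.json().
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "a"},
            "geometry": {"type": "Point", "coordinates": [1.0, 2.0]},
        },
        {
            "type": "Feature",
            "properties": {"name": "b"},
            "geometry": {"type": "Point", "coordinates": [2.0, 1.0]},
        },
    ],
}

# One row per feature; properties become columns. GeoJSON coordinates are
# WGS 84 longitude/latitude, hence EPSG:4326.
gdf = gpd.GeoDataFrame.from_features(collection["features"], crs="EPSG:4326")
print(gdf)
```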

For supported single-file formats or zipped shapefiles, you can use fiona.BytesCollection and GeoDataFrame.from_features:

import requests
import fiona
import geopandas as gpd

url = 'http://www.geopackage.org/data/gdal_sample.gpkg'
request = requests.get(url)
b = bytes(request.content)
with fiona.BytesCollection(b) as f:
    crs = f.crs
    gdf = gpd.GeoDataFrame.from_features(f, crs=crs)
    print(gdf.head())

And for zipped shapefiles (supported as of Fiona 1.7.2):

url = 'https://www2.census.gov/geo/tiger/TIGER2010/STATE/2010/tl_2010_31_state10.zip'
request = requests.get(url)
b = bytes(request.content)
with fiona.BytesCollection(b) as f:
    crs = f.crs
    gdf = gpd.GeoDataFrame.from_features(f, crs=crs)
    print(gdf.head())

You can find out what formats Fiona supports using something like:

import fiona
for name, access in fiona.supported_drivers.items():
    print('{}: {}'.format(name, access))

And a hacky workaround for reading in-memory zipped data in fiona 1.7.1 or earlier:

import requests
import uuid
import fiona
import geopandas as gpd
from osgeo import gdal

request = requests.get('https://github.com/OSGeo/gdal/blob/trunk/autotest/ogr/data/poly.zip?raw=true')
vsiz = '/vsimem/{}.zip'.format(uuid.uuid4().hex)  # gdal/ogr requires a .zip extension

gdal.FileFromMemBuffer(vsiz, bytes(request.content))
with fiona.Collection(vsiz, vsi='zip', layer='poly') as f:
    gdf = gpd.GeoDataFrame.from_features(f, crs=f.crs)
    print(gdf.head())

[GIS] How to correctly reproject a geodataframe with multiple geometry columns

In a GeoDataFrame, the CRS is stored at the level of the GeoDataFrame, not of the individual GeoSeries (as of version 0.7.0; there is a discussion about changing this). At the moment, I think that your solution of reprojecting each GeoSeries and then assigning them to the GeoDataFrame is the best one, although admittedly not very elegant. Feel free to express your thoughts on it on GitHub: https://github.com/geopandas/geopandas/issues/1193
