Large CSV Subsetting – How to Subset a Large CSV File Using Coordinates Columns in Python

csvgeopandasjupyter notebookpython

I have an extremely large CSV file (3GB) and want to choose a part of that based on columns(lat,lon) and do analysis on that. I can't load it into a desktop app like ArcGIS and select a region manually. I used GeoPandas for converting lat/lon columns into geometry objects and create points.

I tried:

df=pd.read_csv('file.csv')
points=df.apply(lambda row: point (row.LON, row.LAT), axis=1)

But the file is too big and Jupyter notebook gets stuck in this step (converting every row into geometry object).

Best Answer

You can read a CSV one line at a time, deal with that line, then either output it or skip to the next one.

Tutorials on reading CSV files in python using the csv module from the standard library are everywhere. First duckduckgo hit is:

https://thispointer.com/python-read-a-csv-file-line-by-line-with-or-without-header/

Here's a full working example. This will not read the whole file in, but it still might take a while to process a huge CSV. Its done in the blink of an eye on 100,000 records, a couple of seconds on 2 million records.

Given lines of CSV like this:

"x","y","N"
-21.8714018538594,-0.503809251822531,"Q"
96.0263147950172,-69.9362510349602,"U"
81.40420909971,-38.342371112667,"W"
....etc...

Then the following python:

#!/usr/bin/env python

import csv

def inbox(p):
    return float(p[0])>25 and float(p[0])<30 and \
        float(p[1])>17 and float(p[1])<19

def comma(L):
    return ",".join(L)

with open('pts.csv', 'r') as read_obj:
    csv_reader = csv.reader(read_obj)
    header = next(csv_reader)
    print(comma(header))
    for row in csv_reader:
        if inbox(row):
            print(comma(row))

will print out the rows where x is between 25 and 30, and y is between 17 and 19:

$ python filter.py 
x,y,N
25.6576492544264,17.61342425365,Q
27.0452616736293,17.7616416150704,K
25.6611929554492,18.5779095254838,X
27.1376479323953,17.3779100412503,V
28.8249599095434,18.005088083446,S

If you want to write this to another file, either redirect the output or use standard python file writing (well-documented elsewhere).