Python GeoDataFrame – Selecting Points Based on Random Index Location from Filtered Data

geopandas, point, python, sampling, vector

I have a large point vector file (and many others like it) in which each point has an integer value in the column "DN". However, I only need 6 points per DN value, and I want them to be selected randomly. As a final output, I need one CSV that contains 6 points per DN value.

Thus far, I have been able to successfully print out the number of points in each DN value, and generate a list of 6 random numbers that I want to use for selecting each grouping of DN value points:

import geopandas as gpd
import random

# Importing point file
point_df = gpd.read_file('/home/file_path/file.geojson')

# List of known target DN values within my point file
dn_values_list = [0, 3, 6, 8, 11, 14, 17, 19, 22, 25, 28, 31, 33, 36, 39, 42, 44, 47, 50, 53, 56, 58, 61, 64, 67, 69, 72, 75, 78, 81, 83, 86, 89, 92, 94, 97]

for dn_value in dn_values_list:
    # Filtering point file by DN value
    filtered_point_df = point_df[point_df['DN'].isin([dn_value])]
    print(str(dn_value) + " DN:", str(len(filtered_point_df)) + " points")

    # Generating random list of 6 row positions (0-based) to use for each DN value
    random_list = random.sample(range(len(filtered_point_df)), 6)
    random_list.sort()
    print(dn_value, random_list)

However, things have gotten a bit trickier now that I am trying to select and export the 6 rows for each DN value to a CSV. I technically need two filters: the first on the DN value (achieved already!) and the second on the row positions in the random number list (still need help with). I've tried it with a nested for loop but am running into errors:


    for random_integer in random_list:
        selected_point_gdf = filtered_point_df[random_integer]
        selected_point_gdf.to_csv('test.csv', index=False)

How do I select the 6 random rows for each DN value, add them to a GeoDataFrame, and export the result to CSV with lat/long, or to GeoJSON?

Best Answer

Use groupby and sample:

import geopandas as gpd
import os

output_folder = r"/home/bera/Desktop/GIStest/csvs/"
df = gpd.read_file(r"/home/bera/Desktop/GIStest/10k_points_wgs84.shp")
  
# Calculate lat and long columns from the point geometries
df["lat"] = df.geometry.y
df["lon"] = df.geometry.x

sample_size = 6

for dn, subframe in df.groupby("DN"): #For each DN value.
    #dn variable is now the value of dn, and subframe is a dataframe with all rows with that dn value
    print(dn)
    filename = f"DN_{dn}.csv" #Create an output filename
    filename = os.path.join(output_folder, filename)
    subframe.sample(n=sample_size).to_csv(filename, sep=";") 
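
The nested loop in the question fails because `filtered_point_df[random_integer]` looks up a column labelled with that integer rather than a row, and the `to_csv` call inside the loop would overwrite `test.csv` on every pass. A minimal sketch of position-based selection with `.iloc`, using a small made-up table in place of the real point file (the column values and the name `selected_points` are illustrative assumptions):

```python
import random

import pandas as pd

# Hypothetical stand-in for one DN group. The index is deliberately
# non-contiguous, as it would be after filtering a larger frame.
filtered_point_df = pd.DataFrame(
    {
        "DN": [3] * 10,
        "lon": [float(i) for i in range(10)],
        "lat": [float(i) for i in range(10, 20)],
    },
    index=list(range(100, 110)),
)

# Row positions run from 0 to len - 1, so the range starts at 0
random_list = sorted(random.sample(range(len(filtered_point_df)), 6))

# .iloc selects rows by position, regardless of the index labels
selected_points = filtered_point_df.iloc[random_list]
print(selected_points)
```

Selecting all six positions at once avoids the inner loop entirely; `sample`, as used above, does the same thing without the manual list of random numbers.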


If you want all samples in one file, you can use concat:

import pandas as pd

samples = [] # A list to hold each sample data frame
for dn, subframe in df.groupby("DN"): # For each DN value
    samples.append(subframe.sample(n=sample_size))
result = pd.concat(samples)
#result.to_csv...
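
As a possible shortcut, assuming pandas 1.1 or later, the grouped object has its own `sample` method, so the loop and `concat` can be collapsed into one call. The sketch below uses a plain pandas DataFrame with made-up `DN`, `lon` and `lat` values in place of the real point file; the same call should apply to a GeoDataFrame, though that is not verified here.

```python
import pandas as pd

# Hypothetical stand-in for the point table: three DN values with
# 10, 8 and 7 rows, plus made-up lon/lat columns
df = pd.DataFrame(
    {
        "DN": [0] * 10 + [3] * 8 + [6] * 7,
        "lon": [float(i) for i in range(25)],
        "lat": [float(i) for i in range(25)],
    }
)

sample_size = 6

# Draw sample_size random rows from every DN group in a single call.
# random_state is only set to make this illustration reproducible.
result = df.groupby("DN").sample(n=sample_size, random_state=42)

# Without a path, to_csv returns the CSV text instead of writing a file
csv_text = result.to_csv(index=False)
print(result.groupby("DN").size())
```

Passing a path to `to_csv` writes the file instead of returning a string. Note that `groupby(...).sample` raises an error when a group has fewer rows than `n`, so small groups would need `replace=True` or a smaller `n`.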