Output an attribute table containing reformatted text from an input table

attribute-tableformattingqgis

At 3.34.1, I have a Geopackage attribute table containing a single text column with the following form:

Alaska_2021_05_01_to_2021_05_03
Hawaii_2022_11_15_to_2022_11_16

As you can see, each record contains three parts: location, starting date, ending date.

I need to output this to a .csv table, containing two columns, as follows (columns are separated here for clarity):

column1,   column2
Alaska_2021_05_01_to_2021_05_03,   2021_05_01
Alaska_2021_05_01_to_2021_05_03,   2021_05_02
Alaska_2021_05_01_to_2021_05_03,   2021_05_03
Hawaii_2022_11_15_to_2022_11_16,   2022_11_15
Hawaii_2022_11_15_to_2022_11_16,   2022_11_16

Here,

  • There are as many records as there are elapsed days in each input record.
  • The first column is simply a repeat of the input table's value.
  • The second column lists the date sequence, from starting to ending date.

Using QGIS date expressions, I am able to determine the number of elapsed days, which I presume could be used to calculate the number of output records. However, that is the limit of my expressions knowledge. A review of the QGIS docs, plus Youtube, did not provide help.

How do I output a table containing the reformatted text as shown?

EDIT:

I modified @Ben W's Python code to accommodate location names containing one or more underscores (such as New_York), as follows:

  • I commented out his line:
    location = attr.split('_')[0]
  • and replaced it with the following two lines:
    index_position = (attr.index('_to_')- 11)
    location = attr[0:index_position]

Here’s how the two lines work:

  1. As each input record is iterated through, the first code line finds the numeric index position of the string ‘_to_’ (ignore apotrophes). Because this string is in every input line, and is the only consistently unique string for every record, it serves as an identifying reference. The index value is then reduced by 11 to account for the preceeding string ‘_yyyy_MM_dd’. The reduced value, stored in the variable index_position, represents the ending position of the location name.
  2. The second line of code left-trims the input (attr) by selecting the number of characters derived from index_position, and outputs the selected text (such as New_York) to the variable location, which is then sent to the output file. Location can now contain one or more underscores.

Best Answer

I would use PyQGIS along with Python's datetime and csv modules for this. Copy and paste the script below into a new editor in the Python console. You will just need to edit the output path for the CSV file and the name of the text field in your GeoPackage (see comments in the code), then select your gpkg as the active layer and run the script.

from datetime import datetime, timedelta

import csv

# Edit the line below to the path where you want your output csv written
output_csv = 'Path/To/Attributes.csv'

# Edit below to match the name of the text field in your GeoPackage
fld_name = 'Your Field Name'

lyr = iface.activeLayer()

csv_tbl = open(output_csv, mode='w', newline='')
writer = csv.writer(csv_tbl)
writer.writerow(['Column1', 'Column2'])

for ft in lyr.getFeatures():
    attr = ft[fld_name]
    start_date_txt = attr.split('to')[0][-11:][:-1]
    start_date = datetime.strptime(start_date_txt, '%Y_%m_%d').date()
    writer.writerow([attr, start_date_txt])
    end_date_txt = attr.split('to')[1][1:]
    end_date = datetime.strptime(end_date_txt, '%Y_%m_%d').date()
    gaps = end_date - start_date
    gap_days = gaps.days-1
    day_count = 1
    for i in range(gap_days):
        fill_date = start_date+timedelta(days=day_count)
        fill_date_txt = datetime.strftime(fill_date, '%Y_%m_%d')
        writer.writerow([attr, fill_date_txt])
        day_count+=1
    writer.writerow([attr, end_date_txt])
    
csv_tbl.close()
del writer

print('Done')

I tested with this dummy geopackage (one string field containing the information).

enter image description here

After running the above script (shown here in the Python console)

enter image description here

The output CSV file was produced:

enter image description here

Edit:

To adapt the script to write the location to a third column, we just need to extract it from the attribute string. In this case we can simply split the string with the underscore character and access the first element (index 0) of the list returned by the split() method. We then add it to the list of values passed to each writerow() call on the csv writer object.

I have also added comments to the adapted code to help you step through the logic. For what it's worth, the string manipulation could probably be done more 'slickly' with Regexp but I'm far from competent with Regexp so here we are!

from datetime import datetime, timedelta

import csv

# Edit the line below to the path where you want your output csv written
output_csv = 'Path/To/Attributes.csv'

# Edit below to match the name of the text field in your GeoPackage
fld_name = 'Your Field Name'

# Get the active layer
lyr = iface.activeLayer()

# Open the CSV table for writing
csv_tbl = open(output_csv, mode='w', newline='')

# Create a csv writer object
writer = csv.writer(csv_tbl)

# Write the column headers
writer.writerow(['Column1', 'Column2', 'Column3'])

# Iterate over the geopackage layer features
for ft in lyr.getFeatures():
    # Get the features value from the relevant field
    attr = ft[fld_name]
    # Split out the location part of the string
    location = attr.split('_')[0]
    # Split out the start date as a string using split() plus string slicing
    start_date_txt = attr.split('to')[0][-11:][:-1]
    # convert the start date text to an actual Python date object
    start_date = datetime.strptime(start_date_txt, '%Y_%m_%d').date()
    # Write row to csv
    writer.writerow([attr, start_date_txt, location])
    # Split out end date as a string
    end_date_txt = attr.split('to')[1][1:]
    # Convert to Python date object
    end_date = datetime.strptime(end_date_txt, '%Y_%m_%d').date()
    # Subtract start date from end date to calculate difference in days
    gaps = end_date - start_date
    # Subtract one day to get number of dates to fill in
    gap_days = gaps.days-1
    # Initialize day counter
    day_count = 1
    # Create the number of new date objects required adding 1 day on each iteration
    for i in range(gap_days):
        fill_date = start_date+timedelta(days=day_count)
        # Convert the date back to a string separated by underscores
        fill_date_txt = datetime.strftime(fill_date, '%Y_%m_%d')
        # Write row to csv
        writer.writerow([attr, fill_date_txt, location])
        # Increment the day counter
        day_count+=1
    # Write row to csv
    writer.writerow([attr, end_date_txt, location])

# Close the csv table and delete the writer object to release the csv file
csv_tbl.close()
del writer

print('Done')