Best way to aggregate a set of observations over overlapping time ranges into a time series

data transformation, time series

A data manipulation I commonly need to perform involves creating a time series by aggregating a quantity which was sampled over many overlapping time ranges.

For example, consider the following contrived data on movie times and attendance at a movie theater:

| Movie ID | Movie Start Time | Movie End Time | Attendance |
|----------+------------------+----------------+------------|
| Movie 1  |             0:00 |           2:00 |         30 |
| Movie 2  |             1:00 |           3:00 |         40 |

Treating all time intervals as half-open, closed on the left like [Start time, End time), I'd like to compute the total attendance at the theater as a time series:

| Time | Total Attendance |
|------+------------------|
| 0:00 |               30 |
| 1:00 |               70 |
| 2:00 |               40 |
| 3:00 |                0 |

What is this type of manipulation called? Is there a way to do this efficiently, preferably in a Python/pandas environment?

Best Answer

There is probably a more Pythonic way to do this, but the following solution should get you where you need to go.

import pandas as pd

#Recreating your example data. Note the addition of dates. I'm assuming you really have timestamps in your data.
df=pd.DataFrame()
df['Movie ID']=[1,2]
df['Start']=["06/01/2017 0:00","06/01/2017 1:00"]
df['End']=["06/01/2017 2:00","06/01/2017 3:00"]
df['Attendance']=[30,40]
df['Start']=pd.to_datetime(df['Start'])
df['End']=pd.to_datetime(df['End'])

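#This is what actually does the work: add the attendance in the hour each movie starts,
#subtract it in the hour each movie ends, then take a running total.
#Note: pd.TimeGrouper is no longer available in current pandas; pd.Grouper(freq='h') replaces it.
df1=df.copy()
df2=df.copy()
df1.index=df1['Start']
df2.index=df2['End']
arrivals=df1.groupby(pd.Grouper(freq='h'))['Attendance'].sum()
departures=df2.groupby(pd.Grouper(freq='h'))['Attendance'].sum()
df_final=arrivals.subtract(departures,fill_value=0).cumsum()

#Just displaying the resulting Series.
print(df_final)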

The real trick is that, for this to work, your start/end times must be converted to actual datetimes (done above with pd.to_datetime). The resulting Series looks like this:

2017-06-01 00:00:00    30.0
2017-06-01 01:00:00    70.0
2017-06-01 02:00:00    40.0
2017-06-01 03:00:00     0.0
Freq: H, Name: Attendance, dtype: float64
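
The same idea can also be written without hourly grouping, which may help when start and end times do not fall on hour boundaries. The following is only a sketch under assumptions: the date is arbitrary (as above), and the names `events` and `total` are illustrative, not part of the original answer. Each interval [Start, End) contributes +Attendance at its start and -Attendance at its end; summing the changes per timestamp and taking a cumulative sum gives the total at each change point.

```python
import pandas as pd

# Same contrived data as in the question; the date itself is an arbitrary assumption.
df = pd.DataFrame({
    "Movie ID": [1, 2],
    "Start": pd.to_datetime(["2017-06-01 00:00", "2017-06-01 01:00"]),
    "End": pd.to_datetime(["2017-06-01 02:00", "2017-06-01 03:00"]),
    "Attendance": [30, 40],
})

# Each interval [Start, End) adds its attendance at Start and removes it at End.
events = pd.concat([
    pd.Series(df["Attendance"].to_numpy(), index=df["Start"]),
    pd.Series(-df["Attendance"].to_numpy(), index=df["End"]),
])

# Net change per distinct timestamp, then a running total in time order.
total = events.groupby(level=0).sum().sort_index().cumsum()
print(total)
```

Unlike the hourly version, this produces one row per timestamp at which attendance changes, so gaps are not filled in. If a regular grid is needed, something like `total.resample('h').ffill()` should carry the last known total forward.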