A data manipulation I commonly need to perform involves creating a time series by aggregating a quantity which was sampled over many overlapping time ranges.
For example, consider the following contrived data on movie times and attendance at a movie theater:
| Movie ID | Movie Start Time | Movie End Time | Attendance |
|----------+------------------+----------------+------------|
| Movie 1 | 0:00 | 2:00 | 30 |
| Movie 2 | 1:00 | 3:00 | 40 |
Treating all time intervals as half-closed on the left, like [Start time, End time)
, I'd like to compute the total attendance at the theater as a time series, i.e.,
| Time | Total Attendance |
|------+------------------|
| 0:00 | 30 |
| 1:00 | 70 |
| 2:00 | 20 |
| 3:00 | 0 |
What is this type of manipulation called? Is there a way to do this efficiently, preferably in a Python/pandas environment?
Best Answer
There is probably something that would work like 1000 times more pythonic than this solution, but it should get you to where you need to go.
The real trick is that for this to work you need to be able to transform your start/end times to actual python datetimes. And the resulting dataframe looks like this: