MATLAB: Filtering max values within every 3 hours of data

cell arraysfiltermaximum

Hey everyone,
I attached the first 50 rows of a sample dataset. My time data is in the format of '2000-01-01 01:00' where the last 5 characters hold the hour (HH) and minute (MM) values respectively. I would like to have a code that goes through the time column, and assigns the maximum value of the second column between the 01,02,03 hours to the 03 hour timestep. I need to then have this done for the 04 – 06 time period and so forth until 22 – 00.
At the end the cell array would not need to keep data from other columns and should just be XXXX x 2 cell array and just reassigns the maximums to the 03,06,09 …… 00 timesteps. There may also be missing hours, so the script would need to accommodate for that.
example:
'1984-01-01 00:00' 10.80 240 20.06 231 0.8
'1984-01-01 01:00' 11.83 230 20.06 6 0.6
'1984-01-01 02:00' 11.31 230 18.51 100 0.6
'1984-01-01 03:00' 10.80 230 18.51 212 0.7
'1984-01-01 04:00' 9.77 240 18.65 348 0.8
'1984-01-01 05:00' 9.25 240 16.97 406 0.8
'1984-01-01 06:00' 9.77 230 17.48 524 0.7
'1984-01-01 07:00' 9.77 230 16.97 612 0.7
would then become:
'1984-01-01 03:00' 11.83
'1984-01-01 06:00' 9.77
I am completely lost in writing this and any help would be appreciated!

Best Answer

You could build an approach along the following line:
% - Build array of time components.


dv = datevec( datatoupload(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build logical index of relevant/boundary hours.
hrId = rem( dv(:,4), 3 ) == 0 ;
hrId(end) = true ; % Last entry always relevant.
% - Build vector of (time-) group IDs.


hrGrId = cumsum( [true; hrId(1:end-1)] ) ;
% - Get statistics per group.


grMax = accumarray( hrGrId, cell2mat(datatoupload(:,2)), [], @max ) ;
% - Build output.


output = [datatoupload(hrId,1), num2cell(grMax)] ;
Applied to your MAT file, this gives:
>> output
output =
'1984-01-01 00:00' [10.8024]
'1984-01-01 03:00' [11.8312]
'1984-01-01 06:00' [ 9.7736]
'1984-01-01 09:00' [ 9.7736]
'1984-01-01 12:00' [ 8.7448]
'1984-01-01 15:00' [ 8.7448]
'1984-01-01 18:00' [ 9.2592]
'1984-01-01 21:00' [ 7.2016]
'1984-01-02 00:00' [ 5.1440]
'1984-01-02 03:00' [ 4.6296]
'1984-01-02 06:00' [ 3.6008]
'1984-01-02 09:00' [ 2.5720]
'1984-01-02 12:00' [ 7.2016]
'1984-01-02 15:00' [10.2880]
'1984-01-02 18:00' [11.8312]
'1984-01-02 21:00' [12.3456]
'1984-01-03 00:00' [10.2880]
'1984-01-03 01:00' [10.8024]
EDIT 1: the following solution should work with missing data, but you'll have to understand it and fully test it. In the beginning, I am building a test data set with missing data.
load( 'masao_1.mat' ) ;
data = datatoupload ;
% - Remove rows so for day 1:

% > Hours 4,5,6 have only data for hours 5,6.

% > Hours 10,11,12 have only data for hours 10, 11.

% > Hours 13,14,15 have no data.

data([16,15,14,13,5], :) = [] ;
% - Build array of time components.
dv = datevec( data(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build vector of (time-) group IDs.
% 1. Absolute hour ID.

hrId = dv(:,4:-1:1) * cumprod( [1; 24; 32; 12] ) - 1 ;
% 2. Contiguous group start pos.

grStart = [true; diff(floor(hrId/3)) ~= 0] ;
% 3. Group start pos.
grEnd = [grStart(2:end); true] ;
% 4. Contiguous group IDs.
grId = cumsum( grStart ) ;
% - Get statistics per group.
grMax = accumarray( grId, cell2mat(data(:,2)), [], @max ) ;
% - Build output.
output = [data(grEnd,1), num2cell(grMax)] ;
Looking at the output
>> output
output =
'1984-01-01 00:00' [10.8024]
'1984-01-01 03:00' [11.8312]
'1984-01-01 06:00' [ 9.7736]
'1984-01-01 09:00' [ 9.7736]
'1984-01-01 11:00' [ 8.7448]
'1984-01-01 18:00' [ 9.2592]
'1984-01-01 21:00' [ 7.2016]
'1984-01-02 00:00' [ 5.1440]
'1984-01-02 03:00' [ 4.6296]
'1984-01-02 06:00' [ 3.6008]
'1984-01-02 09:00' [ 2.5720]
'1984-01-02 12:00' [ 7.2016]
'1984-01-02 15:00' [10.2880]
'1984-01-02 18:00' [11.8312]
'1984-01-02 21:00' [12.3456]
'1984-01-03 00:00' [10.2880]
'1984-01-03 01:00' [10.8024]
you see that it seems to be working when missing data hit boundary hours or not, and when they cover full period(s).
EDIT 2: the following solution removes data from groups/periods which don't include 3 entries/measures. Again, you have to fully understand it and test it. You may find an easier way to eliminate entries, I didin't really optimize.
load( 'masao_1.mat' ) ;
data = datatoupload ;
% - Remove rows so for day 1:
% > Hours 4,5,6 have only data for hours 5,6.
% > Hours 10,11,12 have only data for hours 10, 11.
% > Hours 13,14,15 have no data.
data([16,15,14,13,5], :) = [] ;
% - Build array of time components.
dv = datevec( data(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build vector of (time-) group IDs.
% 1. Absolute hour ID.
hrId = dv(:,4:-1:1) * cumprod( [1; 24; 32; 12] ) - 1 ;
% 2. Contiguous group start pos.
grStart = [true; diff(floor(hrId/3)) ~= 0] ;
% 3. Contiguous group IDs, 1st shot, any size taken into account.
grId = cumsum( grStart ) ;
% 4. Flag data which do not belong to a full group (= 3 entries).
grIsValid = accumarray( grId, ones(size(grId)) ) == 3 ;
entryIsValid = ismember( grId, find(grIsValid) ) ;
% 5. Eliminate invalid entries, rebuild valid group starts and IDs.
data(~entryIsValid,:) = [] ;
grStart(~entryIsValid,:) = [] ;
grId = cumsum( grStart ) ;
% - Get statistics per group.
grMax = accumarray( grId, cell2mat(data(:,2)), [], @max ) ;
% - Build output.
grEnd = [grStart(2:end); true] ;
output = [data(grEnd,1), num2cell(grMax)] ;
With that, we obtain..
>> output
output =
'1984-01-01 03:00' [11.8312]
'1984-01-01 09:00' [ 9.7736]
'1984-01-01 18:00' [ 9.2592]
'1984-01-01 21:00' [ 7.2016]
'1984-01-02 00:00' [ 5.1440]
'1984-01-02 03:00' [ 4.6296]
'1984-01-02 06:00' [ 3.6008]
'1984-01-02 09:00' [ 2.5720]
'1984-01-02 12:00' [ 7.2016]
'1984-01-02 15:00' [10.2880]
'1984-01-02 18:00' [11.8312]
'1984-01-02 21:00' [12.3456]
'1984-01-03 00:00' [10.2880]
which don't contain summaries when fewer than 3 entries are available.