MATLAB: Australian Bureau of Meteorology (BOM): Extracting Data from historic Local Waters Forecast

Tags: australia, bom, bureau of meteorology, data, forecast, marine forecast, regexp, regular expressions, search, seas, swell, text file, xch10syd

Dear Matlab Forums,
As part of my thesis I'm investigating the accuracy of BOM's forecast data. I'm trying to see how accurate their maximum predictions are, to determine whether evasive action may need to be taken for fixed marine structures. I have the historic wavebuoy data, but want to know if BOM's forecasts have been accurate.
BOM is absolutely fantastic at providing data in erratic text format and seems not to care much about useable CSV formats – it's killing me.
What I want from the data:
  • The date
  • The most far reaching swell and seas size forecasts
The data comes in the following formats:
*11:30 28/05/2008* 100398833 CGWEB=11:30
*AIFS_ID=11400* CGWEB
IDW11400
Australian Government Bureau of Meteorology
Western Australia
Local Waters Forecast
Yanchep to Mandurah and Offshore to Rottnest Island
Issued at 11:30 am WST on Wednesday 28 May 2008
Valid until midnight Friday
Please Be Aware
Wind gusts can be 40 percent stronger than the averages given here, and maximum
wave may be up to twice the height.
Warnings
Nil.
Synoptic Situation
A moderate cold front is currently passing over Perth. Fresh W/SW winds behind
front will ease during the afternoon and evening.
Forecasts:
Wednesday until midnight: W/SW winds 18/23 knots easing to 13/18 knots during
the afternoon and becoming SW'ly 10/15 knots in the evening. Seas 1.5m to 2.0m.
Swell rising to 3.0m.
Swell at Cottesloe: rising to 1.0m.
Winds on Melville Water: similar.
Thursday: S/SW winds 8/13 knots tending S/SE 5/10 knots in the evening. Seas to
1.0m. Swell to 3.0m easing later.
Friday: E/NE winds 8/13 knots tending N/NE 10/15 knots in the evening and
increasing to N/NE 15/20 knots towards midnight.
Current Swell Observations:
Rottnest Waverider Buoy: 1.7m
Cottesloe Waverider Buoy: 0.6m
Current swell height information is supplied by the Department for Planning and
Infrastructure and is current only at the time of issue of this forecast
The next routine forecast will be issued at 4:30 pm WST Wednesday.
*16:30 28/05/2008* 100444518 CRAFA=16:30 CGFCS=16:30
XCH10SYD-0296501221=16:31 ENOVAFM=16:30 CGIDF=16:30
*AIFS_ID=11400* CRAFA CGFCS XCH10SYD ENOVAFM PROD CGIDF
IDW11400
Australian Government Bureau of Meteorology
Western Australia
Local Waters Forecast
Yanchep to Mandurah and Offshore to Rottnest Island
Issued at 4:30 pm WST on Wednesday 28 May 2008
Valid until midnight Saturday
Please Be Aware
Wind gusts can be 40 percent stronger than the averages given here, and maximum
wave may be up to twice the height.
Warnings
Nil.
Synoptic Situation
A moderate cold front is currently passing over Perth. Fresh W/SW winds behind
front will ease during the evening.
Forecasts:
Wednesday until midnight: W/SW winds 15/20 knots easing to SW'ly 10/15 knots
during the evening. Seas 1.0m to 1.5m. Swell to 2.5m to 3.5m.
Swell at Cottesloe: to 1.0m.
Winds on Melville Water: similar.
Thursday: S'ly winds 8/13 knots tending SE'ly 8/13 knots in the evening. Inshore
winds tending E/SE 5/10 knots for a period early to mid morning. Seas to 1.0m.
Swell 2.5m to 3.0m
Swell at Cottesloe: to 1.0m.
Winds on Melville Water: will be similar.
Friday: E'ly winds 8/13 knots tending NE'ly 10/15 knots towards midnight. Seas
to 1.0m. Swell to 2.0m, easing.
Saturday: N'ly winds 13/18 knots increasing to NW'ly 20/25 knots during the
morning.
Current Swell Observations:
Rottnest Waverider Buoy: 2.2m
Cottesloe Waverider Buoy: 0.8m
Current swell height information is supplied by the Department for Planning and
Infrastructure and is current only at the time of issue of this forecast
The next routine forecast will be issued at 11:30 pm WST Wednesday.
I've noticed that the XCH10SYD code appears whenever a swell and seas forecast is produced, which makes it a great identifier for the information I'm looking for. I'm therefore trying to get my program to search through the 96,000 lines of ".txt" for the "XCH10SYD" classifier. When it's found, the program should save the relevant time and date (listed a few lines above), then the date of the furthest forecast (in this example it's Friday) and the associated maximum swell and seas figures.
Things to note:
  • Sometimes seas/swell are listed as "Seas 1.0m", other times as "Seas to 1.0m", and sometimes as "Seas 1.0m to 2.0m". In the latter case, I'm only interested in the maximum value.
  • When a particularly long forecast is produced, the number of lines of text changes, i.e. the code can't be hard-coded to extract data from a particular position; it has to actively search for the numerical data.
  • The HH:MM DD/MM/YYYY and AIFS identifiers seem to be consistent in their location and format (perhaps the only consistent aspect of the txt file).
If anyone can even provide advice on where to start, it would be much appreciated. I'm really not a pro in this field, but keen to learn. This is just way beyond my current skillset. Thank you!

Best Answer

This is a good candidate for regular expressions and pattern matching. Reading section 2-23 of MATLAB Programming Fundamentals would take about an hour (plus quite a bit of experimenting) and is enough to get started with regular expressions. I develop an example below; it is probably not exactly what you need, because your sample is too short for me to experiment with, but I can help you refine the approach.
Assuming that all relevant blocks start with a date/time with the following structure
*11:30 28/05/2008*
we first split the file content in blocks, using this structure as a separator:
% - Read file content in one shot.
content = fileread( 'data.txt' ) ;
% - Split time/data blocks.
pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*' ;
[timeBlocks, dataBlocks] = regexp( content, pattern, 'match', 'split' ) ;
dataBlocks(1) = [] ;
It outputs a cell array of time stamps and a matching cell array of block data:
>> timeBlocks
timeBlocks=
'*11:30 28/05/2008*' '*16:30 28/05/2008*'
>> dataBlocks
dataBlocks =
[1x1459 char] [1x1458 char]
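If only the blocks tagged XCH10SYD are of interest, as the question suggests, one possible filter is sketched here. The two sample cell arrays are shortened stand-ins for the real split output, and the names hasId, selTimes and selBlocks are made up for this illustration.

```matlab
% Shortened, hypothetical stand-ins for the output of the split above.
timeBlocks = { '*11:30 28/05/2008*', '*16:30 28/05/2008*' } ;
dataBlocks = { ' 100398833 CGWEB=11:30', ...
               ' 100444518 CRAFA=16:30 XCH10SYD-0296501221=16:31' } ;

% STRFIND on a cell array returns one index list per cell; a non-empty
% list means the identifier occurs somewhere in that block.
hasId = ~cellfun( @isempty, strfind( dataBlocks, 'XCH10SYD' )) ;

% Keep time stamps and block texts aligned by indexing both with the mask.
selTimes  = timeBlocks(hasId) ;
selBlocks = dataBlocks(hasId) ;
```

The later steps would then run on selTimes and selBlocks instead of the full arrays. Whether XCH10SYD really marks every forecast you care about is something to check against the full file.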
Then we convert date/time data into whatever you need, using e.g. DATEVEC or DATENUM:
dateTime = datevec( timeBlocks, '*HH:MM dd/mm/yyyy' ) ;
which outputs
>> dateTime
dateTime =
2008 5 28 11 30 0
2008 5 28 16 30 0
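On releases that have the datetime type (R2014b and later), the same conversion could be sketched as follows. This is an alternative to DATEVEC, not part of the approach above; the asterisks are stripped first, and the variable names stamps and issued are only illustrative.

```matlab
% Time stamps as returned by the split (copied from the sample output).
timeBlocks = { '*11:30 28/05/2008*', '*16:30 28/05/2008*' } ;

% Remove the surrounding asterisks so only 'HH:MM dd/mm/yyyy' text remains.
stamps = strrep( timeBlocks, '*', '' ) ;

% Note the datetime format letters: mm = minutes, MM = month.
issued = datetime( stamps, 'InputFormat', 'HH:mm dd/MM/yyyy' ) ;
```

A datetime array can be sorted, subtracted and compared directly, which may be convenient when matching forecasts against wavebuoy time series.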
Finally, we iterate through data blocks and extract data
nBlocks = numel( dataBlocks ) ;
data = cell( nBlocks, 1 ) ;
for bId = 1 : nBlocks
    % - Extract days and the seas value that follows "to" (named 'distance' below).
    pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)' ;
    tokens = regexp( dataBlocks{bId}, pattern, 'tokens' ) ;
    tokens = vertcat( tokens{:} ) ;
    % - Convert the numeric column to double, and cell array to struct array.
    if ~isempty( tokens )
        tokens(:,2) = num2cell( str2double( tokens(:,2) )) ;
        data{bId} = cell2struct( tokens, {'day', 'distance'}, 2 ) ;
    end
end
With that, we get:
>> data
data =
[2x1 struct]
[2x1 struct]
>> data{1}
ans =
2x1 struct array with fields:
day
distance
>> data{1}(1)
ans =
day: 'Wednesday'
distance: 2
>> data{1}(2)
ans =
day: 'Thursday'
distance: 1
This illustrates one way to do it. Pattern matching can be improved, and you will want to modify the structure of the output for it to fit with your needs.
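As one possible refinement, the three notations mentioned in the question ("Seas 1.0m", "Seas to 1.0m", "Seas 1.0m to 2.0m") can be handled by first matching the whole phrase and then taking the largest number in it. The sketch below works on a single day's text copied from the second sample block; the names numPat, labels and maxima are hypothetical, and the pattern has only been reasoned through against the phrasings visible in the sample.

```matlab
% One day's forecast text from the sample (char(10) is a line feed).
dayText = [ 'Wednesday until midnight: W/SW winds 15/20 knots easing to', char(10), ...
            'during the evening. Seas 1.0m to 1.5m. Swell to 2.5m to 3.5m.', char(10), ...
            'Swell at Cottesloe: to 1.0m.' ] ;

numPat = '\d+(?:\.\d+)?' ;            % a height such as 1.5 or 3
labels = { 'Seas', 'Swell' } ;
maxima = nan( 1, numel( labels )) ;   % NaN where a label has no forecast

for k = 1 : numel( labels )
    % Label followed by one or more heights, each optionally preceded by
    % "rising" and/or "to". "Swell at Cottesloe" does not match, because
    % "at" is neither a number nor one of the optional words.
    phrasePat = [ labels{k}, '(?:\s+(?:rising\s+)?(?:to\s+)?', numPat, 'm)+' ] ;
    phrases = regexp( dayText, phrasePat, 'match' ) ;
    if ~isempty( phrases )
        heights = str2double( regexp( strjoin( phrases, ' ' ), numPat, 'match' )) ;
        maxima(k) = max( heights ) ;
    end
end
% maxima is [1.5 3.5] for this sample: maximum seas, then maximum swell.
```

Days without a seas or swell figure (such as Saturday in the sample) keep NaN, which makes them easy to skip when looking for the furthest day that carries a forecast.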
Let me know if you have any questions.
PS: the best way to understand the code is to run it step by step in the debugger and see what happens as each line is executed. For example, when we get tokens it is a cell array of cell arrays; we then VERTCAT its content to transform it into a simple/flat cell array, then convert its second column to double, etc.
To use the debugger, set a breakpoint by clicking on the dash to the right of the line number, execute the code (a green arrow will appear, indicating the next line to execute), and click on the Step button. At each step, you can use the command window/workspace/editor (mouse over) to see the state/content of variables.
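The token handling described in the PS can also be tried outside the loop on a tiny, made-up two-day string. The text in txt is invented for illustration (it is not BOM data), and nested, flat and values are illustrative names.

```matlab
% Invented two-day text; each day starts on a new line, as in the forecasts.
txt = [ char(10), 'Thursday: winds. Seas to 1.0m.', ...
        char(10), 'Friday: winds. Seas to 2.0m.' ] ;

pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)' ;

% One inner cell per match, each holding the two tokens {day, value}.
nested = regexp( txt, pattern, 'tokens' ) ;

% Stack the inner 1x2 cells vertically to get a flat 2x2 cell array.
flat = vertcat( nested{:} ) ;

% Second column converted from text to numbers.
values = str2double( flat(:,2) ) ;
```

Here nested is a 1x2 cell whose elements are themselves 1x2 cells, flat is a plain 2x2 cell array with the day names in the first column, and values is a 2x1 double.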
EDIT: just a few extra explanations about patterns.
The first pattern
pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*'
is fairly easy to understand:
  • \* means: the character *; it has to be escaped because the star otherwise has a special meaning: it is a quantifier that means "zero or more times the expression that precedes".
  • \d means: any digit 0-9
  • \d{4} means: four times \d
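A quick way to get a feel for this pattern is to run it on a single header line. The sketch below uses the second header line from the question; stamp and bare are illustrative names.

```matlab
pattern = '\*\d\d:\d\d \d\d/\d\d/\d{4}\*' ;
sample  = '*16:30 28/05/2008* 100444518 CRAFA=16:30 CGFCS=16:30' ;

% 'once' returns the first match as a char vector instead of a cell array.
stamp = regexp( sample, pattern, 'match', 'once' ) ;

% Without the escaped stars, only the time and date themselves are matched.
bare = regexp( sample, '\d\d:\d\d \d\d/\d\d/\d{4}', 'match', 'once' ) ;
```

The later "16:30" fragments (after CRAFA= and CGFCS=) are not matched, because they are not followed by a space and a date.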
The second pattern
pattern = '[\r\n](\S+day).*?Seas.*?to\s+([^m]+)' ;
is more complex. REGEXP matches the whole pattern but extracts only the parts in parentheses (called tokens).
  • |[\r