[GIS] Setting up THREDDS catalogs for ocean model data

netcdf opendap thredds wcs xml

I've got a bunch of NetCDF files from ocean modeling runs. Some are
3-day forecasts that are run daily, some are hindcast runs where each
hindcast is split into multiple files like ocean_his_0001.nc,
ocean_his_0002.nc, etc, and some are just random netcdf files required
for input (forcing files, grid files, etc).

I would like to serve these data using the Unidata THREDDS Data Server, so my data can be made available via OPeNDAP, WCS and WMS, but the documentation is a bit overwhelming.

Are there some simple suggestions for how to set up THREDDS catalogs to handle
this diverse collection of model-related netcdf files?

Best Answer

Here's what we've been doing to set up THREDDS Data Server (TDS) catalogs for regional oceanographic modeling providers in the US Integrated Ocean Observing System to serve their model results.

There are four basic types of catalogs we have been setting up:

  • A top level catalog that points to other catalogs that you want exposed
  • An "all" catalog that automatically scans a directory tree for netcdf (and grib, etc) files
  • Catalogs that aggregate regional model results by concatenating along the time dimension
  • Catalogs that aggregate forecast model results by using the special Forecast Model Run Collection feature of the TDS.

So we'll go through each type. But before modifying any catalogs, verify that TDS is up and running with the test catalog and datasets. Go to http://localhost:8080/thredds and drill down on one of the test data sets to the OpenDAP service to make sure everything looks okay in the OpenDAP Data Access page.

Top level catalog (catalog.xml)

I use the top level catalog as a table of contents whose sole purpose is to point to other catalogs that you want to advertise. The following catalog.xml example is simply pointing to two regional modeling catalogs:

<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    name="THREDDS Top Catalog, points to other THREDDS catalogs" version="1.0.1">

    <dataset name="NCSU MEAS THREDDS catalogs">
        <catalogRef xlink:href="gomtox_catalog.xml" xlink:title="GOMTOX (Gulf of Maine) Ocean Model" name=""/>
        <catalogRef xlink:href="sabgom_catalog.xml" xlink:title="SABGOM (South Atlantic Bight and Gulf of Mexico) Ocean Model" name=""/>
    </dataset>

</catalog>
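
Since the top level catalog is just a table of contents, it can be handy to list what it points to from a script. Here is a minimal Python sketch using only the standard library; the catalog text is the example above pasted into a string, and the function name list_catalog_refs is made up for illustration (it is not part of any THREDDS tooling).

```python
import xml.etree.ElementTree as ET

# Namespaces used by the catalog example above.
NS = {
    "cat": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0",
    "xlink": "http://www.w3.org/1999/xlink",
}

CATALOG_XML = """\
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    name="THREDDS Top Catalog, points to other THREDDS catalogs" version="1.0.1">
    <dataset name="NCSU MEAS THREDDS catalogs">
        <catalogRef xlink:href="gomtox_catalog.xml" xlink:title="GOMTOX (Gulf of Maine) Ocean Model" name=""/>
        <catalogRef xlink:href="sabgom_catalog.xml" xlink:title="SABGOM (South Atlantic Bight and Gulf of Mexico) Ocean Model" name=""/>
    </dataset>
</catalog>
"""


def list_catalog_refs(xml_text):
    """Return (title, href) pairs for every catalogRef in a catalog document."""
    root = ET.fromstring(xml_text)
    # Namespaced attributes are keyed as "{namespace-uri}localname".
    href_key = "{%s}href" % NS["xlink"]
    title_key = "{%s}title" % NS["xlink"]
    return [
        (ref.get(title_key), ref.get(href_key))
        for ref in root.iterfind(".//cat:catalogRef", NS)
    ]


if __name__ == "__main__":
    for title, href in list_catalog_refs(CATALOG_XML):
        print(f"{href}: {title}")
```

A parse error here is also a quick hint that the catalog is not well-formed XML before you ask the TDS to load it.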

The "All" Catalog (all.xml)

It is quite convenient to have a catalog that automatically gives you access to all data files in a particular directory tree via the TDS services. The datasetScan feature in the TDS scans a specified directory tree for files matching certain patterns or file extensions.

This could be your whole disk, or just a particular directory. In the following example, the TDS will scan the /data1/models directory tree for NetCDF, NcML and GRIB files (the wildcards listed in the filter element), sort them alphabetically by name, and include the file size. The data will be served via OPeNDAP and HTTP, along with the other services listed in the compound service, with HTTP just allowing people to download the existing file in its native format.

<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    name="THREDDS Catalog for NetCDF Files" version="1.0.1">
    <service name="allServices" serviceType="Compound" base="">
        <service name="ncdods" serviceType="OpenDAP" base="/thredds/dodsC/"/>
        <service name="HTTPServer" serviceType="HTTPServer" base="/thredds/fileServer/"/>
        <service name="wcs" serviceType="WCS" base="/thredds/wcs/"/>
        <service name="ncss" serviceType="NetcdfSubset" base="/thredds/ncss/grid/"/>
        <service name="wms" serviceType="WMS" base="/thredds/wms/"/>
        <service name="iso" serviceType="ISO" base="/thredds/iso/"/>
        <service name="ncml" serviceType="NCML" base="/thredds/ncml/"/>
        <service name="uddc" serviceType="UDDC" base="/thredds/uddc/"/>
    </service>

    <datasetScan name="Model Data" ID="models" path="models" location="/data1/models">
        <metadata inherited="true">
            <serviceName>allServices</serviceName>
            <publisher>
                <name vocabulary="DIF">USGS/ER/WHCMSC/Dr. Richard P. Signell</name>
                <contact url="http://www.usgs.gov/" email="rsignell@usgs.gov"/>
            </publisher>
        </metadata>
        <filter>
            <include wildcard="*.ncml"/>
            <include wildcard="*.nc"/>
            <include wildcard="*.grd"/>
            <include wildcard="*.nc.gz"/>
            <include wildcard="*.cdf"/>
            <include wildcard="*.grib"/>
            <include wildcard="*.grb"/>
            <include wildcard="*.grb2"/>
            <include wildcard="*.grib2"/>
        </filter>
        <sort>
            <lexigraphicByName increasing="true"/>
        </sort>
        <addDatasetSize/>
    </datasetScan>

</catalog>

You could reference this catalog in your catalog.xml file, or you might feel that advertising a link to all your data files would be confusing to some users. If you don't put the catalog in catalog.xml, you must add a reference to it in the threddsConfig.xml file in order for it to be read by the TDS.

So if your catalog is called "all.xml", you would need a line in threddsConfig.xml that looks like this:

<catalogRoot>all.xml</catalogRoot>
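
To make the datasetScan in all.xml a bit more concrete: a file found under the location directory is exposed at the service base, plus the path attribute, plus the file's path relative to location. The Python sketch below mimics that mapping for the OPeNDAP service. It is only an approximation of what the TDS does; fnmatch stands in for the TDS wildcard matching, the server address is the localhost one used for testing above, and the "run1" subdirectory is a made-up example.

```python
from fnmatch import fnmatchcase
from posixpath import relpath

SCAN_LOCATION = "/data1/models"    # datasetScan "location" attribute
SCAN_PATH = "models"               # datasetScan "path" attribute
OPENDAP_BASE = "/thredds/dodsC/"   # "base" of the OpenDAP service
INCLUDES = ["*.ncml", "*.nc", "*.grd", "*.nc.gz", "*.cdf",
            "*.grib", "*.grb", "*.grb2", "*.grib2"]


def opendap_url(file_path, server="http://localhost:8080"):
    """Return the expected OPeNDAP URL for a file, or None if it is not served."""
    name = file_path.rsplit("/", 1)[-1]
    # The file name must match at least one include wildcard.
    if not any(fnmatchcase(name, pattern) for pattern in INCLUDES):
        return None
    # The file must live under the scanned directory tree.
    rel = relpath(file_path, SCAN_LOCATION)
    if rel.startswith(".."):
        return None
    return f"{server}{OPENDAP_BASE}{SCAN_PATH}/{rel}"


if __name__ == "__main__":
    # Hypothetical file layout, for illustration only.
    print(opendap_url("/data1/models/run1/ocean_his_0001.nc"))
    print(opendap_url("/data1/models/run1/notes.txt"))
```

Swapping OPENDAP_BASE for one of the other service bases in the catalog (for example /thredds/fileServer/) gives the corresponding URL for that service.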

Regional model catalogs

I suggest that you use a separate catalog for each model domain so that others can link to your catalogs in their own THREDDS catalogs in a more flexible way (e.g. your catalog for Boston Harbor could be referenced in a regional catalog for the Gulf of Maine).

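For regional model results, there are typically two types of aggregation datasets that are useful. The first aggregates along an existing time dimension, so it uses type="joinExisting" (the second, the Forecast Model Run Collection, is covered in the forecast model section further down):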

<aggregation dimName="ocean_time" type="joinExisting">
    <scan location="/media/1tb/MABGOM/Jun292008_Feb282009" regExp=".*mabgom_avg_[0-9]{4}\.nc$"/>
</aggregation>

where you can use a (Java-style) regular expression to match only certain files in a directory. Here we are matching files that look like "mabgom_avg_0001.nc". The "." means any character, so ".*" matches any run of characters (such as a leading directory path), followed by "mabgom_avg_", followed by exactly 4 digits between 0 and 9, followed by a literal ".nc" at the end of the name ("\." is an escaped dot and "$" anchors the end of the string).
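
It is worth trying a regExp against a few file names before putting it in a catalog. The sketch below does that in Python, whose regex syntax agrees with Java's for this particular pattern. It assumes the TDS requires the whole path to match, so it uses fullmatch; the ".bak" and 3-digit names are made-up negative examples.

```python
import re

# Same expression as the regExp attribute in the scan element above.
PATTERN = re.compile(r".*mabgom_avg_[0-9]{4}\.nc$")


def is_selected(path):
    """True if the whole path matches the pattern (whole-string match assumed)."""
    return PATTERN.fullmatch(path) is not None


if __name__ == "__main__":
    candidates = [
        "/media/1tb/MABGOM/Jun292008_Feb282009/mabgom_avg_0001.nc",
        "/media/1tb/MABGOM/Jun292008_Feb282009/mabgom_his_0001.nc",
        "/media/1tb/MABGOM/Jun292008_Feb282009/mabgom_avg_001.nc",
        "/media/1tb/MABGOM/Jun292008_Feb282009/mabgom_avg_0001.nc.bak",
    ]
    for path in candidates:
        print(is_selected(path), path)
```

Only the first candidate is selected: the second is a "his" file, the third has 3 digits instead of 4, and the last does not end in ".nc".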

So the entire catalog might look like:

<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0 http://www.unidata.ucar.edu/schemas/thredds/InvCatalog.1.0.3.xsd"
 xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
 xmlns:xlink="http://www.w3.org/1999/xlink"
 name="MABGOM Catalog">
    <service name="allServices" serviceType="Compound" base="">
        <service name="wms" serviceType="WMS" base="/thredds/wms/"/>
        <service name="iso" serviceType="ISO" base="/thredds/iso/"/>
        <service name="ncml" serviceType="NCML" base="/thredds/ncml/"/>
        <service name="uddc" serviceType="UDDC" base="/thredds/uddc/"/>
    </service>

    <dataset name="MABGOM Runs">

        <metadata inherited="true">
            <serviceName>allServices</serviceName>
            <creator>
                <name vocabulary="DIF">Dr. Ruoying He</name>
                <contact url="http://www.meas.ncsu.edu/faculty/he/he.html"
                    email="ruoying_he@ncsu.edu"/>
            </creator>
            <documentation xlink:href="http://www4.ncsu.edu/~rhe/project_files/muri.htm"
                xlink:title="MABGOM Circulation"/>
            <documentation type="Summary"> Hydrodynamic simulations for the Mid-Atlantic Bight and
                Gulf of Maine </documentation>
            <documentation type="Rights"> This model data was generated as part of an academic
                research project, and the principal investigators: Ruoying He (rhe@ncsu.edu) ask to
                be informed of intent for scientific use and appropriate acknowledgment given in any
                publications arising therefrom. The data is provided free of charge, without
                warranty of any kind. </documentation>
        </metadata>

        <dataset name="Tide-Averaged Data">
            <dataset name="Jun292008_Feb282009" ID="MABGOM/Jun292008_Feb282009/avg"
                urlPath="MABGOM/Jun292008_Feb282009/avg">
                <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
                    <aggregation dimName="ocean_time" type="joinExisting">
                        <scan location="/media/1tb/MABGOM/Jun292008_Feb282009"
                            regExp=".*mabgom_avg_[0-9]{4}\.nc$"/>
                    </aggregation>
                </netcdf>
            </dataset>
        </dataset>

        <dataset name="History Data">
            <dataset name="Jun292008_Feb282009" ID="MABGOM/Jun292008_Feb282009/his"
                urlPath="MABGOM/Jun292008_Feb282009/his">
                <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
                    <aggregation dimName="ocean_time" type="joinExisting">
                        <scan location="/media/1tb/MABGOM/Jun292008_Feb282009"
                            regExp=".*mabgom_his_[0-9]{4}\.nc$"/>
                    </aggregation>
                </netcdf>
            </dataset>
        </dataset>

    </dataset>
</catalog>
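
Conceptually, a joinExisting aggregation just glues the files' time axes end to end so the client sees one long dataset. The toy Python sketch below shows the idea with plain lists instead of real NetCDF variables. It assumes the scanned files are taken in file name order, which is why zero-padded names like mabgom_his_0001.nc matter; the time values are made up.

```python
def join_existing(files):
    """Concatenate per-file time values in file name order.

    files: dict mapping file name -> list of time values in that file.
    """
    joined = []
    for name in sorted(files):
        joined.extend(files[name])
    return joined


if __name__ == "__main__":
    # Made-up time values (e.g. hours since the start of the run).
    files = {
        "mabgom_his_0002.nc": [48, 72],
        "mabgom_his_0001.nc": [0, 24],
    }
    print(join_existing(files))
```

If the joined time axis is not monotonically increasing, that is usually a sign that the file names do not sort in time order.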

Forecast model catalogs

The other type of very useful catalog is a Forecast Model Run Collection (FMRC), which aggregates forecast files that have overlapping time records (e.g. 3-day forecasts, issued once a day).

For this type of catalog, we use the FMRC FeatureCollection, which creates a "best time series" view, using the most recent data from each forecast to construct a continuous aggregated time series. The files to be scanned are specified in the collection tag, and when they are scanned is specified either by a recheckAfter attribute on the collection tag or by the update tag.
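
The "best time series" idea can be sketched in a few lines of Python: for every valid time, keep the value from the most recent run that covers it. This is only an illustration of the concept, not how the TDS implements it; the run dates, the daily time step and the string values are made up.

```python
from datetime import datetime, timedelta


def best_time_series(runs):
    """Build a "best" series from overlapping forecast runs.

    runs: dict mapping run start time -> dict of valid time -> value.
    Returns a list of (valid_time, run_time, value) sorted by valid time.
    """
    best = {}
    for run_time in sorted(runs):  # oldest first, so newer runs overwrite
        for valid_time, value in runs[run_time].items():
            best[valid_time] = (run_time, value)
    return [(vt, rt, val) for vt, (rt, val) in sorted(best.items())]


if __name__ == "__main__":
    day = timedelta(days=1)
    run1 = datetime(2012, 1, 1)  # hypothetical 3-day forecast issued Jan 1
    run2 = datetime(2012, 1, 2)  # hypothetical 3-day forecast issued Jan 2
    runs = {
        run1: {run1 + i * day: "run1" for i in range(3)},
        run2: {run2 + i * day: "run2" for i in range(3)},
    }
    for valid_time, run_time, value in best_time_series(runs):
        print(valid_time.date(), "from run", run_time.date(), value)
```

In this toy case Jan 1 comes from the first run, and Jan 2 through Jan 4 come from the second run, because it is the newer forecast for the overlapping days.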

Here's a full example catalog:

<catalog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0 http://www.unidata.ucar.edu/schemas/thredds/InvCatalog.1.0.3.xsd"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink" name="OPeNDAP Data Server" version="1.0.3">

    <!-- 
        Specify the data and metadata services for this catalog
    -->
    <service name="allServices" serviceType="Compound" base="">
        <service name="ncdods" serviceType="OPENDAP" base="/thredds/dodsC/"/>
        <service name="ncss" serviceType="NetcdfSubset" base="/thredds/ncss/grid/"/>
        <service name="wms" serviceType="WMS" base="/thredds/wms/"/>
        <service name="iso" serviceType="ISO" base="/thredds/iso/"/>
        <service name="ncml" serviceType="NCML" base="/thredds/ncml/"/>
        <service name="uddc" serviceType="UDDC" base="/thredds/uddc/"/>
    </service>
    <!-- 
        Create a folder for all the FMRC Feature Collections
    -->
    <dataset name="COAWST Model Runs">
        <metadata inherited="true">
            <serviceName>allServices</serviceName>
            <authority>gov.usgs.er.whsc</authority>
            <dataType>Grid</dataType>
            <dataFormat>NetCDF</dataFormat>
            <creator>
                <name vocabulary="DIF">OM/WHSC/USGS</name>
                <contact url="http://www.usgs.gov/" email="jcwarner@usgs.gov"/>
            </creator>
            <publisher>
                <name vocabulary="DIF">OM/WHSC/USGS</name>
                <contact url="http://www.usgs.gov/" email="jcwarner@usgs.gov"/>
            </publisher>
            <documentation xlink:href="http://woodshole.er.usgs.gov/project-pages/cccp/index.html"
                xlink:title="Carolinas Coastal Change Program"/>
            <documentation xlink:href="http://geoport.whoi.edu:8081/ReadMeCOAWST.html"
                xlink:title="ReadMe.txt"/>
        </metadata>
        <!-- 
            First FMRC Feature Collection
        -->
        <featureCollection name="coawst_4_use" featureType="FMRC" harvest="true" path="coawst_4/use/fmrc">
            <metadata inherited="true">
                <documentation type="summary">ROMS Output from COAWST</documentation>
                <serviceName>allServices</serviceName>
            </metadata>
            <!-- 
                Inside the featureCollection, but outside the protoDataset, we define the NcML that happens
                before the aggregation.  To get aggregated, we must have grids, so we turn the bed params
                into grids by giving them a pseudo coordinate in Z.  If we don't do this, they will not be 
                aggregated. 
            -->
            <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
                <variable name="Nbed" shape="Nbed" type="double">
                    <attribute name="long_name" value="pseudo coordinate at seabed points"/>
                    <attribute name="standard_name" value="ocean_sigma_coordinate"/>
                    <attribute name="positive" value="up"/>
                    <attribute name="formula_terms" value="sigma: Nbed eta: zeta depth: h"/>
                    <values start="-1.0" increment="-0.01"/>
                </variable>
                <attribute name="Conventions" value="CF-1.0"/>
            </netcdf>

            <!-- 
                Specify which files to scan for the collection, and say when to scan them.
                (here we scan at 3:30 and 4:30 every morning.  4:30 is just in case the model
                finishes late)
            -->
            <collection spec="/usgs/vault0/coawst/coawst_4/Output/use/coawst_us_#yyyyMMdd_HH#.nc$"
                olderThan="10 min"/>
            <update startup="true" rescan="0 30 3,4 * * ? *" trigger="allow"/>

            <!-- 
                Specify the dataset to use for non-aggregated variables and 
                global attributes. NcML changes here are applied after the data
                has been aggregated. 
            -->
            <protoDataset choice="Penultimate">
                <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
                    <variable name="temp">
                        <attribute name="_FillValue" type="float" value="0.0"/>
                    </variable>
                    <variable name="salt">
                        <attribute name="_FillValue" type="float" value="0.0"/>
                    </variable>
                    <variable name="Hwave">
                        <attribute name="_FillValue" type="float" value="0.0"/>
                    </variable>
                    <variable name="zeta">
                        <attribute name="_FillValue" type="float" value="0.0"/>
                    </variable>
                </netcdf>
            </protoDataset>
            <!-- 
                Specify what datasets the user will access. Usually we just 
                want the "best time series" aggregation. 
            -->
            <fmrcConfig regularize="false" datasetTypes="Best"/>
        </featureCollection>

    </dataset>
</catalog>
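
In the collection spec above, the part between the # marks (yyyyMMdd_HH) is a date template that tells the TDS where to find the model run time in each file name. The Python sketch below does the equivalent extraction so you can check that your own file names carry a parseable run time. It is a stand-in for the TDS logic, not a copy of it, and the example file name is made up to fit the coawst_us_ pattern.

```python
import re
from datetime import datetime

# Mirrors "coawst_us_#yyyyMMdd_HH#.nc$": 8 date digits, underscore, 2 hour digits.
FILENAME_RE = re.compile(r"coawst_us_(\d{8}_\d{2})\.nc$")


def run_time_from_filename(filename):
    """Return the run time encoded in the file name, or None if it does not match."""
    match = FILENAME_RE.search(filename)
    if match is None:
        return None
    # Python's "%Y%m%d_%H" corresponds to the Java-style "yyyyMMdd_HH".
    return datetime.strptime(match.group(1), "%Y%m%d_%H")


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(run_time_from_filename("coawst_us_20120101_12.nc"))
    print(run_time_from_filename("coawst_us_latest.nc"))
```

Files whose names do not yield a run time this way are worth renaming or excluding before pointing an FMRC collection at the directory.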

Hopefully this is a good starting point.

The best place to find more information on setting up the TDS is usually the documents linked from the latest TDS tutorial from Unidata.

As I type this, the most recent is: https://www.unidata.ucar.edu/software/thredds/current/tds/tutorial/workshop2014.html

which links to: https://www.unidata.ucar.edu/software/thredds/current/tds/tutorial/GettingStarted.html and https://www.unidata.ucar.edu/software/thredds/current/tds/reference/collections/FeatureCollections.html
