Data Product Leads Drive Core Data Product Creation

Summary

All of the steps to specify and transform raw data from instruments into Level 1 and Level 2 OOI core data products are set up and executed. Beyond the initial specification of the algorithms, these include the implementation of algorithms to perform conversion to scientific units, application of quality control, and creation of new data products from multiple input data sources.

The existence of Level 0 data for the required instrument signals is assumed.

Review Status: Ready for OOI Review
AS Priority: 5
AS Version: 3.2

The scenario has no diagram.

Issues Status (Jira): Overview, All, Unresolved
Custom Issues Lists: Marine IO Review, Marine IO Processes, CI IO Verify

The custom issue lists are as follows. They include both open tasks and tasks marked as fixed.

  • Marine IO Review issues are called to the attention of the Marine IOs for their review.
  • Marine IO Processes issues are expected to require further consideration/understanding of the Marine IO processes.
  • CI IO Verify issues are generally resolved, but the resolution needs to be confirmed with appropriate CI experts.

Outline

Related Use Cases

Use Cases Mapped to This Scenario

The following Use Cases have been mapped to this Acceptance Test Scenario:

Use Cases Cited by This Scenario

This Acceptance Test Scenario cites the following Use Cases:

Key

This text style = background material
This text style = priority <3> (not required).

( ) Indicates footnoted material targeted for Release 3.
( ) Indicates footnoted material targeted for Release 4.
[MI], [Ops] indicates material provided by the MI or Ops team (has no use case).
[NoUC] indicates material for which no Use Case exists.

Other References

Documents in this section are maintained in the OOI Alfresco system (login required). Please see the Data Product Specifications page in the science space for up-to-date references.

  • Data Product Specification Approval Procedure
  • Data Product Specification Outline
  • Data Processing Flow Template
    • A better example is the 1342-00010 CTD Data Processing Flow under 'CTD' in the Approved Specifications section
  • OOI Data Products Tracking Sheet

Overview Diagram

Not available.

Roles

CI Point of Contact: CI person named in the OOI Data Products Tracking Sheet as the point of contact for a given algorithm. (In this case, Ted is an Integrated Observatory Operator serving as a Data Process Programmer.) Can create and manage data processing resources (Data process definition, Data visualization process definition); register external sources (Data process definition, Data visualization process definition); and schedule data process execution.

Data Product Lead: OOI person responsible for specifying the Data Product algorithm and determining when and what type of changes need to be made.

Resource Operator: A Registered User who has been granted privileges to select resources within an observatory. (In this case, Dr. L is assigned privileges to manage certain data processes.) Can edit resource settings and attributes, create an event related to the resource, and manage annotational material of the resource.

End-to-End Scenario

Dr. L is an OOI Project Scientist responsible for developing and managing the processing steps necessary to create core CTD data products. These include Level 1 conductivity, temperature, and depth data, and Level 2 salinity and density data.

Relevant Use Cases
UC.R2.03 --- Produce Real-Time Calibrated Data

Describe Product Algorithm(s)

Dr. L, the Data Product Lead for OOI, describes the processing steps necessary to create each core data product by writing a Data Process Specification (DPS) for each step and producing a Data Product Flow Diagram that shows how the DPSs are sequenced. Each DPS is a human-readable document that describes the algorithm, its theory, and its limitations, including example code or pseudocode where available. It also defines the required inputs and the output data and metadata, including any quality flags, and provides test input and output data. ( 1 ) After soliciting appropriate stakeholder review, Dr. L submits the DPS to the Senior Project Scientist at Ocean Leadership and the Marine Integration Lead in the CI IO for approval, as specified in the DPS Approval Process.

In this case, Dr. L wishes to create the Level 2 density data from Level 1 conductivity, temperature, and depth data, which then undergo range checks and tests for spikes, stuck values, and trends. Each of these quality checks is documented in a quality control DPS, as shown in the Data Processing Flow Diagram. Dr. L provides the DPSs and the Data Product Flow Diagram to the CI point of contact, Ted, who is named in the OOI Data Products Tracking Sheet.

Note: The CTD Data Processing Flow Diagram provides an example of the complete set of these steps, and the OOI Data Management Plan documents the nomenclature used to specify different levels of processing (Level 0, Level 1, Level 1a, etc.).
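
As an illustration only, the automated quality checks named above (range, spike, and stuck-value tests) might be sketched in Python as follows. The function names, the thresholds, and the convention that True means "passed" are assumptions made for this sketch; the actual algorithms are whatever the quality control DPSs define, and the trend test is omitted here.

```python
from typing import List, Sequence


def range_check(values: Sequence[float], lo: float, hi: float) -> List[bool]:
    """Return True for each value inside the inclusive range [lo, hi]."""
    return [lo <= v <= hi for v in values]


def spike_check(values: Sequence[float], max_jump: float) -> List[bool]:
    """Return False where a value differs from BOTH neighbours by more than max_jump.

    The first and last values have only one neighbour and are not tested.
    """
    ok = [True] * len(values)
    for i in range(1, len(values) - 1):
        jump_before = abs(values[i] - values[i - 1])
        jump_after = abs(values[i] - values[i + 1])
        if jump_before > max_jump and jump_after > max_jump:
            ok[i] = False
    return ok


def stuck_value_check(values: Sequence[float], max_repeats: int) -> List[bool]:
    """Return False for every value in a run of more than max_repeats identical values."""
    ok = [True] * len(values)
    run_start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(values) or values[i] != values[run_start]:
            if i - run_start > max_repeats:
                for j in range(run_start, i):
                    ok[j] = False
            run_start = i
    return ok


# Hypothetical Level 1 temperature samples (degrees C); limits are illustrative.
temperatures = [10.1, 10.2, 35.0, 10.3, 10.3, 10.3, 10.3, -9.0]
in_range = range_check(temperatures, lo=-2.0, hi=40.0)
no_spike = spike_check(temperatures, max_jump=5.0)
not_stuck = stuck_value_check(temperatures, max_repeats=3)
```

Each check returns one flag per input value, so the results can travel with the data as a separate quality flag product, in the way the DPS describes for output metadata.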
Future Release Notes
Release 3

Encode Product Transform

Relevant Use Cases
UC.R2.03 --- Produce Real-Time Calibrated Data
UC.R2.21 --- Transform Data in Workflow
UC.R2.47 --- Define Executable Process
UC.R2.48 --- Schedule Process for Execution

Ted accepts the DPS and creates a Transform within the Integrated Observatory system for each distinct processing step in the DPS. To create this transform, he develops code based on either the DPS-provided code or the DPS-provided algorithm. The code serves as a Data Process Definition in the Integrated Observatory. He then uses that definition as the basis of a Transform that can be executed within the Integrated Observatory. In the case of Level 1 data, for example, the Transform may be a calibration applied to the real-time data that arrives ( 1 ). (See UC.R2.47 Define Executable Process, UC.R2.03 Produce Real-Time Calibrated Data and UC.R2.21 Transform Data in Workflow.)

Ted tests each Transform by applying it to the supplied test data, and communicates with Dr. L on any failures or concerns. The results of the test are archived for future reference. As Transforms depend on the previous Transform for their input, Ted can specify their linkage. Through chaining of different transforms, Ted can create the entire flow specified by the Data Process Flow Diagram. (Multiple Transforms can receive their input from a single predecessor Transform.)
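
The linkage described above, including one predecessor feeding several successors, can be pictured with a small Python sketch. The `Transform` class, its method names, and the calibration and conversion lambdas are all hypothetical stand-ins for illustration; they are not the Integrated Observatory's interfaces.

```python
from typing import Callable, Dict, List


class Transform:
    """Hypothetical stand-in for one processing step in the flow diagram."""

    def __init__(self, name: str, func: Callable[[float], float]) -> None:
        self.name = name
        self.func = func
        self.successors: List["Transform"] = []

    def feeds(self, successor: "Transform") -> "Transform":
        """Link this step's output to a successor's input; returns the successor."""
        self.successors.append(successor)
        return successor

    def process(self, value: float, results: Dict[str, float]) -> None:
        """Apply this step, record its output, and pass the output downstream."""
        out = self.func(value)
        results[self.name] = out
        for successor in self.successors:
            successor.process(out, results)


# Illustrative flow: one Level 1 step feeding two successors.
calibrate = Transform("L1_temperature", lambda counts: counts / 100.0 - 5.0)
calibrate.feeds(Transform("range_flag", lambda t: 1.0 if -2.0 <= t <= 40.0 else 0.0))
calibrate.feeds(Transform("temperature_kelvin", lambda t: t + 273.15))

results: Dict[str, float] = {}
calibrate.process(2000.0, results)
```

Running the first step with the DPS-supplied test input and comparing every entry in `results` against the DPS-supplied test output is one way the per-Transform tests could be organized.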

As this is a routine core Data Product Specification, it gets executed whenever the raw data from the corresponding sensor arrives. (The raw data arrival is the trigger for executing the process, in other words.) If there are multiple inputs that feed a particular process, and only one serves as the trigger, the DPS must specify which input serves as the trigger. On the other hand, if the process should execute only at a given interval — for example, if it produces hourly statistics — the appropriate time trigger should be specified in the DPS, and will be scheduled accordingly. ( 2 ) (See UC.R2.48 Schedule Process for Execution.)
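
A minimal Python sketch of the two triggering styles follows. The class and function names are hypothetical, and the choice to wait until every input has reported at least once is an assumption of this sketch rather than something the scenario specifies.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple


class TriggeredProcess:
    """Caches the latest value of every input but executes only on the trigger input."""

    def __init__(
        self,
        inputs: Sequence[str],
        trigger: str,
        func: Callable[[Dict[str, Optional[float]]], Dict[str, Optional[float]]],
    ) -> None:
        if trigger not in inputs:
            raise ValueError("trigger must be one of the inputs")
        self.trigger = trigger
        self.func = func
        self.latest: Dict[str, Optional[float]] = {name: None for name in inputs}
        self.outputs: List[Dict[str, Optional[float]]] = []

    def on_arrival(self, name: str, value: float) -> None:
        """Record an arriving value; run the process only if it is the trigger."""
        self.latest[name] = value
        ready = all(v is not None for v in self.latest.values())
        if name == self.trigger and ready:
            self.outputs.append(self.func(self.latest))


def hourly_means(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average (epoch_seconds, value) samples per hour, as a scheduled job might."""
    bins: Dict[int, List[float]] = {}
    for epoch_seconds, value in samples:
        bins.setdefault(epoch_seconds // 3600, []).append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in bins.items()}
```

In the first style the DPS would name `trigger`; in the second, the DPS would name the interval (one hour here) and the scheduler would call something like `hourly_means` at that interval.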

( 3 )

Future Release Notes
Release 3

Activate Product Transform

Relevant Use Cases
UC.R2.03 --- Produce Real-Time Calibrated Data
UC.R2.21 --- Transform Data in Workflow
UC.R2.23 --- Ingest Data Supplement
UC.R2.28 --- Manage Resource Metadata

After Ted receives permission from Dr. L, the Transforms are activated, and the data product generation begins as soon as the source data is available. These configured transforms are acting on streams, not stored data, so they only execute as 'new' data arrives. (See UC.R2.23 Ingest Data Supplement.) For this reason, the series is activated starting with the last one, and proceeding back to the first; then all the Transforms are performed in sequence once the first generates a product. (See UC.R2.03 Produce Real-Time Calibrated Data and UC.R2.21 Transform Data in Workflow.)
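
The reason for activating the series from last to first can be seen in a toy Python model, in which a stage that is not yet active simply misses whatever arrives. The `Stage` class and helper functions are hypothetical and exist only for this sketch.

```python
from typing import List, Optional, Sequence


class Stage:
    """Toy stream stage: values reaching an inactive stage are lost, not queued."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.active = False
        self.downstream: Optional["Stage"] = None
        self.received: List[float] = []

    def push(self, value: float) -> None:
        if not self.active:
            return  # stream data is not stored, so an inactive stage misses it
        self.received.append(value)
        if self.downstream is not None:
            self.downstream.push(value)


def build_chain(names: Sequence[str]) -> List[Stage]:
    """Create stages and link each one to the next in order."""
    stages = [Stage(name) for name in names]
    for upstream, downstream in zip(stages, stages[1:]):
        upstream.downstream = downstream
    return stages


def activate_last_to_first(stages: Sequence[Stage]) -> None:
    """Activate consumers before producers so no intermediate output is dropped."""
    for stage in reversed(stages):
        stage.active = True
```

If only the first stage of a three-stage chain were active when a value arrived, the first stage would produce output that the later stages never see; activating in reverse order avoids that window.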

However, as noted below, it may be necessary to reprocess data using a particular Transform. In that case, a separate process retrieves the data from storage and feeds it to the Transform as a stream.
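
A short Python sketch of that idea: because the Transform only ever sees a stream, archived values can be replayed through the same code path as live values. The function names and the example transform are assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List


def replay(stored: Iterable[float]) -> Iterator[float]:
    """Yield archived values one at a time, as if they were arriving on a stream."""
    for value in stored:
        yield value


def run_transform(stream: Iterable[float], transform: Callable[[float], float]) -> List[float]:
    """Apply a transform to each value of a stream, whether live or replayed."""
    return [transform(value) for value in stream]


# Hypothetical archived input and a placeholder transform.
archived = [1.0, 2.0, 3.0]
reprocessed = run_transform(replay(archived), lambda v: v * 2.0)
```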

The data products from each Transform are associated with various supporting resources, including the DPS and Data Process Flow Diagram, the input data set, any associated quality flag data products for reference, and the data product's metadata. (See UC.R2.28 Manage Resource Metadata.)

Human in the Loop procedures

Relevant Use Cases
RSN.UC.5 --- Sensor Data Validation - CTD
RSN.UC.6 --- Sensor Data Validation - Hydrophone
UC.R2.04 --- Browse to Get Data Product
UC.R2.18 --- Visualize Data Product
UC.R2.19 --- Produce Matlab Visualization
UC.R2.20 --- Annotate Resource in Registry
UC.R2.24 --- Search for Resource
UC.R2.29 --- Integrate External Dataset

In some cases, a data product is not created entirely by software, but requires human action. To create the Level 1c (human-in-the-loop quality controlled) data products for Conductivity, Temperature, and Depth, Dr. L looks at graphs of the values against time for several time periods in the data product. (See UC.R2.18 Visualize Data Product.) She uses Matlab to create some plots, so she can check for problems that aren't caught by the QC procedures, but that are easily visible to the human eye. (See UC.R2.19 Produce Matlab Visualization.)

When she sees odd values, she checks for any error messages associated with the input datasets; she finds these by following the links that the Integrated Observatory provides from this data set to associated resources, for example the source instruments, and from them to their logs. (See UC.R2.26 Navigate Resources and Metadata.) Dr. L also wishes to compare the results to CTD data nearby, so she uses the Browse and Search capabilities of the Integrated Observatory to find other CTD data from a geospatial box of interest, and view graphs of those as well. (See UC.R2.04 Browse to Get Data Product and UC.R2.24 Search for Resource.)

Dr. L flags each of the odd values. Each flag has two components that are attached to the value(s) in question: the type of error that has been detected (chosen from a controlled vocabulary), and (optionally) explanatory text or references to possible explanations. These flags are associated with the original data, either as an annotation that refers to the related data (the reference can be to a range of data), or within a modified data set that references each data item to which it applies. The flags are produced outside of the Integrated Observatory system, and recorded in a format (to be determined; netCDF Climate and Forecast conventions provide one good model) that will support scientific analysis of the data.
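
The two-part flag described above could be represented, for example, by the following Python sketch. The vocabulary terms, field names, and index-range convention are assumptions made for illustration; the scenario leaves the actual recording format to be determined.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    """Illustrative controlled vocabulary; the real term list is not given here."""

    SPIKE = "spike"
    STUCK_VALUE = "stuck_value"
    OUT_OF_RANGE = "out_of_range"
    OTHER = "other"


@dataclass(frozen=True)
class QCFlag:
    """A human-assigned flag that refers to one value or a range of values."""

    product_id: str          # hypothetical identifier of the flagged data product
    start_index: int         # first flagged sample
    end_index: int           # last flagged sample (inclusive)
    error_type: ErrorType    # required: chosen from the controlled vocabulary
    note: Optional[str] = None  # optional explanatory text or reference

    def __post_init__(self) -> None:
        if self.end_index < self.start_index:
            raise ValueError("end_index must not precede start_index")

    def covers(self, index: int) -> bool:
        """Return True if the flag applies to the sample at this index."""
        return self.start_index <= index <= self.end_index


flag = QCFlag("ctd_temperature_L1", 10, 12, ErrorType.SPIKE, "see instrument log")
```

Using an enumeration means a term outside the vocabulary is rejected when the flag is created, while the free-text note stays optional.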

The flags that Dr. L has produced can be submitted back to the Integrated Observatory, if desired. They can be submitted as an annotation to an existing data product (see UC.R2.20 Annotate Resource in Registry), or as a new data product (see UC.R2.29 Integrate External Dataset).

If Dr. L wants to produce a new data value because the original is suspect, this involves creating a new data product, which can contain the replacement values. A new data product (whether the modified data set described above or this data with some new values in it) must include provenance referencing the original data and the processing that was applied to produce the new results. The new data product can be submitted to the Integrated Observatory, along with the specified metadata. If the new data product was produced by a formal OOI process intended to be repeated, the submission may be facilitated through automated data acquisition procedures set up by Ted. On the other hand, a one-off human-in-the-loop QC review would be submitted manually by the person performing the review, following procedures similar to those for other data submitted by early adopter data contributors. (See UC.R2.29 Integrate External Dataset.)

Update Product Transform with New Version

Relevant Use Cases
UC.R2.08 --- Manage Instrument Lifecycle
UC.R2.18 --- Visualize Data Product
UC.R2.22 --- Version Data Set

Initially, Dr. L may designate that the generated data products remain private, while she verifies the results on real data streams. She can create a subscription so that she is notified when there is a supplement to one of the core data products for which she has provided downstream Transformations. (See the analogous UC.R2.08 Manage Instrument Lifecycle.) She views and downloads the contents of the data product, and also views graphs of values against time. (See UC.R2.18 Visualize Data Product.)

If she determines that the algorithm needs to be adjusted, she creates a new version of the DPS and provides it to Ted for implementation. The Data Process Definition and Transform are versioned resources within the Integrated Observatory, referencing the appropriate version of the DPS. The data product content created with the new software likewise has a new version, and the data content replaced by the new data product is marked as deprecated. (See UC.R2.22 Version Data Set.) Data products created with old algorithms retain links to the appropriate versions of their DPSs and Processes. When it is deemed appropriate, the DPSs and Processes are also marked as deprecated.

The subscription that Dr. L created may have to be updated as well. Typically a subscription refers to the best (i.e., non-deprecated) version of a data product, so that the subscription remains current even if a new version of the data content is being produced. But if Dr. L wants to watch for updates to the deprecated version, she may want to subscribe specifically to that version.

If a new version of a Transform needs to be run against archival data, Ted must be informed of the data product(s) ( 1 ) that must be used as input. He creates a new Transform that gets its data not from the real-time data product source but from the archived form of that input data product. His execution of the process produces new versions of the processed data products, which deprecate the previously created (and possibly persisted) processed data products. ( 2 ) The deprecation/replacement of the previous products is noted in metadata and/or annotations made to the deprecated data products. ( 3 ) (See UC.R2.08 Manage Instrument Lifecycle and UC.R2.22 Version Data Set.)
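
The versioning and deprecation behaviour described in this section can be sketched in Python as follows. The class names, fields, and the rule that adding a version deprecates all earlier ones are assumptions made for illustration, not the Integrated Observatory's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProductVersion:
    version: int
    dps_version: str          # link back to the DPS version that produced it
    deprecated: bool = False


@dataclass
class DataProduct:
    name: str
    versions: List[ProductVersion] = field(default_factory=list)

    def add_version(self, dps_version: str) -> ProductVersion:
        """Add a new version and mark every earlier version as deprecated."""
        for old in self.versions:
            old.deprecated = True
        new = ProductVersion(len(self.versions) + 1, dps_version)
        self.versions.append(new)
        return new

    def best(self) -> ProductVersion:
        """Return the newest non-deprecated version, which a default subscription follows."""
        candidates = [v for v in self.versions if not v.deprecated]
        if not candidates:
            raise LookupError("no non-deprecated version of " + self.name)
        return candidates[-1]
```

A subscription that stores only the product name and calls `best()` stays current across reprocessing, whereas a subscription that stores a specific `ProductVersion` keeps watching the deprecated content. Deprecated versions remain in the list with their `dps_version`, so the link to the DPS that generated them is retained.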
Future Release Notes
Release 3

Publicize Data Product(s)

Relevant Use Cases
UC.R2.26 --- Navigate Resources and Metadata
UC.R2.42 --- Define Resource Policy

When ready, Dr. L changes the resulting data products from private to public. The system verifies that Dr. L has permission to do so based on her role in the Integrated Observatory, and the data policy for this data product. (See UC.R2.42 Define Resource Policy.)
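
As a rough Python sketch of that check, the visibility change succeeds only when the user's roles intersect the roles that the product's policy allows to publish. The role names, the `Product` fields, and the use of `PermissionError` are hypothetical choices for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass
class Product:
    name: str
    visibility: str = "private"
    # Hypothetical policy: roles permitted to make this product public.
    publish_roles: FrozenSet[str] = frozenset({"data_product_lead"})


def publicize(product: Product, user_roles: Iterable[str]) -> None:
    """Make the product public if the user holds a role the policy allows."""
    if not (product.publish_roles & set(user_roles)):
        raise PermissionError("user may not publicize " + product.name)
    product.visibility = "public"
```

The product stays private if the check fails, because the visibility is changed only after the role test passes.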

The data products are now searchable and accessible to all users through the Integrated Observatory interface, and any access pathway for the data also provides links to the supporting DPSs, input data, and Process Definition metadata to provide full provenance. (See UC.R2.26 Navigate Resources and Metadata.)

Transition Data Sources

Relevant Use Cases
UC.R2.24 --- Search for Resource
UC.R2.26 --- Navigate Resources and Metadata
UC.R2.28 --- Manage Resource Metadata

The system also supports transitioning between sensors. Consider the case where a decision is made within OOI to replace the original CTD sensors with a new model. All Data Process Definitions are searchable resources in the Integrated Observatory portal, so Dr. L can find and access the Data Process Definitions for CTD data product generation. (See UC.R2.24 Search for Resource.) These have links to the Data Product Specification documents on which they were based, and so those documents can be found and reviewed. (Note the primary source of Data Product Specification documents may be either the Integrated Observatory or Alfresco.)

Dr. L is able to adjust the DPS if necessary to accommodate the new CTD's data structure, although occasionally the Level 0 signal/data product produced by the new CTD is an entirely equivalent input ( 1 ), and to make any other appropriate changes to accommodate the new instrument. As discussed above, versioning capabilities for related artifacts and resource definitions allow data products to maintain links to the versions of the DPSs and Data Processes that generated them, while also clearly indicating deprecated products. (See UC.R2.26 Navigate Resources and Metadata.)

( 2 )

All source instrument instances' make and model, with any known accuracy and performance information, are provided or referenced in the metadata of the data product, so that end users are aware of changes and can appropriately consider any impacts on their use of the data. (See UC.R2.28 Manage Resource Metadata.)

Future Release Notes
Release 3

  1. Apr 25, 2012

    Maurice Manning says:

    SA uses Data Process Definition in the way it is used here. It uses Data Process instead of Transform. Transform, in the code base, is only a DM term.