
Curate Data Products


Summary

All of the steps to curate OOI core data products, data products from external providers, and additional science data products (e.g. from PI-owned instruments) are detailed. This scenario assumes that the capabilities defined in AS.R2.04A Product Scientists Drive Core Data Product Creation exist.

Review Status: Ready for OOI Review
AS Priority: 4
AS Version: 3.2

The scenario has no diagram, and may not get one.

Issues Status (Jira): Overview, All, Unresolved
Custom Issues Lists: Marine IO Review, Marine IO Processes, CI IO Verify

The custom issue lists are as follows. They include both open tasks, and tasks marked as fixed.

  • Marine IO Review issues are called to the attention of the Marine IOs for their review.
  • Marine IO Processes issues are expected to require further consideration/understanding of the Marine IO processes.
  • CI IO Verify issues are generally resolved, but the resolution needs to be confirmed with appropriate CI experts.

Outline

Related Use Cases

Use Cases Mapped to This Scenario

The following Use Cases have been mapped to this Acceptance Test Scenario:

Use Cases Cited by This Scenario

This Acceptance Test Scenario cites the following Use Cases:

Key

This text style = background material
This text style = priority <3> (not required).

( ) Indicates footnoted material targeted for Release 3.
( ) Indicates footnoted material targeted for Release 4.
[MI] , [Ops] Provided by MI or Ops team (has no use case).
[NoUC] indicates material for which no Use Case exists.

Overview Diagram

Not available.

Roles

  • Data Curator (Data Resource Manager) — CI staff person responsible for Data Curation activities and policies
  • Observatory Manager

References

End-to-End Scenario

The Data Curator, or others working on OOI data curation tasks, will carry out the following tasks within the Integrated Observatory (for simplicity, the scenario uses "data curator" in place of "data curator or others with data curation responsibilities"). In addition, the Data Curator will rely on many of the capabilities specified in other R2 acceptance scenarios, in particular those for AS.R2.04A Data Product Leads Drive Core Data Product Creation. Those scenarios are not repeated here.

View summary information for a selected data product

Relevant Use Cases
UC.R2.04 --- Browse to Get Data Product
UC.R2.24 --- Search for Resource
UC.R2.26 --- Navigate Resources and Metadata
UC.R2.40 --- Monitor ION Resources
UC.R2.53 --- View Modeler-Submitted Products

Once a data product is selected (through search or browse capabilities, see UC.R2.04 Browse to Get Data Product and UC.R2.24 Search for Resource) the data curator will view the full set of attributes for that data product, and will follow provided links to associated information such as annotations and Event Logs. (See UC.R2.26 Navigate Resources and Metadata.)

While the full suite of attributes for each data product will be visible, the following set are those attributes of most relevance to data curation. (See UC.R2.40 Monitor ION Resources.)

  • Curation Category
  • Product Level (e.g. L0, L1, L2)
  • Quality Control level
  • Creation Date
  • Version (Lifecycle) status, e.g. current, deprecated
  • Size
  • Any annotations
  • Active subscriptions: who has subscriptions, ( 1 ) and when those were created
  • Past subscriptions: who had subscriptions, ( 1 ) and when those were created and stopped
  • Download information: who downloaded the data product, ( 1 ) and when
  • Associated resources, such as quality flags and metadata validity
  • The process that created the data product, with links to the full process and instrument metadata
  • Invariant URL information: whether invariant IDs have been given for this data product, and if so the full list of users to whom they were provided and the dates on which they were provided
  • Exclusive rights status, exclusive rights end date, exclusive rights contact, and any needed exclusive rights notes
  • ( 2)

The data curator will also search by data process, including optionally a version number, and receive a table showing all of the data products created directly by that process. ( 3 ) ( 4 ) (See UC.R2.24 Search for Resource.)

Future Release Notes
Release 3

Evaluate status of data product processing

Relevant Use Cases
UC.R2.24 --- Search for Resource
UC.R2.40 --- Monitor ION Resources
UC.R2.41 --- Recover Failed Process
OOI is responsible for providing a defined set of core data products, which have defined processing workflows (e.g. L0 data is transformed to L1a data, L1a data is subjected to a series of automated QC steps to produce L1b data, etc.). The Data Curator assesses the completeness of core data product production by checking whether data products are completing all expected processing steps.

The Data Curator searches data resources by combinations of: data processing level (L0, L1, L2), QC level (a, b, c), and deprecation status as well as by site, platform, data product name, creation date, contact (i.e. Project scientist responsible for this data product), provider (e.g. external observatory name) and date of last update. ( 1 ) (See UC.R2.24 Search for Resource and UC.R2.40 Monitor ION Resources.)

The Data Curator subscribes to failure alerts for certain processes. (See UC.R2.41 Recover Failed Process.)

To evaluate whether there are core data products that are not being updated as expected, she searches for cases where L1b data product modification times are older than the L1a inputs from which they are derived (and similarly for L1c versus L1b, and so on). To identify possible failures to create a data product at all, she compares lists of data products to see whether the L1a conductivity data products and L1b conductivity data products represent the same set of instrument instances. (This might be performed using search commands external to the Integrated Observatory software.) As these searches combine Integrated Observatory searches with external manipulation of the results, she exports the results of Integrated Observatory searches by copy-and-paste into another application ( 2 ).
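
The two checks described above could be run outside the Integrated Observatory on an exported result list. The following is a minimal sketch under stated assumptions: the `ProductRecord` layout, its field names, and the sample IDs are hypothetical and are not defined by this scenario.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProductRecord:
    """One row of an exported search result (hypothetical layout)."""
    product_id: str
    instrument_id: str
    level: str               # e.g. "L1a" or "L1b"
    modified: datetime
    input_ids: tuple = ()    # IDs of the products this one is derived from


def find_stale(products):
    """Return IDs of products that are older than at least one of their inputs."""
    by_id = {p.product_id: p for p in products}
    stale = []
    for p in products:
        inputs = [by_id[i] for i in p.input_ids if i in by_id]
        if any(src.modified > p.modified for src in inputs):
            stale.append(p.product_id)
    return stale


def instruments_missing_level(products, upstream, downstream):
    """Return instruments that have an upstream product but no downstream one."""
    have_up = {p.instrument_id for p in products if p.level == upstream}
    have_down = {p.instrument_id for p in products if p.level == downstream}
    return sorted(have_up - have_down)


# Illustrative data only.
records = [
    ProductRecord("cond-L1a-01", "ctd-01", "L1a", datetime(2012, 3, 2)),
    ProductRecord("cond-L1b-01", "ctd-01", "L1b", datetime(2012, 3, 1),
                  ("cond-L1a-01",)),
    ProductRecord("cond-L1a-02", "ctd-02", "L1a", datetime(2012, 3, 2)),
]
print(find_stale(records))
print(instruments_missing_level(records, "L1a", "L1b"))
```

Here the L1b product for `ctd-01` is reported as stale because its input was modified later, and `ctd-02` is reported as having an L1a product with no L1b counterpart.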

The Data Curator often has searches that are idiosyncratic and cannot be constructed using the limited search criteria on the search form. Because she is familiar with the search syntax, she enters these searches into the textual search query form that is available at the search page, and submits the query. Any search results are returned as a collection, so that she can see the common metadata and perform common operations on the resulting collection of resources.
Release 3

Perform data management activities

Relevant Use Cases
UC.R2.42 --- Define Resource Policy

In carrying out data curation activities, the data curator requires authority in the system beyond that of a regular user, and close to that of an Integrated Observatory Operator. The assignment of the required authorities is provided via a Data Resource Manager role, a constrained form of the Integrated Observatory Manager.

The full set of authorities associated with this user role includes the ability to:
  • search on and view all attributes ( 1 ),
  • view all data products, including those that have been deprecated or retired and those that are private,
  • manually deprecate (mark as deprecated) or retire (make no longer usable) data products,
  • define policy for data products and their retention (see UC.R2.42 Define Resource Policy)
  • manually change data from public to private and vice-versa, and
  • manually edit the contents of those attributes that are editable ( 2 )
  • ( 3 )
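
One way to picture the Data Resource Manager role is as a named set of authorities that is checked before each operation. This is only a sketch: the enum members paraphrase the list above, while the role names and the `is_allowed` helper are hypothetical and do not describe the Integrated Observatory's actual policy mechanism.

```python
from enum import Enum, auto


class Authority(Enum):
    """Authorities paraphrased from the list above (names are illustrative)."""
    SEARCH_ALL_ATTRIBUTES = auto()
    VIEW_ALL_PRODUCTS = auto()
    DEPRECATE_OR_RETIRE = auto()
    DEFINE_POLICY = auto()
    CHANGE_VISIBILITY = auto()
    EDIT_ATTRIBUTES = auto()


# Hypothetical role table: a regular user holds none of the curation
# authorities, while the Data Resource Manager holds all of them.
ROLE_AUTHORITIES = {
    "registered_user": frozenset(),
    "data_resource_manager": frozenset(Authority),
}


def is_allowed(role, authority):
    """Return True if the role carries the authority; unknown roles get nothing."""
    return authority in ROLE_AUTHORITIES.get(role, frozenset())


print(is_allowed("data_resource_manager", Authority.DEPRECATE_OR_RETIRE))
print(is_allowed("registered_user", Authority.DEPRECATE_OR_RETIRE))
```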
Release 3

Contact users of a data product

Relevant Use Cases
UC.R2.20 --- Annotate Resource in Registry
UC.R2.26 --- Navigate Resources and Metadata

The data curator must contact users who have subscribed to, or downloaded, a particular data product. This is needed to warn users if a serious error is found in the data product, or to notify active subscribers in advance if the data product is slated for deletion in the Integrated Observatory. Once the curator knows of a data product with issues, she inspects the data product resource to see its attributes for the current and past users. (See UC.R2.26 Navigate Resources and Metadata.)

If a problem is common to multiple data products, the data curator identifies the set of data products (instead of an individual product), and views aggregate tables compiling usage attributes from all the data products in the set. (For example, one search result could thereby aggregate all the subscriptions from all the data products identified.) Because many problems affect multiple data products (potentially as many as hundreds, as for example with CTD processes), there is high value in the ability to view the product entries with their metadata fields in a single search results table. (The key attributes to view in such a table address active subscription information, past subscriber information, download information, and invariant URL information.)
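
As an illustration of the aggregation described above, the sketch below compiles, for a set of affected data products, which users touched which products. The dictionary keys (`active_subscribers`, `past_subscribers`, `downloaders`) and the sample values are assumptions made for this example, not attribute names defined by the Integrated Observatory.

```python
from collections import defaultdict


def aggregate_usage(products):
    """Map each user to the sorted IDs of the products they subscribed to or downloaded."""
    usage = defaultdict(set)
    for product in products:
        for key in ("active_subscribers", "past_subscribers", "downloaders"):
            for user in product.get(key, []):
                usage[user].add(product["id"])
    return {user: sorted(ids) for user, ids in sorted(usage.items())}


# Illustrative search result for products sharing one problem.
affected = [
    {"id": "ctd-temp-L1b", "active_subscribers": ["alice"], "downloaders": ["bob"]},
    {"id": "ctd-cond-L1b", "active_subscribers": ["alice"], "past_subscribers": ["carol"]},
]
contacts = aggregate_usage(affected)
for user, product_ids in contacts.items():
    print(user, product_ids)
```

Keying the table by user rather than by product gives the curator one contact list for the whole set, so each user can receive a single message naming every affected product.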

After sending an email to users, the data curator creates an attachment within the integrated observatory of the email contents, and associates it with the users to whom it was sent, as well as optionally the product to which it relates. The data curator also associates the same attachment to the resource(s) that are affected. ( 1 ) (See UC.R2.20 Annotate Resource in Registry.)

Release 3

Manually deprecate, retire, or delete a data product

Relevant Use Cases
UC.R2.22 --- Version Data Set
UC.R2.28 --- Manage Resource Metadata
UC.R2.38 --- Define and Use Resource Life Cycle
UC.R2.42 --- Define Resource Policy

The data curator has the ability to deprecate a selected data product or set of data products, retire it, or delete it. These words have particular meaning for data products. Deprecation means that the data should not be immediately visible to users but can still be found by general users (i.e. the default is that searches will not return deprecated data, but the option to do so can be selected by the user). Retirement means that a standard user should not be able to use (view or access) the data in any way; the data and metadata are not fully deleted from the system, and select user types (such as system administrators) will be able to access them. Retired data is likely to be moved to offline or 'dark' storage, so it is not immediately accessible even to system administrators. Deleted data are no longer kept within the system, though in an unusual case metadata might be persisted at the discretion of the deleter (as for example to indicate what happened to the data set). Data deletion is irreversible, and is intended for special cases as described below.
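
The visibility rules implied by these definitions can be summarized in a few lines. This is a sketch under assumptions: the `Lifecycle` enum, the `privileged` flag (standing in for "select user types"), and the `include_deprecated` search option are names invented for the example. Deleted products are not modeled because they no longer exist in the system.

```python
from enum import Enum


class Lifecycle(Enum):
    CURRENT = "current"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


def visible_in_search(status, privileged=False, include_deprecated=False):
    """Decide whether a product with this status appears in a user's search."""
    if status is Lifecycle.CURRENT:
        return True
    if status is Lifecycle.DEPRECATED:
        # Hidden by default, but any user may opt in to see deprecated data.
        return include_deprecated
    # Retired: standard users never see it; select user types still can.
    return privileged


print(visible_in_search(Lifecycle.DEPRECATED))
print(visible_in_search(Lifecycle.DEPRECATED, include_deprecated=True))
print(visible_in_search(Lifecycle.RETIRED, privileged=True))
```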

The data curator manually changes the lifecycle status of a selected product, in order to retire it. (See the last two steps of UC.R2.38 Define and Use Resource Life Cycle.) The data curator also can mark a data set for retirement. ( 1 ) The data curator works with the attributes of retired data products in the same way as active or deprecated data products.

Data may be deleted in the case that a data product or set of data products are egregious (perhaps a buggy process accidentally spawns a large series of spurious products) and not useful for science. In this case, the metadata should be deleted as well. Support for this kind of deletion is not expected to be common, and does not have to be built into the system, but should be possible in some form. [Ops]

( 2 )

Policies define the deprecation and retirement authorities for all OOI data. Particularly for engineering data, data from PI-owned instruments, and data from external data providers, policies determine whether and when the data provider will have control over retiring these data products. In general, the provider of external data can specify its deprecation, and request its retirement under appropriate conditions, e.g. if their data are found to be bad in some way. However, before performing retirement and deletion OOI must consider several concerns: how down-stream products are handled, the needs of current and past users of those data, usage of the data, and the policies and principles in the OOI Data Management Plan. (See UC.R2.42 Define Resource Policy.)

Release 3

Verify Metadata

Relevant Use Cases
UC.R2.28 --- Manage Resource Metadata
The data curator will define and manage metadata verification procedures. The exact procedures will be developed in a separate document, but will include the following processes:
  • Identify metadata standards and specifications to be (a) supported and (b) enforced by the Integrated Observatory. (Enforcement criteria must meet the approval of the CI Project Manager and Ocean Leadership.)
  • Perform metadata verification, including verifying that mandatory elements are present, that contents match Controlled Vocabulary terms when required, and that contents are in appropriate formats (integers in integer fields, etc.). Verification may optionally include bounds tests on metadata, such as the requirement that latitude and longitude bounds be within reasonable ranges.
  • Correct metadata describing data products, subject to data provider review.
  • Establish verification levels appropriately for data products in different data curation categories. For example, the set of mandatory elements may be larger for core OOI science data products than for engineering data or data provided by an external partner.
  • Establish when verification takes place.
  • Define errors and appropriate actions to take when metadata fails verification.
  • Request a change to the Required status of resource attributes, i.e. from optional to required.
    The method and thoroughness of the verification procedures depend on the capabilities implemented for ingesting data.
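
The verification checks listed above (mandatory elements, Controlled Vocabulary terms, formats, and optional bounds tests) could take roughly the following shape. The exact procedures are to be defined in a separate document, so everything here is an assumption for illustration: the field names, the vocabulary, the bounds, and the error wording.

```python
# Hypothetical rule tables; real rules would come from the verification procedures.
REQUIRED = ("title", "processing_level", "lat_min", "lat_max")
CONTROLLED = {"processing_level": {"L0", "L1", "L2"}}
BOUNDS = {"lat_min": (-90.0, 90.0), "lat_max": (-90.0, 90.0)}


def verify_metadata(record):
    """Return a list of error strings; an empty list means the record passed."""
    errors = []
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            errors.append(f"{field}: missing mandatory element")
    for field, allowed in CONTROLLED.items():
        value = record.get(field)
        if value not in (None, "") and value not in allowed:
            errors.append(f"{field}: not a controlled vocabulary term")
    for field, (low, high) in BOUNDS.items():
        value = record.get(field)
        if value in (None, ""):
            continue  # already reported as missing if the field is mandatory
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{field}: expected a number")
        elif not low <= value <= high:
            errors.append(f"{field}: outside {low}..{high}")
    return errors


good = {"title": "CTD", "processing_level": "L1", "lat_min": 44.0, "lat_max": 45.0}
bad = {"title": "", "processing_level": "L9", "lat_min": -120, "lat_max": "north"}
print(verify_metadata(good))
for problem in verify_metadata(bad):
    print(problem)
```

Returning every error at once, rather than stopping at the first, matches the idea of presenting a submitter with all errant fields highlighted in a single form or report.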
The data curator has the authority to reject data if metadata meeting OOI criteria is not provided. An individual presenting a data set is provided a form with the errant metadata fields highlighted, and can make corrections on the spot or in a subsequent submission. An organization submitting large amounts of data is offered a report presenting the analysis of the data curator as to metadata issues, and agreements on metadata descriptions may be managed through additional avenues (especially if multiple data sets will be submitted over time).

Data providers and Integrated Observatory Operators can edit select data product metadata after data is submitted, consistent with policy. Any such edits are subject to re-verification according to the procedures described above, and may be reviewed by the data curator before taking effect. (See UC.R2.28 Manage Resource Metadata.)

( ) Appendix: Future Capabilities

These parts of the scenario are not likely to be available in Release 2, but are considered important for the final system.

( ) Manage proprietary holds

Relevant Use Cases
  • UC.R2.28 Manage Resource Metadata
  • UC.R2.42 Define Resource Policy

All data from OOI owned and operated instruments will be made publicly available as soon as feasible. However, in certain circumstances, as allowed under the NSF data policy, PIs may request a hold period during which they have sole access to the data. Data which have national security implications may be placed under permanent hold. When a legitimate hold is requested (which will be completed outside of the IO system through a written request by the PI that is approved), the Data Curator sets the Data Product Resource attribute for exclusive_rights_status to "temporary_hold" or "permanent_hold" and, for temporary holds, enters the appropriate exclusive_rights_end_date and exclusive_rights_contact and any needed exclusive_rights_notes.

The data curator defines policies by which data products that are derived from other data products with proprietary holds will as a default inherit the same exclusive_rights_status as the input data product, though the data curator will be able to override this. The user associated with an exclusive rights contact has the authority to make their held data public, but they do not have the authority to create a hold. Similarly, the data curator can define a policy whereby all of the data products from a particular instrument have a proprietary hold.

The data curator searches data products based on exclusive_rights_end_date intervals to find those nearing release. The data curator follows a policy (to be developed) for messaging data owners when the data are nearing their hold end date, for example by sending a pre-defined message to the owner's email when the data are 2 months, 1 month, and 1 week from their exclusive_rights_end_date.
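
The example schedule (2 months, 1 month, and 1 week before the hold ends) can be computed as below. This is a sketch only, since the messaging policy is still to be developed. The helper names are hypothetical, and clamping to the last day of a shorter month is an assumption about how "N months before" should be interpreted.

```python
import calendar
from datetime import date, timedelta


def months_before(day, months):
    """Return the date `months` calendar months earlier, clamped to month end."""
    index = day.year * 12 + (day.month - 1) - months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def reminder_dates(exclusive_rights_end_date):
    """Return the dates on which the data owner would be messaged."""
    return [
        months_before(exclusive_rights_end_date, 2),
        months_before(exclusive_rights_end_date, 1),
        exclusive_rights_end_date - timedelta(weeks=1),
    ]


for reminder in reminder_dates(date(2013, 4, 30)):
    print(reminder.isoformat())
```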

The Integrated Observatory system allows the Data Curator to develop policies and procedures to prevent the accidental distribution of data under proprietary hold. These include:

  • defining which users have the ability to view, download, and subscribe to data under proprietary hold. These permissions must be checked as part of all services that access data, to ensure that proprietary data are never viewed or exported without appropriate permission.
  • creating alerts (e.g. pop-up warning or similar mechanism) when data under proprietary hold are accessed (i.e. through subscription, download, on-screen viewing, or any other exports or transfers out of the system), or when proprietary data are used as inputs to a data process.
  • creating a log of all forms of access to proprietary data that can be viewed by the data curator and other authorized OOI staff. The exact content of the log is TBD, but it will at minimum include the datetime stamp, user name/ID, the ID of the data product, its exclusive_rights_status, and the nature of the access (subscription, download, etc.).
  • optionally creating a message to the data curator when data under proprietary holds are accessed. This will be configured so that messages are not created for PIs accessing their own proprietary data. The data curator will also optionally designate a set of OOI staff for whom messages are not created, so that routine data curation tasks on proprietary data do not create messages.
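
The permission check, access log, and optional curator message described in this list might fit together as follows. This sketch makes several assumptions: the function name, the dictionary layout, and the extra `allowed` log field are invented here; the log content is TBD in this scenario; and the choice to message the curator only for permitted access is one possible reading.

```python
from datetime import datetime, timezone

HOLD_STATUSES = {"temporary_hold", "permanent_hold"}


def record_access(product, user, action, authorized_users, log,
                  no_message_users=frozenset(), now=None):
    """Return (allowed, notify_curator); log every access attempt on held data."""
    status = product["exclusive_rights_status"]
    if status not in HOLD_STATUSES:
        return True, False  # no hold: nothing to log or report
    allowed = user in authorized_users
    log.append({
        "timestamp": now or datetime.now(timezone.utc),
        "user": user,
        "product_id": product["id"],
        "exclusive_rights_status": status,
        "access": action,
        "allowed": allowed,  # assumption: denied attempts are logged too
    })
    is_owner = user == product.get("exclusive_rights_contact")
    notify = allowed and not is_owner and user not in no_message_users
    return allowed, notify


access_log = []
held = {"id": "p1", "exclusive_rights_status": "temporary_hold",
        "exclusive_rights_contact": "pi"}
permitted = {"pi", "staff", "guest"}
print(record_access(held, "pi", "download", permitted, access_log, {"staff"}))
print(record_access(held, "guest", "download", permitted, access_log, {"staff"}))
print(record_access(held, "stranger", "view", permitted, access_log, {"staff"}))
```

The PI accessing their own data and designated staff produce log entries but no message, which corresponds to the exemptions described in the last bullet.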

( ) Manage Data Contributions to National Data Centers

Relevant Use Cases
  • UC.R2.04 Browse to Get Data Product
  • UC.R2.24 Search for Resource
  • UC.R2.26 Navigate Resources and Metadata
  • UC.R2.28 Manage Resource Metadata
  • UC.R2.47 Define Executable Process

OOI is required to submit all data from OOI-owned and -operated instruments to the appropriate national data center (National Oceanographic Data Center (NODC) for water column data, and National Geophysical Data Center (NGDC) for seafloor and sub-bottom data). OOI submits data as quickly as is feasible after collection; NSF policy requires submission within 2 years.

Automated processes support transmitting data to the appropriate data center (e.g. using a defined OOI service, placing data in a designated FTP site). Those processes are developed in collaboration with the data centers. Independent of the data transfer mechanism, the data curator manages the contribution process in the following ways:

  • Define a policy and process(es) for data contribution. This will include:
    • specifying a time interval to trigger an export process (e.g. monthly)
    • criteria for selecting data products to export (e.g. data products where the archive name is null, data curation category is "core" or "ooi science non-core", exclusive rights status is unrestricted, lifecycle status indicates it is not deprecated, and it is of the specified level(s); the processing levels the NDCs are interested in are yet to be determined); and
    • the process to convert data and metadata into the output format, including the creation of a log of all exports to an archive, and the update of the data product archive information attributes to reflect where and when it was sent to archive. The data curator will define this process in a DPS-like document, to be implemented by developers.
  • The data curator can modify the export process directly within the Integrated Observatory, e.g. to adjust the format, or include additional metadata elements as metadata standards change through time.
  • Manually search data based on archive status, archive name, date of archive, as well as other data attributes (processing level, data product name, etc.)
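
The example selection criteria for export can be written as a single filter. This is a minimal sketch: the dictionary keys are hypothetical stand-ins for the data product attributes, and the processing levels passed in are placeholders, since the levels of interest to the data centers are yet to be determined.

```python
EXPORT_CATEGORIES = {"core", "ooi science non-core"}


def select_for_export(products, levels):
    """Return IDs of products that satisfy every example export criterion."""
    return [
        p["id"]
        for p in products
        if p.get("archive_name") is None                    # not yet archived
        and p.get("curation_category") in EXPORT_CATEGORIES
        and p.get("exclusive_rights_status") == "unrestricted"
        and p.get("lifecycle_status") != "deprecated"
        and p.get("level") in levels
    ]


# Illustrative catalog entries.
catalog = [
    {"id": "a", "archive_name": None, "curation_category": "core",
     "exclusive_rights_status": "unrestricted", "lifecycle_status": "current",
     "level": "L1"},
    {"id": "b", "archive_name": "NODC", "curation_category": "core",
     "exclusive_rights_status": "unrestricted", "lifecycle_status": "current",
     "level": "L1"},
    {"id": "c", "archive_name": None, "curation_category": "core",
     "exclusive_rights_status": "temporary_hold", "lifecycle_status": "current",
     "level": "L1"},
    {"id": "d", "archive_name": None, "curation_category": "ooi science non-core",
     "exclusive_rights_status": "unrestricted", "lifecycle_status": "deprecated",
     "level": "L1"},
]
print(select_for_export(catalog, levels={"L1", "L2"}))
```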
