This page describes details related to OOINet data processing. See the parent page for context.

Data Processing Overview

The figure below shows a detailed overview of the OOI marine integration with data flows, including data processing.

Figure 1. OOI Data Processing Data Flow (OV-1)

See also:

Concepts

Resource Domain Model

The figure below shows a model of the resource types and their association relationships as relevant for data product generation. See below for details about these resources.

Figure 2. Resource Domain Model (OV-7)

Related Resources

  • Device (InstrumentDevice, PlatformDevice): one physical resource (device), such as an instrument or a platform
    • AKA "asset" in marine op speak
    • association to model exists
  • Site: base type for
    • InstrumentSite, PlatformSite: named location of a device of a certain model
      • association to model exists
      • top level platform site AKA "station" in marine op speak
      • lower level sites AKA or related to "port" in marine op speak
    • Subsite, Observatory: geospatial location (area) associated with Marine IO resources
      • AKA "array", "site", "subsite" in marine op speak
  • Model (InstrumentModel, PlatformModel): a distinguishable type or class of device controller
    • not fully equivalent to "make/model" in user speak
  • Agent (InstrumentAgent, PlatformAgent): a type (or version) of agent/driver suitable for a certain model
    • association to model exists
  • AgentInstance: configuration parameters for a future or current executing agent
    • association to agent (type) and device exists
  • Deployment: references one site and one device with additional temporal constraints and attributes (such as which port on a platform)
    • associations to Site and Device exist
    • specialized by type of deployment
  • DataProduct: a searchable, downloadable volume of data that can be consumed as a real-time stream, via a replay stream, or via retrieve or download. As with all resources, it has unique characteristics described in the form of attributes, and an owner
  • DataProcessDefinition: defines transform function and constraints on possible input streams and describes produced output streams
    • association to ProcessDefinition exists
  • DataProcess: links input streams (subscriptions), and output data product. Provides name binding (input stream parameter names to names used in transform function)
    • association to DataProcessDefinition exists
    • may be public and always producing on behalf of the system, or on-demand by a user for a real-time subscription
  • Dataset: represents a coverage containing parameter contexts and values with its identifying attributes
  • ParameterContext: definition of a parameter (e.g. science variable) with its type and additional attributes
  • ParameterDictionary: a collection of ParameterContext
  • Stream: a named "channel" of messages that producers send messages to and consumers can subscribe to
  • StreamDefinition: defines the contents of the stream as granules using a ParameterDictionary and additional attributes
  • DataProducer: resource associated with a Device, external DataSource, DataProcess,
  • ParameterFunction: represents an expression with free variables that can be evaluated in a named environment (supported: Python, Matlab) and may depend on free variables resolved by associated ParameterFunction
  • TransformFunction: represents the expression applied within a DataProcess, e.g. a snippet of Python code
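
The name binding done by a DataProcess and the chained evaluation of ParameterFunctions can be illustrated with a small sketch. Everything below is a hypothetical stand-in written for this page: the class layout, the `bind_names` helper, and the parameter names are assumptions, not the OOINet implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ParameterFunction:
    """Hypothetical stand-in: an expression with named free variables."""
    name: str
    func: Callable[..., float]
    args: List[str]  # names of the free variables, in call order
    depends_on: Dict[str, "ParameterFunction"] = field(default_factory=dict)

    def evaluate(self, values: Dict[str, float]) -> float:
        resolved = []
        for arg in self.args:
            if arg in self.depends_on:
                # Free variable resolved by an associated ParameterFunction.
                resolved.append(self.depends_on[arg].evaluate(values))
            else:
                # Free variable supplied directly by a bound parameter value.
                resolved.append(values[arg])
        return self.func(*resolved)


def bind_names(granule: Dict[str, float], binding: Dict[str, str]) -> Dict[str, float]:
    """Map stream parameter names to the names used by the function."""
    return {func_name: granule[stream_name]
            for stream_name, func_name in binding.items()}


# Illustrative expressions only; these are not real OOI algorithms.
celsius = ParameterFunction("celsius", lambda counts: counts / 100.0, ["counts"])
kelvin = ParameterFunction("kelvin", lambda c: c + 273.15, ["c"], {"c": celsius})

granule = {"temp_counts": 1250.0}          # one value from an input stream
values = bind_names(granule, {"temp_counts": "counts"})
result = kelvin.evaluate(values)
```

Here `kelvin` has one free variable, `c`, which is resolved by the associated `celsius` function, while `celsius` takes its input from the renamed stream parameter.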

Related Services

Design

Design Assumptions

  • Instrument agent produces granules for one device on 1 raw, 1 parsed and 0..n engineering streams specific to the device as given by StreamConfiguration
  • Dataset agent produces granules for available device files (according to IDD agreements) on 1 parsed stream per device
  • One StreamDefinition for each agent instance stream, with a ParameterDictionary
  • One SimplexCoverage or ComplexCoverage per stream, e.g. for raw, parsed and engineering streams, following the StreamDefinition
  • One Dataset resource per SimplexCoverage or ComplexCoverage
  • One StreamDefinition for each L0, L1, L2 DataProduct referencing the parsed ParameterDictionary and a filter of parameters
  • A ViewCoverage is created dynamically for L0, L1, L2 DataProducts as defined in the DataProduct's StreamDefinition and referencing the parsed Dataset (coverage)
  • For every DataProduct, a real-time Stream may exist
    • If existing, streams typically provide content for exactly ONE data product
  • For every data stream containing independent parameters (e.g. device raw, parsed streams) a dedicated ingestion worker exists if metadata/data persistence is on
    • Ingestion workers write exclusively to a coverage
    • can be exclusive for multiple coverages to reduce the number of workers (TBD: manually or automatically assigned)
    • can receive granules from multiple streams (e.g. device real-time stream, recovery stream)
  • The coverage for a stream is determined via the lookup path Stream -> DataProduct -> Dataset = Coverage
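
The lookup path in the last assumption can be sketched as two association lookups. The tables and ids below are invented for illustration; in the real system these would presumably be associations in the resource registry.

```python
from typing import Dict, Optional

# Hypothetical association tables keyed by resource id.
STREAM_TO_DATA_PRODUCT: Dict[str, str] = {
    "stream_parsed_realtime": "dp_parsed",
    "stream_parsed_recovery": "dp_parsed",
    "stream_raw": "dp_raw",
}
DATA_PRODUCT_TO_DATASET: Dict[str, str] = {
    "dp_parsed": "dataset_parsed",
    "dp_raw": "dataset_raw",
}


def coverage_for_stream(stream_id: str) -> Optional[str]:
    """Follow Stream -> DataProduct -> Dataset; the Dataset identifies the coverage."""
    data_product_id = STREAM_TO_DATA_PRODUCT.get(stream_id)
    if data_product_id is None:
        return None
    return DATA_PRODUCT_TO_DATASET.get(data_product_id)
```

In this sketch the real-time and recovery streams resolve to the same dataset, which matches the assumption that one ingestion worker can receive granules from multiple streams and write them to one coverage.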

Design Decisions

  • One raw and 1..n parsed DataProducts per device
  • One raw and 1..n parsed streams for real-time data for capable devices
    • Raw only if instrument agent produces granules, not a data agent parsing IDD data files
  • One parsed stream for data recovery (via CG IDD) per device
  • One DataProduct per InstrumentSite x DPS x level as given in SAF (containing a, b, c sublevels as parameter sets)
    • Similar for PlatformSite with engineering products and derived products, but not from SAF
    • From SAF "Data QC Lookup Tables (With DP Levels)" report, e.g. "DOCONCS (L0) for CE01ISSM-MF004-01-DOSTAD999" for an InstrumentSite
    • Sublevel A contains: Science variables (parameters) together with domain parameters (timestamps, lat/lon)
    • Sublevel B contains: Sublevel A parameters and automated QC flags for all timesteps where science values are available
      • QC flags represented as one compound parameter with atom params for each flag
    • Sublevel C contains: Sublevel A and B parameters and manual QC flags for all timesteps where science values are available (some manual QC flags may not exist)
  • Provide the selection of A, B, C at time of download, not as separate DataProducts
  • One Simplex/Complex Coverage per Deployment (=one device, one set of geospatial/temporal constraints) for parsed "dataset" - one Dataset resource
    • One more for raw (unless it can be represented in the first)
    • One more for stream coverage (unless there is another way of persisting stream content)
    • Q: if the stream is the same across Deployments for one device, how to determine the target deployment (think latency!)
  • For one Site there is one DataProduct (for each DPS - level combination), retrieve the coverage metadata (domain, parameters) and values as a list for each primary deployment
  • Give up site multiplex transform
  • Don't execute data processes for data products that are not immediately needed
  • DataProcesses still used for matplotlib images
  • For every DataProduct without a public real-time stream, a dedicated stream can be obtained using a (complex) Subscription
    • internally a dedicated DataProcess (transform) is spawned and terminated
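
Selecting sublevel A, B or C at download time, instead of keeping separate DataProducts, amounts to filtering one parameter set. The sketch below assumes invented parameter names and a hypothetical `select_parameters` helper; it only illustrates the cumulative A, B, C rule described above.

```python
from typing import Dict, List

# Hypothetical parameter names, grouped by role.
PARAMETER_GROUPS: Dict[str, List[str]] = {
    "domain": ["time", "lat", "lon"],
    "science": ["oxygen_concentration"],
    "auto_qc": ["oxygen_concentration_qc_auto"],
    "manual_qc": ["oxygen_concentration_qc_manual"],
}

# Each sublevel includes everything from the previous one.
SUBLEVEL_GROUPS: Dict[str, List[str]] = {
    "A": ["domain", "science"],
    "B": ["domain", "science", "auto_qc"],
    "C": ["domain", "science", "auto_qc", "manual_qc"],
}


def select_parameters(sublevel: str, available: List[str]) -> List[str]:
    """Return the parameters to include in a download for one sublevel."""
    wanted = [name
              for group in SUBLEVEL_GROUPS[sublevel]
              for name in PARAMETER_GROUPS[group]]
    # Some manual QC flags may not exist, so keep only what the coverage holds.
    return [name for name in wanted if name in available]
```

If the coverage holds no manual QC flags yet, a sublevel C request in this sketch returns the same parameters as sublevel B.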

Simplified Design for OOINet Release 2

Figure 3. Simplifications in resource and coverage dependencies for first site deployment

Data Process Management

The Data Process Management Service (DPMS) orchestrates process definition and management at the application level. The DPMS define_data_process method will receive the data product(s) that are inputs to the algorithm, the data product(s) that will store the output of the process, and the script that is the process code. This information will be persisted in a data process resource and relayed to the Data Transform Management Service (DTMS). The DTMS, in the Data Management subsystem, will orchestrate with the PubSub Management Service and with the Process Management Service in the CEI subsystem to host the script in an executable environment, connect the input and output streams, and perform other low-level configuration to enable the process to execute.

The Data Process Management Service will manage external facing characteristics of the data process. Details such as when an individual process was executed or other process metrics will be requested from the Data Transform Management Service via DPMS. If the definition of the data process is modified, the existing processes and subscribers to the data process events are notified.
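
The registration flow described above (receive inputs, outputs and script; persist a data process resource; relay to the DTMS) can be sketched in a few lines. Only the method name define_data_process comes from this page; the classes, the in-memory registry, and the `host` call are hypothetical placeholders, not the actual service interfaces.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataProcessResource:
    """Hypothetical record of what define_data_process receives."""
    process_id: str
    input_products: List[str]
    output_products: List[str]
    script: str


class TransformManagerStub:
    """Hypothetical stand-in for the DTMS; it only records relayed processes."""

    def __init__(self) -> None:
        self.hosted: List[str] = []

    def host(self, resource: DataProcessResource) -> None:
        self.hosted.append(resource.process_id)


class DataProcessManager:
    """Hypothetical stand-in for the DPMS with an in-memory registry."""

    def __init__(self, dtms: TransformManagerStub) -> None:
        self.dtms = dtms
        self.registry: Dict[str, DataProcessResource] = {}

    def define_data_process(self, input_products: List[str],
                            output_products: List[str], script: str) -> str:
        if not input_products or not output_products:
            raise ValueError("input and output data products are required")
        resource = DataProcessResource(uuid.uuid4().hex, list(input_products),
                                       list(output_products), script)
        self.registry[resource.process_id] = resource  # persist the resource
        self.dtms.host(resource)                       # relay to the DTMS
        return resource.process_id
```

The split mirrors the text: the manager keeps the external-facing record, and the stub stands for the lower-level hosting and stream wiring, which this sketch does not model.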

Figure 4. Data Process Registration (OV-6)

The figure below shows the flow and responsibilities across subsystems to terminate a data process.

Figure 5. Data Process Termination (OV-6)

Data Process Associations

Figure 6. Data Process Associations

Resource Life Cycle

Data Process Definition

Assumptions or Order of Events

  1. Preloaded DPDs are qualified and set immediately to Deployed/Available
  2. Data Process Definitions are in Planned while the scientist is creating the algorithm, the developer is coding, and the DPS spec is being reviewed.
  3. After the DPS is reviewed and the code is completed, the DPD moves to Developed
    1. Attachments may include signed DPS review spec
  4. After the DPD passes QA tests the DPD moves to Integrated
    1. Attachments should include test results or logs
  5. When the DPS egg is loaded into the system via the DataProcessMgmtSvc:register_process op, the DPD moves to Deployed
    1. source code is available in the resource
Data Process Def Private Discoverable Available
Draft
  • Resource exists in the RR
   
Planned
  • Resource exists in the RR
   
Developed
  • All of the above
  • Critical attributes are defined to fully characterize the agent
  • May be assoc with signed DPS review spec attachment
   
Integrated
  • All of the above
  • May be assoc with QA test results
  • recommend move to Discoverable
out:
  • not assoc with a Data Process
 
Deployed
  • All of the above
  • assoc with certification results attachment (logs, etc) disabled for beta testing
  • assoc with manifest file attachment
  • DPSimpl has been deployed into CI system via IMS:register_driver (egg is available)
  • recommend move to available
  out: - error if assoc with data process
Retired
  • no assoc with data process
  Must transition out of Available
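
The maturity states listed above can be read as a simple ordered state machine. The sketch below is one possible reading, not the system's lifecycle code: it assumes a strictly linear order and interprets the "out" notes as a guard that a Deployed definition cannot be retired while it is still associated with a data process.

```python
from enum import Enum


class LifeCycleState(Enum):
    DRAFT = "Draft"
    PLANNED = "Planned"
    DEVELOPED = "Developed"
    INTEGRATED = "Integrated"
    DEPLOYED = "Deployed"
    RETIRED = "Retired"


# Assumed linear order of maturity states.
ORDER = list(LifeCycleState)


def advance(state: LifeCycleState, has_data_process: bool = False) -> LifeCycleState:
    """Move a Data Process Definition to the next maturity state."""
    index = ORDER.index(state)
    if index == len(ORDER) - 1:
        raise ValueError("Retired is the final state")
    if state is LifeCycleState.DEPLOYED and has_data_process:
        # Assumed guard: error if still associated with a data process.
        raise ValueError("cannot retire while associated with a data process")
    return ORDER[index + 1]
```

The visibility columns (Private, Discoverable, Available) are left out of this sketch; they would be a second, independent dimension of the resource state.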
Current Impl Logic
  • First, the InstModel and InstAgent are defined
  • A DataProducer is registered with DataAcquisitionMgmtSvc via register_instrument, register_process or register_external_dataset. This creates the DataProducer object and associates it with the producer resource.
    • producer may not be deployed yet so DataProducer rsrc should be set to same LCS as the producer
  • Parsed and raw data products are defined
  • assign_data_product is called to connect the raw and parsed data products to the instruments
  • Data Process Definitions (DPDefs) are created, and the output Data Products for each DPDef are created
    • Streams are created when the Data Products are created
    • the Data Products are set to 'persisted' so Data Sets are created to persist the Streams.
  • Data Processes are created; the create operation takes the input and output Data Products
    • create processing sets up Data Producers

Background Information

Transform Functions and Data Processes

Figure 7. Background: Transform Functions and Data Processes

References
