Receive and process data that is an addition to an existing data set.
|Actors||Data Provider, any user (Anonymous Guest)|
|Is Used By||UC.R2.29 Integrate External Dataset|
|Extends||UC.R1.04 Ingest and Describe Data|
|Is Extended By|| |
|In Acceptance Scenarios||AS.R2.01A Operate Marine Observatory, AS.R2.04A Data Product Leads Drive Core Data Product Creation|
|Technical Notes||Most of the basic use case was implemented in R1; this use case extends the available capabilities. Different types of data will need different marshaling algorithms, which will be implemented as part of this use case.|
|Primary Service||Ingestion into Common Data Format|
|UC Status||Mapped + Ready|
This information summarizes the Use Case functionality.
- Identify the "existing data" resource to which the new data are to be integrated.
- Convert the supplement to the canonical ION form.
- Establish and implement ordering (e.g., by time) if appropriate; the supplement may or may not be 'at the end' of the previously received data.
- Update all dynamic metadata associated with the data stream.
- Make the supplement available to subscribers in suitable forms (with metadata as appropriate).
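The steps above can be sketched, very loosely, in Python. All names here (`DataSet`, `ingest_supplement`, the record layout) are hypothetical illustrations, not ION interfaces:

```python
# Hypothetical sketch of supplement ingestion: identify the data set,
# canonicalize, merge in data order, update dynamic metadata.
from dataclasses import dataclass, field

@dataclass
class DataSet:
    dataset_id: str
    records: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

def ingest_supplement(catalog: dict, dataset_id: str, supplement: list) -> DataSet:
    # 1. Identify the existing data set the supplement belongs to.
    ds = catalog[dataset_id]
    # 2. Convert each record to the canonical form (stub: copy as-is).
    canonical = [dict(rec) for rec in supplement]
    # 3. Establish ordering: the supplement may not be 'at the end',
    #    so merge by timestamp rather than appending blindly.
    ds.records = sorted(ds.records + canonical, key=lambda r: r["time"])
    # 4. Update dynamic metadata associated with the stream.
    ds.metadata["last_updated"] = max(r["time"] for r in ds.records)
    ds.metadata["record_count"] = len(ds.records)
    # 5. Making the supplement available to subscribers is out of scope here.
    return ds

catalog = {"ctd-01": DataSet("ctd-01", records=[{"time": 1, "temp": 9.8}])}
ingest_supplement(catalog, "ctd-01", [{"time": 3, "temp": 9.9}, {"time": 2, "temp": 9.7}])
print(catalog["ctd-01"].metadata["last_updated"])  # 3
```

The merge-by-sort in step 3 is the simplest way to honor data order when supplements arrive out of sequence; a production system would use an index instead of re-sorting.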
- A data set already exists that will be supplemented by incoming data.
- The supplement will arrive in a way that makes clear its association with the existing data set. This may be done through metadata. If that is not an option, an agent must derive the association from contextual information (the communication channel, information within the data stream, or another technique) and convey it to the ION system as an explicit relationship.
- If the data provider does not provide the supplement separately, it is possible for ION to determine the part of the data set that is uniquely the supplement, and to ingest only that part of the data.
A data set exists, and a supplement to it is about to be made available to the Integrated Observatory.
- Data Provider or observatory data source provides the Integrated Observatory a data supplement with associated metadata.
- The data supplement is provided to the Integrated Observatory via an internal or external data source. A data source agent may actively detect availability of new data or wait for the data source to trigger new data input.
- Data is published as a series of messages on a stream (internally, organized by 'topic') from an instrument or external source.
- The Integrated Observatory analyzes the metadata in the received supplement and updates the Data Set metadata in its catalogs.
- For instance, update the last-updated time, the available time steps, and the 'area of interest' metadata.
- Dynamic metadata describing this information resource is updated. For example, the depth, latitude, and longitude of a moving platform vary over time, and the stream's metadata should reflect the information resource's entire history.
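Updating dynamic metadata for a moving platform amounts to widening the stream's time range and spatial bounds with each record. A minimal sketch, assuming an illustrative flat record layout (`time`, `lat`, `lon`, `depth`):

```python
# Illustrative only: expand a stream's dynamic metadata (time range and
# spatial/depth bounds) as records from a moving platform arrive.
def update_dynamic_metadata(meta: dict, record: dict) -> dict:
    meta["time_min"] = min(meta.get("time_min", record["time"]), record["time"])
    meta["time_max"] = max(meta.get("time_max", record["time"]), record["time"])
    for axis in ("lat", "lon", "depth"):
        lo, hi = axis + "_min", axis + "_max"
        meta[lo] = min(meta.get(lo, record[axis]), record[axis])
        meta[hi] = max(meta.get(hi, record[axis]), record[axis])
    return meta

meta = {}
update_dynamic_metadata(meta, {"time": 100, "lat": 44.6, "lon": -124.1, "depth": 5.0})
update_dynamic_metadata(meta, {"time": 200, "lat": 44.7, "lon": -124.3, "depth": 25.0})
print(meta["depth_max"])  # 25.0
```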
- <2> Duplicate data packets must be detected and marked.
- Duplicates are indicated by identical timestamp and content hashes for the data record. (Identical content hash implies identical record size and byte-wise content.)
- Duplicates can arise from transmission duplication (intentional and accidental), and from operator duplication (typically accidental) on upload of artifacts.
- Later releases will manage the duplicated data to optimize the users' experience.
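The duplicate rule above — identical timestamp plus identical content hash — can be sketched as follows. The record layout and the choice of SHA-256 over a canonical JSON serialization are assumptions for illustration:

```python
import hashlib
import json

# Sketch: a duplicate is an identical (timestamp, content-hash) pair.
def record_key(record: dict) -> tuple:
    payload = json.dumps(record["data"], sort_keys=True).encode()
    return (record["time"], hashlib.sha256(payload).hexdigest())

def mark_duplicates(records: list) -> list:
    seen = set()
    for rec in records:
        key = record_key(rec)
        rec["duplicate"] = key in seen  # mark, don't drop: later releases
        seen.add(key)                   # decide how to manage duplicates
    return records

recs = mark_duplicates([
    {"time": 1, "data": {"temp": 9.8}},
    {"time": 1, "data": {"temp": 9.8}},  # e.g., a retransmission
    {"time": 2, "data": {"temp": 9.9}},
])
print([r["duplicate"] for r in recs])  # [False, True, False]
```

Marking rather than dropping matches the note that later releases will manage the duplicated data.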
- Users may view a supplement that has been published to the stream, per policy.
- In most cases the data supplement, and corresponding data arrival messages, are presented as user-visible data products, allowing them to reach subscribers or other external viewers.
- Listeners subscribed to the data supplements receive them directly as they arrive. For some data streams, this can occur
- The Integrated Observatory may persist the data supplement
- This is based on policy: Information from non-OOI systems need not be stored, only cached for a period
- Persistence is maintained in association with the Data Set
- If data are not persisted, the Integrated Observatory will retain the information necessary to retrieve the data if needed in the future, if that is supported by the Data Source
- The Integrated Observatory tracks the received supplements, and supports their navigation by browsing, including finding the first, previous, next, and last (current) supplements of a resource.
- In 'data order', "current supplement" refers to the last supplement of the series ordered by time, not necessarily the most recently received (because supplements may arrive out of order).
- In 'arrival order', "current supplement" refers to the most recently received supplement.
- Ideally, both views can be supported (it will be necessary to navigate data by both arrival order, and by 'natural' data order). In Release 2, supporting just one is sufficient (data order preferred, either is OK).
- <2> Historical results can be requested and provided 'by supplement'.
- Note this implies keeping track of the granule that formed each supplement, in order to provide them as they were originally provided.
- Even if this user capability isn't supported in Release 2, keeping track is necessary to support it later.
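Tracking supplements by granule, as described above, is what makes both navigation orders possible. A minimal sketch (class and field names are illustrative, not ION APIs):

```python
# Sketch: keep each granule as received, so supplements can be navigated
# in arrival order or in data (time) order, and replayed 'by supplement'.
class SupplementLog:
    def __init__(self):
        self._arrival = []  # granules, in the order received

    def record(self, granule: dict) -> None:
        self._arrival.append(granule)

    def in_arrival_order(self) -> list:
        return list(self._arrival)

    def in_data_order(self) -> list:
        return sorted(self._arrival, key=lambda g: g["start_time"])

    def current(self, order: str = "data") -> dict:
        seq = self.in_data_order() if order == "data" else self.in_arrival_order()
        return seq[-1] if seq else None

log = SupplementLog()
log.record({"granule_id": "g1", "start_time": 10})
log.record({"granule_id": "g3", "start_time": 30})
log.record({"granule_id": "g2", "start_time": 20})  # arrives out of order
print(log.current("data")["granule_id"])     # g3
print(log.current("arrival")["granule_id"])  # g2
```

The two `current` answers differ exactly when supplements arrive out of order, which is why the text distinguishes the two views.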
The supplement has been ingested, categorized, presented to users, and persisted, as appropriate.
These comments provide additional context (usually quite technical) for editors of the use case.
Not all of the following subtleties will be addressed in Release 2, but they are important to understand in designing the Release 2 system.
The fundamental definition of what makes a given supplement part of an existing data set, or the beginning of a new data set, has not been established. The Product Manager suggests the following characteristics are potential keys to uniqueness, and all of them work for both sensor and observing system data providers:
- data provider identifier (sensor/instrument, or system identifier for observatories)
- deployment identifier of data provider, including parent system (each time a sensor is put on a different platform, it is producing a new data set; if the sensor is removed from a platform and reinstalled, whether a new data set results is ideally at the discretion of the operator); for observing systems the equivalent would be organizational parent
- data record type identifier — an instrument may produce different data records (even intermixed), to include raw data records, summary data records, errors, and messages; these should not be lumped together
- data record version (if the data provider changes the content of the record, and now it doesn't contain 4 of the original parameters, that's a different data stream)
- data format identifier (if it isn't already incorporated in the data record type identifier or data record version); if the data format changes, even if it's the same values represented, the changed format has to be associated explicitly with the corresponding data records, to guarantee proper automated handling of the inputs
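The five candidate characteristics can be expressed as a compound identity key: two supplements belong to the same data set exactly when their keys are equal. The field names below are illustrative, not an ION schema:

```python
from dataclasses import dataclass

# Hypothetical identity key for a data set, built from the five
# candidate uniqueness characteristics listed above.
@dataclass(frozen=True)
class DataSetKey:
    provider_id: str      # sensor/instrument, or system id for observatories
    deployment_id: str    # includes parent platform / organizational parent
    record_type: str      # raw, summary, error, message, ...
    record_version: str   # changes when the record content changes
    data_format: str      # explicit, if not implied by type/version

a = DataSetKey("ctd-01", "dep-2012A", "raw", "v1", "binary-le")
b = DataSetKey("ctd-01", "dep-2012A", "raw", "v2", "binary-le")  # new record version
print(a == b)  # False: a new data set begins
```

Any change to any field — a new deployment, a new record version, a new format — starts a new data set, which is precisely the operator escape hatch the next paragraph describes.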
These criteria provide the opportunity for an instrument operator or provider to declare new data sets at any time, by creating a new data record version and corresponding data format identifier, or by declaring a new deployment. (Essentially, either is a statement that the new data are to be considered different from the previous data.)
In some cases, changes to an instrument or observing system may suggest the new data should be in a different data set. Examples to consider include changes to temporal frequency, major configurations, and even calibrations. The data provider/instrument operator has the option to force the data into a different data set as described above. At the same time, the system must be able to handle these types of events — particularly calibrations — when they occur during a deployment and within a data set. So there is no single rule that will be ironclad for all system operations.
Finally, over time it will be useful to construct a virtual data set from two existing data sets. A simple example is 'stitching' together data from two instruments that occupied the same observation role. Another arises when the provider organization (and possibly the service) for external data changes but the actual data does not; it would be useful to 'stitch' the before and after data sets together. A slightly more advanced feature, declaring a composed data set from sections of other data sets, is also quite valuable. Note that making these connections by declaration of references (i.e., indexing), not by transformation, provides the most value.
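A virtual data set built by reference can be sketched as a list of index entries into existing data sets; the source data is never copied or transformed. Class names and the index-range section bounds are assumptions for illustration:

```python
# Sketch: a virtual data set declared as references (index entries)
# into existing data sets, resolved lazily when read.
class VirtualDataSet:
    def __init__(self, name: str):
        self.name = name
        self.sections = []  # (source data set, start index, stop index)

    def add_section(self, source: dict, start: int, stop: int) -> None:
        self.sections.append((source, start, stop))

    def records(self):
        # Resolve references on read; the source data never moves.
        for source, start, stop in self.sections:
            yield from source["records"][start:stop]

before = {"records": [{"t": 1}, {"t": 2}]}   # e.g., old provider's data set
after = {"records": [{"t": 3}, {"t": 4}]}    # e.g., new provider's data set
stitched = VirtualDataSet("stitched-role-A")
stitched.add_section(before, 0, 2)
stitched.add_section(after, 0, 2)
print([r["t"] for r in stitched.records()])  # [1, 2, 3, 4]
```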