Introduction

Ocean Observatories Initiative Overview

The following presentation provides an overview of OOI and describes how Dataset Drivers fit into the code infrastructure.

https://confluence.oceanobservatories.org/download/attachments/37502850/DatasetDriverInit2.pptx?version=1&modificationDate=1391711915101

Accounts

Confluence/LDAP and dev mailing list

If you have not received a login as part of the sign-up process, send email to helpdesk-at-oceanobservatories.org to get one. This will give you access to the Confluence wiki, Jira, FishEye, and the other Atlassian tools via an LDAP server.

IRC/XMPP chat

We run an XMPP/Jabber chat server (an ejabberd daemon), connected to our LDAP server. You should be able to join as soon as you have a Confluence login. Instructions for clients are at Jabber client setup.

Mac users should use Messages or Adium. On Windows, Pidgin is a very usable client.

VPN

If you will be running from a virtual machine, you need VPN access to reach it. If you have not received this and need it, send an e-mail to helpdesk@oceanobservatories.org and an RSA token will be sent to you.

GitHub

You will need a GitHub account because the MI code repository is hosted there. You can get a free account online at github.com.

Local Install

VPN / VM Installs

If you are using a virtual machine, you will need to install GlobalProtect; see instructions: VPN Installation Instructions.

To access the virtual machine, you will need to install an SSH client. PuTTY is a good option, but any SSH client will work.

Set up code editor

PyCharm 2 is the preferred IDE for the Mac environment. See Working with Pycharm.

All of the following editors are currently in use, and are all acceptable:

  • Vim
  • Komodo. Note there is also a free editor-only version of Komodo. Use whatever makes you happy.
  • Emacs (Python Tutorial)
  • Netbeans (Python Tutorial)
  • Eclipse (see Eclipse as a Python IDE by BrianF)
  • Geany (X11, GTK2, cross-platform).
  • Wing IDE is pretty clean, snappy, X11-based, and cross-platform, with a decent debugger...but it is Python-only.
  • PyCharm, which has Vim/Emacs modes for those who are so inclined. The group now has an open-source license for PyCharm, so it's free to use.
    • The PyCharm license is located on this restricted-access page: Pycharm License
    • See Working with Pycharm
    • If you need access to the license but are restricted from the page, contact any developer for assistance

Setup for Virtual Machine

First, SSH into your virtual machine, which will provide you with a Linux terminal for your VM. Follow the instructions below to set up your VM.

Prepare Directory Structure

You will need one directory for your code and one for your virtual environments. A virtual environment is not the same thing as a virtual machine; it is a place to store a particular environment setup.

$ mkdir Workspace

$ cd Workspace

$ mkdir code

GIT Setup

  • Generate an SSH key on your VM using

$ ssh-keygen -t dsa

Do not enter a filename; just hit enter when it asks for a file in which to save the key. GitHub expects the default file name.

  • Edit ~/.gitconfig and set your full name (such as John Doe - please use your full name and proper capitalization) and email address. This file may not exist; create it if it doesn't.

        [user]
            name = <your name>
            email = <your email>
        [color]
            branch = auto
            diff = auto
            interactive = auto
            status = auto

  • Upload your SSH public key (located at ~/.ssh/id_dsa.pub) following instructions on GitHub, if you have not done so.
  • You are now set to fork any repo at the OOI-CI source code repositories on GitHub. You can submit a pull request when you have a code delivery. The fork and pull request model is preferred so your code can be properly reviewed by your subsystem lead before being merged.

Fork the Marine Integrations Repository

Log into github.com and browse to the marine-integrations repository.  Then click on the fork button in the top right corner.  

You can check that this worked by verifying you can navigate to github.com/<your git user name>/marine-integrations.

Download Marine Integrations Repository

$ cd ~/Workspace/code

# Check out your fork read/write

$ git clone git@github.com:<yourgituser>/marine-integrations.git

# the next step must be in the repository directory

$ cd marine-integrations

# Add the upstream feed for the master repository

$ git remote add upstream git://github.com/ooici/marine-integrations.git

Initialize the Environment

setuptools and numpy may already be installed on your virtual machine; you can try skipping those steps.

This setup will take a while to run, approximately 20 minutes.

$ cd ~/Workspace/code/marine-integrations

$ pip install -U setuptools==0.8

$ pip install numpy==1.7.1

Start New Dataset Driver

Locate the IDD for your dataset driver and make sure you understand it.  Then you will need to create new files for a parser, parser test, driver, and test driver.

To start on the parser, go to the mi/dataset/parser directory and create a new parser file with the same name as the IDD, all lower case and separated by underscores (the capitalization may differ in the IDD name itself).

To test your parser, make a new file in the mi/dataset/parser/test directory with the same name as your parser file, prepended with 'test_'.
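
For example, using the ctdpf_ckl_wfp_stc identifier from the Driver Path section below as a sample name, the files might be created as follows:

$ cd ~/Workspace/code/marine-integrations

$ touch mi/dataset/parser/ctdpf_ckl_wfp_stc.py

$ touch mi/dataset/parser/test/test_ctdpf_ckl_wfp_stc.py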

Driver Path

The driver path describes the top directory structure where your driver will be generated. We will follow the convention of <Instrument and Series>/<Path To Data>.

For instance, for a driver with the identifier ctdpf_ckl_wfp_stc, ctdpf_ckl is the instrument and series and wfp_stc is the path to the data, so the entered driver path would be: ctdpf_ckl/wfp_stc.

GIT Commands and Branching

See the GIT Help Page

Submodule Updates

Occasionally things will change in the extern submodules that MI links to, and the environment will need to be rebuilt. This is done by running the following:

$ git submodule update

$ python bootstrap.py

$ bin/buildout

$ bin/generate_interfaces

Configure Logging

There are several levels of logging messages which are output from MI code.  By default, the MI logger is set to INFO.  If you want to see debug messages, you will need to change the logging level by editing the file res/config/mi-logging.yml, which is a symbolic link to extern/ion-definitions/res/config/mi-logging.yml.

The following levels, listed in order from the least to the most verbose output, are available:

  • CRITICAL
  • ERROR
  • WARNING
  • INFO
  • DEBUG
  • TRACE

DO NOT COMMIT YOUR CHANGES TO RES/CONFIG/MI-LOGGING.YML 

Testing

See the Dataset Driver Testing Tutorial.

Timestamps

Each data particle has an internal timestamp associated with it, which is specified separately from any timestamp fields returned in the data. The internal timestamp of the data particle must be in NTP64 format. The Python library ntplib can be used to do conversions to NTP64 format. One common conversion is from seconds since Jan 1, 1970 (UTC) to NTP64, which can be done with the following command:

ntplib.system_to_ntp_time(time_in_seconds_UTC)

The internal timestamp can be passed in as the last argument to _extract_sample in the parser if needed; however, it is preferably set using the set_internal_timestamp method in the particle class as part of _build_parsed_values.
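
As a minimal sketch (the record format and timestamp field here are hypothetical; ntplib.system_to_ntp_time is the real library call):

    import calendar
    import time

    import ntplib

    # Hypothetical ASCII timestamp field pulled from a data record
    raw_time = "2014/01/02 03:04:05"

    # Seconds since Jan 1, 1970 (UTC); calendar.timegm avoids local-timezone skew
    utc_seconds = calendar.timegm(time.strptime(raw_time, "%Y/%m/%d %H:%M:%S"))

    # Convert UTC seconds to NTP64 for the particle's internal timestamp
    ntp_timestamp = ntplib.system_to_ntp_time(utc_seconds)

    # Within _build_parsed_values this would typically be applied with:
    #     self.set_internal_timestamp(timestamp=ntp_timestamp)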

Data Types

In the dataset driver we are working in Python, which has only four numerical data types: int, long, float, and complex. In the data particles, though, parameters are defined with numpy data types, which can specify the number of bytes associated with each data type. The following chart gives the appropriate Python data type to encode your parameters, based on the numpy data type (as defined in the data particle parameters).

numpy data type   min                                         max                                         python data type
int8              -128                                        127                                         int
uint8             0                                           255                                         int
int16             -32768                                      32767                                       int
uint16            0                                           65535                                       int
int32             -2147483648                                 2147483647                                  int
uint32            0                                           4294967295                                  int
int64             -9223372036854775808                        9223372036854775807                         int
uint64            0                                           18446744073709551615                        long
float32           6 significant decimal digits precision      9 significant decimal digits precision      float
float64           15 significant decimal digits precision     17 significant decimal digits precision     float

The conversion from Python to numpy data types is done later, outside of the dataset driver.
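
For example, a minimal sketch of encoding per the chart (the raw field values are hypothetical):

    raw_count = "18446744073709551615"   # a uint64 parameter per the chart above
    encoded_count = long(raw_count)      # uint64 values must be encoded as long

    raw_flag = "255"                     # a uint8 parameter
    encoded_flag = int(raw_flag)         # the narrower integer types fit in int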

Exception Handling Info

Overview

Sample exceptions in dataset agents generally occur in the parser or particle class in the dataset driver. These exceptions are translated to a ResourceAgentErrorEvent, which is published to the dataset agent. When one of these sample exceptions is caught there are two options in the main parsing thread: stop the file processing if we can no longer confidently generate reliable data particles, or continue trying to parse records if we can still do so with high confidence of valid data. It is also possible to get encoding errors when creating particles. By default, any unhandled SampleException in the parser will kill the parser thread and raise a ResourceAgentErrorEvent to the agent with the exception string as the payload.

A SampleException should be raised any time there is a reasonable chance that the data we are parsing are not the correct blocks of data that we should be parsing. It is important to note this is not data validation; it is used to ensure that the bytes we are parsing are framed correctly.

Recoverable Exceptions

In the parser thread it is quite likely that you will want to continue when a SampleException is raised in the parser. The driver writer must explicitly catch the error and call the _sample_exception_callback method. A warning message should also be logged.
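
A minimal sketch of that pattern (the surrounding parse loop and the parse_chunk helper are hypothetical; SampleException comes from mi.core.exceptions, and log is the module-level logger MI parsers typically define):

    try:
        particle = self.parse_chunk(chunk)     # hypothetical per-record helper
        result_particles.append(particle)
    except SampleException as e:
        # Recoverable: log a warning, report through the callback, keep parsing
        log.warn("Recoverable sample exception: %s", e)
        self._sample_exception_callback(e)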

Non-Recoverable Exceptions

This is the default behavior when sample exceptions are raised. The driver base class catches the exception and raises the ResourceAgentErrorEvent to the agent.

When we can no longer generate particles reliably the exception should be non-recoverable.

Validation Points / Actions

There are a few standard validation points where SampleExceptions may be raised. This could happen when validating the file, record or parameter. Unless these are explicitly handled in the parser they will kill the parser thread when raised.

Validation in this case is simply used to ensure we have confidence that what we are parsing is the correct blocks of data. It is NOT data validation. These validation sequences should be the minimum required to ensure high confidence that the data is framed properly. This is particularly relevant with binary files, but also applies to text files.

These validation schemes should ultimately be described in the IDD for the dataset agent.

File Validation

File validation can be performed prior to starting record iteration. This step could verify file checksums, do byte counts, verify file headers, etc. It is recommended that this only be used when there is no way to validate individual records, as this is generally a non-recoverable exception.

Record Validation

This exception can be either recoverable or non-recoverable, depending on the data file. Record validation should only contain the minimum validation required to ensure we are parsing the correct bytes of data. If we are uncertain we are looking at a valid data record block, then a SampleException should be raised and the block should not be parsed, since we are fairly certain the data would not parse validly.

For example, assume we have a binary block that is 10 bytes long, begins with the sentinel sequence 0xAABB, and ends with a two-byte checksum. Our validation scheme might be: are the first two bytes correct, and does the block pass the checksum? If not, raise an exception.
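
A minimal sketch of that check (the checksum algorithm here is hypothetical; a real driver would use whatever the IDD specifies):

    import struct

    SENTINEL = b'\xaa\xbb'

    def validate_record(record):
        # Framing check only, not data validation: correct length,
        # correct sentinel bytes, and a passing checksum
        if len(record) != 10 or record[0:2] != SENTINEL:
            return False
        expected = struct.unpack('>H', record[8:10])[0]
        # Hypothetical checksum: 16-bit sum of the first 8 bytes
        return (sum(bytearray(record[0:8])) & 0xFFFF) == expected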

Parameter Encoding

It is best to use the _encode_value method when encoding parameters. This will handle all the exception handling for us automatically. Otherwise you should raise a SampleEncodingException if you detect an exception when encoding.

By default these encoding exceptions are recoverable, and this behavior is already built into the base driver class. Values that fail encoding will be added to the particle as None.
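
A minimal sketch of a _build_parsed_values method using _encode_value (the parameter names and DATA_REGEX are hypothetical):

    def _build_parsed_values(self):
        # DATA_REGEX is a hypothetical module-level regex matching one record
        match = DATA_REGEX.match(self.raw_data)
        if not match:
            # SampleException comes from mi.core.exceptions
            raise SampleException("No regex match for data: %r" % self.raw_data)
        return [
            self._encode_value('sample_count', match.group(1), int),
            self._encode_value('temperature', match.group(2), float),
        ]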

Types of Exceptions

SampleException

Base class for all SampleExceptions. All sample exceptions are caught in the base driver class and raised as ResourceAgentErrorEvents. This exception can be raised from the parser class or the particle class when the error is non-recoverable.

UnexpectedDataException

When you detect unexpected bytes of data in your file raise this exception. It is important that we examine every byte in a file and handle it in some way, even if we explicitly ignore it.
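
A minimal sketch (the chunk handling is hypothetical; the point is that unexpected bytes are reported through the callback rather than silently dropped):

    if not DATA_REGEX.match(chunk):
        # Unknown bytes: account for them explicitly instead of skipping them
        self._sample_exception_callback(
            UnexpectedDataException("Found %d bytes of unexpected data" % len(chunk)))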

RecoverableSampleException

Record validation occurs in the data particle. If that record validation fails then this exception should be raised if we can continue parsing the file reliably. Otherwise, for a non-recoverable failure raise a SampleException.

SampleEncodingException

If an exception is raised when attempting to encode a value in the particle it should be caught and this exception should be raised.

Exception Mitigation Strategy

It is impossible to detect all the ways our parsers may fail. When we do encounter these failures, a human is best equipped to determine the cause of the error. Has the format changed? Is the file corrupt? Is there a bug in the driver? Once the cause of the problem is determined, the operator can create a supplemental data file that contains the records that were not ingested. That file is then dropped into the ingestion directory and re-ingested.

While this information isn't particularly relevant to the driver writer, it's good to understand how gap data can be recovered.

During normal operations we may discover that the validation sequences are too rigid. In that case we would update the validation scheme in the driver and follow the same mitigation plan listed above.
