- Local Install
- Setup for Virtual Machine
- Prepare Directory Structure
- GIT Setup
- Fork the Marine Integrations Repository
- Download Marine Integrations Repository
- Initialize the Environment
- Start New Dataset Driver
- GIT Commands and Branching
- Submodule Updates
- Configure Logging
- Data Types
- Exception Handling Info
The following presentation provides an overview of OOI and describes how Dataset Drivers fit into the code infrastructure.
If you have not received a login as part of the sign-up process, send an email to helpdesk-at-oceanobservatories.org to get one. This process will give you access to the Confluence wiki, Jira, FishEye and the other Atlassian tools via an LDAP server.
We run an XMPP/Jabber chat server (an ejabberd daemon), connected to our LDAP server. You should be able to join as soon as you have a Confluence login. Instructions for clients are at Jabber client setup.
Mac users should use Messages or Adium. On Windows, Pidgin is a very usable client.
If you will be working on a virtual machine, you need VPN access to reach it. If you have not received access and need it, send an e-mail to email@example.com. An RSA token will be sent to you.
You will need a GitHub account because this is where the repository for the MI code is. You can get a free account online at github.com.
If you are using a virtual machine, you will need to install Global Protect; see the VPN Installation Instructions.
To access the virtual machine, you will need to install an SSH client. PuTTY is a good option, but any client can be used.
PyCharm 2 is the preferred IDE for the Mac environment. See Working with Pycharm.
All of the following editors are currently in use, and are all acceptable:
- Komodo. Note there is also a free editor-only version of Komodo. Use whatever makes you happy.
- Emacs (Python Tutorial)
- Netbeans (Python Tutorial)
- Eclipse (see Eclipse as a Python IDE by BrianF)
- Geany (X11, GTK2, cross-platform).
- Wing IDE is pretty clean, snappy, X11-based, and cross-platform with a decent debugger, but it is Python only.
- Pycharm, which has VIM/Emacs modes for those who are so inclined. The group now has an open-source license for PyCharm, so it's free to use.
First, SSH into your virtual machine, which will provide you with a Linux terminal for your VM. Follow the instructions below to set up your VM.
You will need one directory for your code and one for your virtual environments. A virtual environment is not the same thing as a virtual machine; it is a place to store a particular environment setup.
$ mkdir Workspace
$ cd Workspace
$ mkdir code
- Generate an SSH key on your VM using:
$ ssh-keygen -t dsa
Do not enter a filename, just hit enter when it asks for a file to save the key. GitHub expects the default file name.
- Edit the ~/.gitconfig and change your full name (such as John Doe - please use full name and capitalization) and email address. This file may not exist; create it if it doesn't.
[user]
name = <your name>
email = <your email>
[color]
branch = auto
diff = auto
interactive = auto
status = auto
- Upload your SSH public key (located at ~/.ssh/id_dsa.pub) following instructions on GitHub, if you have not done so.
- You are now set to fork any repo in the OOI-CI source code repositories on GitHub. You can submit a pull request when you have a code delivery. The fork and pull request model is preferred so that your code can be properly reviewed by your subsystem lead before it is merged.
You can check that this worked by verifying you can navigate to github.com/<your git user name>/marine-integrations.
$ cd ~/Workspace/code
# Check out your fork read/write
$ git clone git@github.com:<yourgituser>/marine-integrations.git
# the next step must be in the repository directory
$ cd marine-integrations
# Add the upstream feed for the master repository
$ git remote add upstream git://github.com/ooici/marine-integrations.git
Setuptools and numpy may already be installed on your virtual machine; if so, you can try skipping those steps.
This setup takes a while to run, about 20 minutes.
$ cd ~/Workspace/code/marine-integrations
$ pip install -U setuptools==0.8
$ pip install numpy==1.7.1
Locate the IDD for your dataset driver and make sure you understand it. Then you will need to create new files for a parser, parser test, driver, and test driver.
To start on the parser, go to the mi/dataset/parser directory and create a new parser file with the same name as the IDD, written in all lower case and separated by underscores (the IDD name itself may not follow that convention).
To test your parser, make a new file in the mi/dataset/parser/test directory with the same name as your parser file, prefixed with 'test_'.
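The naming convention above can be sketched in Python. This is only an illustration: the helper names parser_file_name and parser_paths, the sample IDD name, and the assumption that spaces and hyphens are the separators to normalize are all hypothetical and are not part of the MI code base.

```python
import re


def parser_file_name(idd_name):
    """Convert an IDD name to a lower-case, underscore-separated module name."""
    # Assumption: spaces and hyphens are the separators that need normalizing.
    base = re.sub(r"[\s\-]+", "_", idd_name.strip())
    return base.lower()


def parser_paths(idd_name):
    """Return the (parser, parser test) file paths for an IDD name."""
    name = parser_file_name(idd_name)
    parser = "/".join(["mi", "dataset", "parser", name + ".py"])
    test = "/".join(["mi", "dataset", "parser", "test", "test_" + name + ".py"])
    return parser, test


# Hypothetical IDD name, used only to show the transformation.
parser_path, test_path = parser_paths("CTDPF-CKL WFP")
print(parser_path)
print(test_path)
```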
The driver path describes the top directory structure where your driver will be generated. We follow the convention of <Instrument and Series>/<Path To Data>.
For instance, for a driver with the identifier ctdpf_ckl_wfp_stc, ctdpf_ckl is the instrument and series, and wfp_stc is the path to the data. The driver path to enter would therefore be: ctdpf_ckl/wfp_stc.
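As a rough sketch of that convention, the function below splits an identifier into the driver path. It assumes the instrument and series always occupy the first two underscore-separated tokens, which is inferred from the single example above and may not hold for every driver; the name driver_path is hypothetical.

```python
def driver_path(identifier, instrument_parts=2):
    """Build '<Instrument and Series>/<Path To Data>' from a driver identifier.

    Assumption: the first `instrument_parts` tokens name the instrument and
    series, and the remaining tokens name the path to the data.
    """
    tokens = identifier.split("_")
    if len(tokens) <= instrument_parts:
        raise ValueError("identifier has no data path: %r" % identifier)
    instrument = "_".join(tokens[:instrument_parts])
    data_path = "_".join(tokens[instrument_parts:])
    return instrument + "/" + data_path


print(driver_path("ctdpf_ckl_wfp_stc"))  # prints ctdpf_ckl/wfp_stc
```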
See the GIT Help Page
Occasionally the extern modules that mi links to will change and need to be rebuilt. To rebuild them, run the following:
$ git submodule update
$ python bootstrap.py
There are several levels of logging messages which are output from MI code. By default, the MI logger is set to INFO. If you want to see debug messages, you will need to change the logging level by editing the file res/config/mi-logging.yml, which is a symbolic link to extern/ion-definitions/res/config/mi-logging.yml.
The following levels, listed in order from producing the least amount of output to producing the most verbose amount of output, are available:
DO NOT COMMIT YOUR CHANGES TO RES/CONFIG/MI-LOGGING.YML
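To illustrate how a logging threshold filters messages, here is a small sketch using the five levels of Python's standard logging module. It does not show the syntax or contents of mi-logging.yml, and the MI logger may define levels beyond these; the helper emitted_levels is hypothetical.

```python
import logging

# Standard library levels, ordered from least to most verbose.
LEVELS = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]


def emitted_levels(threshold):
    """Return the level names a logger set to `threshold` would emit."""
    limit = getattr(logging, threshold)
    return [name for name in LEVELS if getattr(logging, name) >= limit]


print(emitted_levels("INFO"))   # DEBUG messages are filtered out
print(emitted_levels("DEBUG"))  # every level is emitted
```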
See the Dataset Driver Testing Tutorial.
Each data particle has an internal timestamp associated with it, which is specified separately from any timestamp fields returned in the data. The internal timestamp of the data particle must be in NTP64 format. There is a Python library, ntplib, which can be used to do conversions to NTP64 format. One common conversion is from seconds since Jan 1 1970 (UTC) to NTP64, which ntplib provides as system_to_ntp_time.
The internal timestamp can be passed in as the last argument to _extract_sample in the parser if needed; however, it is preferable to set it using the set_internal_timestamp method in the particle class as part of _build_parsed_values.
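A minimal sketch of the Unix-to-NTP conversion, assuming timestamps are held as seconds in a Python number: NTP time counts from Jan 1 1900, so the offset from the Unix epoch is 2208988800 seconds. The helper names below are illustrative only; ntplib.system_to_ntp_time performs the same addition.

```python
# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01):
# 70 years including 17 leap days = 25567 days * 86400 seconds.
NTP_DELTA = 25567 * 86400


def unix_to_ntp(unix_seconds):
    """Convert seconds since 1970-01-01 UTC to seconds since 1900-01-01 UTC."""
    return unix_seconds + NTP_DELTA


def ntp_to_unix(ntp_seconds):
    """Convert seconds since 1900-01-01 UTC back to seconds since 1970-01-01 UTC."""
    return ntp_seconds - NTP_DELTA


print(unix_to_ntp(0))  # prints 2208988800
```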
In the dataset driver, we are working in Python, which has only four numeric data types: int, long, float, and complex. In the data particles, though, parameters are defined with numpy data types, which specify the number of bytes associated with each data type. The following chart gives the appropriate Python data type to encode your parameters based on the numpy data type (as defined in the data particle parameters).
|numpy data type||min||max||python data type|
|float32||6 significant decimal digits precision||9 significant decimal digits precision||float|
|float64||15 significant decimal digits precision||17 significant decimal digits precision||float|
The conversion from python to numpy data types is done later, outside of the dataset driver.
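To make the precision figures concrete, the sketch below round-trips a Python float through 4-byte and 8-byte IEEE 754 encodings with the standard struct module. It only illustrates the precision limits; it is not the conversion the platform performs, and the helper names are hypothetical.

```python
import struct


def as_float32(value):
    """Round-trip a Python float through a 4-byte IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def as_float64(value):
    """Round-trip a Python float through an 8-byte IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


value = 0.123456789012345
# The float32 result agrees with the input only in its leading digits.
print(repr(as_float32(value)))
# A Python float is already a double, so the float64 round trip is exact.
print(repr(as_float64(value)))
```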
Sample exceptions in dataset agents generally occur in the parser or particle class of the dataset driver. These exceptions are translated to a ResourceAgentErrorEvent, which is published to the dataset agent. When a sample exception is caught in the main parsing thread, there are two options: stop processing the file if we can no longer confidently generate reliable data particles, or continue parsing records if we still have high confidence that the data is valid. It is also possible to get encoding errors when creating particles. By default, any unhandled SampleException in the parser will kill the parser thread and raise a ResourceAgentErrorEvent to the agent with the exception string as the payload.
A SampleException should be raised any time there is a reasonable chance that the data we are parsing are not the correct blocks of data. It is important to note that this is not data validation; it is used to ensure that the bytes we are parsing are framed correctly.
In the parser thread, it is quite likely that you will want to continue when a SampleException is raised in the parser. The driver writer must explicitly catch the error and call the _sample_exception_callback method. A warning message should also be logged.
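The catch-and-continue pattern described above might look roughly like the following. Everything here is a stand-in: the SampleException class, the SketchParser class, its constructor argument, and the framing rule are hypothetical and only mimic the shape of a real parser. The one name taken from the text is _sample_exception_callback, whose real signature is assumed to accept the exception.

```python
import logging

log = logging.getLogger(__name__)


class SampleException(Exception):
    """Stand-in for the MI SampleException (hypothetical definition)."""


class SketchParser(object):
    """Hypothetical parser showing the catch-and-continue pattern."""

    def __init__(self, exception_callback):
        # Assumed: a callable that reports the exception onward.
        self._exception_callback = exception_callback

    def _sample_exception_callback(self, exception):
        self._exception_callback(exception)

    def _parse_record(self, record):
        # Assumed framing rule for this sketch: records start with "OK".
        if not record.startswith("OK"):
            raise SampleException("badly framed record: %r" % record)
        return record[2:]

    def parse(self, records):
        parsed = []
        for record in records:
            try:
                parsed.append(self._parse_record(record))
            except SampleException as e:
                # Log a warning, report the exception, then keep parsing.
                log.warning("Skipping record: %s", e)
                self._sample_exception_callback(e)
        return parsed


reported = []
parser = SketchParser(reported.append)
print(parser.parse(["OK1", "XX2", "OK3"]))  # prints ['1', '3']
```

Without the try/except, the first badly framed record would propagate out of parse and stop processing of the remaining records.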
This is the default behavior when sample exceptions are raised. The driver base class catches the exception and raises the ResourceAgentErrorEvent to the agent.
When we can no longer generate particles reliably the exception should be non-recoverable.
There are a few standard validation points where SampleExceptions may be raised. This could happen when validating the file, record or parameter. Unless these are explicitly handled in the parser they will kill the parser thread when raised.
Validation in this case is simply used to ensure we have confidence that what we are parsing is the correct blocks of data. It is NOT data validation. These validation sequences should be the minimum required to ensure high confidence that the data is framed properly. This is particularly relevant with binary files, but also applies to text files.
These validation schemes should ultimately be described in the IDD for the dataset agent.
File validation can be performed prior to starting record iteration. This step could verify file checksums, do byte counts, verify file headers, etc. It is recommended that this only be used when there is no way to validate individual records, as a failure here is generally a non-recoverable exception.
This exception can be either recoverable or non-recoverable, depending on the data file. Record validation should contain only the minimum checks required to ensure we are parsing the correct bytes of data. If we are uncertain that we are looking at a valid data record block, a SampleException should be raised and the block is not parsed, because we cannot be confident that the data would be parsed validly.
For example, assume we have a binary block that is 10 bytes long and has a sentinel sequence 0xAABB with a two-byte checksum. Our validation scheme might ask: are the first two bytes correct, and does the block pass the checksum? If not, raise an exception.
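A sketch of that record check follows. The text does not specify the checksum algorithm or layout, so this assumes the checksum is the sum of the first eight bytes modulo 65536, stored big-endian in the last two bytes. The SampleException class and the function names are hypothetical stand-ins.

```python
import struct

RECORD_SIZE = 10
SENTINEL = b"\xaa\xbb"


class SampleException(Exception):
    """Stand-in for the MI SampleException (hypothetical definition)."""


def checksum16(payload):
    """Assumed checksum: sum of the payload bytes modulo 65536."""
    return sum(bytearray(payload)) & 0xFFFF


def validate_record(record):
    """Check framing only (size, sentinel, checksum); return the data bytes."""
    if len(record) != RECORD_SIZE:
        raise SampleException("expected %d bytes, got %d" % (RECORD_SIZE, len(record)))
    if record[:2] != SENTINEL:
        raise SampleException("bad sentinel")
    (expected,) = struct.unpack(">H", record[-2:])
    if checksum16(record[:-2]) != expected:
        raise SampleException("checksum mismatch")
    return record[2:-2]


# Build a well-formed record: sentinel, six data bytes, then the checksum.
body = SENTINEL + b"\x01\x02\x03\x04\x05\x06"
good = body + struct.pack(">H", checksum16(body))
print(len(validate_record(good)))  # prints 6
```

Note that the check says nothing about whether the six data bytes hold sensible values; it only confirms the block is framed as expected.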
It is best to use the _encode_value method when encoding parameters, because it handles the exception handling for us automatically. Otherwise, you should raise a SampleEncodingException if you detect an exception when encoding.
By default sample exceptions should be recoverable exceptions and this behavior is already built into the base driver class. Values that fail encoding will be added to the particle as None.
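The behavior described above, where a failed encoding yields None instead of stopping the parser, can be sketched as below. This is not the MI _encode_value implementation: the function name encode_value, its signature, the encoding_errors list, and the dictionary keys value_id and value are assumptions made for illustration.

```python
import logging

log = logging.getLogger(__name__)


def encode_value(name, value, encoding_function, encoding_errors):
    """Encode one parameter; on failure record the error and store None."""
    encoded = None
    try:
        encoded = encoding_function(value)
    except Exception as e:
        # Recoverable: note the failure and fall through with None.
        log.error("Error encoding %s=%r: %s", name, value, e)
        encoding_errors.append({name: value})
    return {"value_id": name, "value": encoded}


errors = []
print(encode_value("temperature", "12.5", float, errors))
print(encode_value("pressure", "n/a", float, errors))
print(errors)
```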
Base class for all SampleExceptions. All sample exceptions are caught in the base driver class and raised as ResourceAgentErrorEvents. This exception can be raised from the parser class or the particle class when the error is non-recoverable.
When you detect unexpected bytes of data in your file raise this exception. It is important that we examine every byte in a file and handle it in some way, even if we explicitly ignore it.
Record validation occurs in the data particle. If that record validation fails then this exception should be raised if we can continue parsing the file reliably. Otherwise, for a non-recoverable failure raise a SampleException.
If an exception is raised when attempting to encode a value in the particle it should be caught and this exception should be raised.
It is impossible to detect all the ways our parsers may fail. When we do encounter these failures, a human is best equipped to determine the cause of the error. Has the format changed? Is the file corrupt? Is there a bug in the driver? After determining the cause of the problem, the operator can create a supplemental data file that contains the records not ingested. The file is then dropped into the ingestion directory and will be re-ingested.
While this information isn't particularly relevant to the driver writer, it's good to understand how gap data can be recovered.
During normal operations we may discover that the validation sequences are too rigid. In that case we would update the validation scheme in the driver and follow the same mitigation plan listed above.