The sio_mule_common.py dataset agent parser is a common parser which can be used to parse data for all instruments whose data has been wrapped in an SIO header (see [sio_mule_common|https://confluence.oceanobservatories.org/display/instruments/sio_mule_common] for a description of the header).

h2. Recent Changes

The file format parsed by the sio mule common code has changed from a single file which changes over time to a set of fixed files (similar to how the recovered sio parsers previously worked).  This change was made because the new architecture does not handle file state.  The original file is still created, but keeping track of the state has moved into a new program which runs on the shore server and keeps track of the state there.  The new files are separated by instrument group.  It is instrument 'group' rather than just instrument because there may be multiple sio header IDs associated with a parser, so they are grouped into files by parser.  For example, the ctdmo_ghqr_sio parser receives both 'CT' and 'CO' sio header IDs in its .ctdmo.dat files.  The reversing of escape characters (\x18\x6b \-> \x2b and \x18\x58 \-> \x18) has also been moved to the program on the shore server, so this is no longer required by the parser.

h2. Migrating a Parser

To migrate a parser from the old sio parser to the new one, the following steps should be performed:

If the driver is one that is located in the mflm directory, the files are moved to the locations matching the IDD names.  For example, mflm/adcp is now moved to adcps_jln/sio.  This just requires moving any \__init__.py files and any files still needed in the resource directory to the new location.  This may also require renaming the classes in the parser to match the IDD name.

If the driver is not in the mflm directory, no copying or renaming is needed.

The resource files may no longer be relevant because the file format has changed (nodeXXp1.dat files are no longer used; the format has changed to nodeXXp1_N.instrumentgroup.dat).  The new file format can currently only be found in the IDDs, not on the acquisition server.  It is possible that the data used to generate the files in the new format contains samples matching the old data file, so check if any of the .yml files can be reused.
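
For instance, locating files in the new naming scheme might look like the following sketch (the node number and path are made up; only the nodeXXp1_N.instrumentgroup.dat pattern comes from the description above):

{code:python}
# Hypothetical example of locating the new per-instrument-group files,
# e.g. node59p1_0.ctdmo.dat, rather than a single node59p1.dat file.
import glob

new_format_files = glob.glob('/path/to/data/node59p1_*.ctdmo.dat')
{code}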

The sio common class is now named 'SioParser', which is used by both telemetered and recovered parsers.  The 'SioMuleParser' class no longer exists.  The new SioParser class no longer handles state.  The only input arguments now required are config, stream_handle, and exception_callback, so this needs to be changed in the classes extending SioParser.  In parse_chunks, the chunk_sample_count is no longer used, so this needs to be removed from existing parsers.
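
For illustration, a minimal constructor sketch under the new signature; the class name and import path here are assumptions, only the (config, stream_handle, exception_callback) argument list comes from the change described above:

{code:python}
# Hypothetical subclass; module path and class name are assumed.
from mi.dataset.parser.sio_mule_common import SioParser

class ExampleTelemeteredParser(SioParser):
    def __init__(self, config, stream_handle, exception_callback):
        # the state, state_callback and publish_callback arguments
        # from the old SioMuleParser are gone
        super(ExampleTelemeteredParser, self).__init__(config,
                                                       stream_handle,
                                                       exception_callback)
{code}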

The unit tests should have all tests handling changing state removed (starting in the middle, changing state, starting and stopping).  If this is one of the mflm parsers, the import paths and class names will need updating in the tests too.  The existing non-state tests for telemetered need to be updated to use the new example file from the IDD, and to test by comparing to .yml files rather than other methods.  The data in the new format of files may match up to some data in the old file, so it may be possible to reuse some of the created .yml files.  The recovered files were already in a fixed format and can remain.  To transition the recovered tests it helps to look at the old driver test to determine which input test files match up to the output .yml files.

For an example, the adcps_jln/sio and dosta_abcdjm/sio parsers have been migrated and are committed into oceanobservatories (as of 11/6/14).

h2. State

The following sections describe how state was handled by the previous SioMuleParser; as noted above, the new SioParser class no longer handles state.

This parser contains a more complicated state than just the usual file position because there is one single large file, and this file may contain blocks of zeroed out data which may be filled in later.

To keep track of which areas of the file have been evaluated, and which areas need to be revisited, there are two fields in the state, IN_PROCESS_DATA and UNPROCESSED_DATA.

Both of these contain an array of the start and end byte indices of sections of the file.  If a section of data is matched to an sio header in the sieve_function, the start and end indices of that section are appended to IN_PROCESS_DATA.  UNPROCESSED_DATA is initialized to the start and end indices of the file, and has sections removed as they are processed.  Processed means that the block of data has had all the samples within that block read from the file and returned through get_records().

For example, for a file of length 100, the starting state would be:

IN_PROCESS_DATA: \[\], UNPROCESSED_DATA: \[\[0, 100\]\]

If a block of data was found which matched the sio header starting at 25 and ending at 75, and that block was processed, the state would become:

IN_PROCESS_DATA: \[\], UNPROCESSED_DATA: \[\[0, 25\], \[75, 100\]\]

These start and end indices include the SIO header itself, so they start at the \x01 and end at the \x03.
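
The same example expressed as data (the dictionary key names here are assumptions; the index values follow the example above):

{code:python}
# Key names are illustrative; values follow the 100 byte file example.
state = {'in_process_data': [],
         'unprocessed_data': [[0, 100]]}    # starting state

# after the block from 25 to 75 has been matched and fully processed
state = {'in_process_data': [],
         'unprocessed_data': [[0, 25], [75, 100]]}
{code}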

The IN_PROCESS_DATA is used to keep track of the state of samples that have been read from the file, and whether or not they have actually been returned.  There are two additional integers after the start and end indices in IN_PROCESS_DATA: the number of samples that have been parsed from the file within this sio header block, and the number of samples that have actually been returned through get_records().  This allows the parser to stop at any point, even if there is more than one sample within the sio header block, and only return the samples that have not been sent previously.

For example, say there were 3 samples within the sio header block from 25-75 in the example above.  If you requested just one record in get_records(), you would have the state:

IN_PROCESS_DATA: \[\[25, 75, 3, 1\]\], UNPROCESSED_DATA: \[\[0, 100\]\]

The 3 indicates that 3 samples were parsed within that block, and the 1 indicates that only 1 of these has been returned.  If you requested another sample in get_records(), then the state would be:

IN_PROCESS_DATA: \[\[25, 75, 3, 2\]\], UNPROCESSED_DATA: \[\[0, 100\]\]

After requesting the third sample, all of the samples in the IN_PROCESS_DATA sio header block from 25-75 would be fully processed (i.e. actually returned to the requester), so it is removed from in_process_data, and unprocessed data is updated:

IN_PROCESS_DATA: \[\], UNPROCESSED_DATA: \[\[0, 25\], \[75, 100\]\]
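
The same progression as a small, self-contained sketch; this just mimics the arithmetic described above and is not the parser's actual code:

{code:python}
# Illustrative bookkeeping only; numbers follow the 25-75 block with
# 3 samples from the example above.
in_process = [[25, 75, 3, 1]]     # 3 parsed, 1 returned so far
unprocessed = [[0, 100]]

for _ in range(2):                # request the 2nd and 3rd samples
    block = in_process[0]
    block[3] += 1                 # one more sample returned
    if block[3] == block[2]:      # every parsed sample now returned
        in_process.remove(block)
        # carve the finished block out of the unprocessed range
        # (simplified for this single-range example)
        unprocessed = [[0, block[0]], [block[1], 100]]

print(in_process)                 # []
print(unprocessed)                # [[0, 25], [75, 100]]
{code}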

The in_process_data takes precedence over the unprocessed data: the parser will first look in the in_process_data sio header blocks for samples, then when there are no more in process data blocks it will look in the unprocessed data blocks to see if any additional data has been filled in.

If the parser is restarted with a state that has an in process data block, the parser will throw out the number of samples that it has already returned (which is the last item in IN_PROCESS_DATA), and start with the following sample.
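
In terms of the example above, the skip count on restart is simply the last item of the in process block (a minimal illustration, not the actual implementation):

{code:python}
# Values follow the example above: 3 samples parsed from the 25-75
# block, 1 already returned before the restart.
in_process_block = [25, 75, 3, 1]

# on restart the parser discards this many samples from the re-parsed
# block and resumes with the next one (the 2nd sample here)
samples_to_skip = in_process_block[3]
{code}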

h3. Sample Count

The third item in IN_PROCESS_DATA is filled in using the \_chunk_sample_count list.  This list is generated in each individual parser, usually in parse_chunks, and counts samples as they are parsed from the file, where each item in the list represents the number of samples in one sio header block.  When the state is incremented, this information is moved into the third item in IN_PROCESS_DATA.  The \_chunk_sample_count list must have one item appended for every chunk, even if no samples are found.  This allows the corresponding blocks to be marked as having no samples and considered processed, which ultimately removes that set of indices from the in process and unprocessed data when the state is incremented.  If samples are found, this keeps track of how many samples were parsed within that chunk.

For instance, if you were looking for 'AD' instrument records, and a file had a sequence of 'PS', 'CO', 'AD', 'DO', 'AD' records, each with one sample in them, then because we are only looking for 'AD' records the chunk sample count list would be \[0, 0, 1, 0, 1\].

If you were looking instead for 'DO' instrument records in this same file, and the DO record had 3 samples in it, the chunk sample count would be \[0, 0, 0, 3, 0\].
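
A small runnable sketch of this counting, using the 'AD' example above; the loop and variable names are illustrative, not the parser's actual code:

{code:python}
# Pure illustration reproducing the 'AD' example above.
header_ids = ['PS', 'CO', 'AD', 'DO', 'AD']   # one sio block per entry
samples_per_block = [1, 1, 1, 1, 1]           # each block holds 1 sample

chunk_sample_count = []
for header_id, n_samples in zip(header_ids, samples_per_block):
    if header_id == 'AD':
        chunk_sample_count.append(n_samples)
    else:
        # still append, so blocks with no wanted samples are marked
        # processed and their indices drop out of the state
        chunk_sample_count.append(0)

print(chunk_sample_count)    # [0, 0, 1, 0, 1]
{code}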

h2. Modem escapes and data size

Before blocks of data are added to the chunker, a replace is done on the data to replace \x18\x6b with \x2b and \x18\x58 with \x18.  This means that the size of the data will actually change, removing a byte for each replace.  Therefore the indices of data blocks in the file may not be the same as the data blocks in the chunker, so if you are trying to determine the indices of IN_PROCESS_DATA and UNPROCESSED_DATA blocks for comparison in your tests, remember to adjust for the escape sequences.
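
For example (a minimal sketch; the sample data and replacement order are assumptions, the two substitutions themselves are from the description above):

{code:python}
# Each substitution shortens the data by one byte, which is why file
# indices and chunker indices can drift apart.
data = '\x01AB\x18\x6bCD\x18\x58EF\x03'    # 12 bytes from the file

data = data.replace('\x18\x6b', '\x2b').replace('\x18\x58', '\x18')

print(len(data))    # 10 -- two bytes shorter after the replaces
{code}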

h2. Earlier Changes

The common set of code required some changes to fix some bugs and make it a little more readable:

The escape character replace was corrected to actually replace the characters; this was previously broken.  In order to handle indexing with the in_process_data and unprocessed_data blocks, an additional change was made which reads in the entire file (in small blocks so as not to block the processor) and stores it in a data buffer, self.all_data, which has the escape character replace performed on it.  Data is then accessed from the buffer rather than the file, allowing the indices in the state to line up.
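
A sketch of that buffered read, assuming the usual file-like stream_handle; the block size is an illustrative value, not the actual constant:

{code:python}
# Read the whole file in small blocks, then apply the escape replaces
# to the complete buffer so state indices line up with the data.
READ_SIZE = 1024

def read_all(stream_handle):
    pieces = []
    block = stream_handle.read(READ_SIZE)
    while block:
        pieces.append(block)
        block = stream_handle.read(READ_SIZE)
    data = ''.join(pieces)
    return data.replace('\x18\x6b', '\x2b').replace('\x18\x58', '\x18')
{code}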

The timestamp state key was removed because it was not actually helping to keep track of the state.  The use of self._timestamp is also not necessary; if you don't set it to a value, nothing bad will happen.

ID constants are now used in place of \[0\], \[1\], \[2\], \[3\] to index into the in process and unprocessed data blocks to improve readability:

\[0\] \-> \[START_IDX\], \[1\] \-> \[END_IDX\], \[2\] \-> \[SAMPLES_PARSED\], \[3\] \-> \[SAMPLES_RETURNED\]
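
With illustrative definitions matching that mapping, the indexing then reads as follows (the exact constant definitions in the code may differ):

{code:python}
# Values per the mapping above.
START_IDX = 0
END_IDX = 1
SAMPLES_PARSED = 2
SAMPLES_RETURNED = 3

block = [25, 75, 3, 1]
start = block[START_IDX]            # instead of block[0]
returned = block[SAMPLES_RETURNED]  # instead of block[3]
{code}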

A recovered_flag argument was added which allows the escape character replacement to be turned on and off.

These changes have been merged into ooici, and the mflm instruments which use this file have been updated.  The ctdmo has been updated to also use the new multiple harvester dataset driver, with no parser defined yet for the recovered data.