
A harvester is a logical component of a dataset agent that periodically monitors for new source data files. When a new file is detected, it is opened serially and a handle is passed to the parser. On its own, the abstract harvester class does very little; harvesters are expected to be subclassed to handle different file formats and delivery mechanisms. A key function of the harvester is to maintain the state of files that have been imported, so that data is not re-ingested into OOIN as a duplicate.

Single Directory Harvester

The single directory harvester is used where a single fixed directory will have one or more files added to it. A file is considered completely delivered once it has remained unmodified in the source directory for a specified amount of time, which is set in the configuration and defaults to 30 seconds. For a file to be picked up, the current time must be more than that wait time past the file modification time, and the file must not already have been ingested, as indicated by the ingested state field.
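The delivery check described above can be sketched as follows. This is a minimal illustration, not the harvester's actual code: the function name `is_file_ready` and its `now` parameter are assumptions introduced here so the check can be exercised with a fixed clock.

```python
import os
import time


def is_file_ready(path, ingested, file_mod_wait_time=30, now=None):
    """Return True if the file looks completely delivered and is not yet ingested.

    `is_file_ready` and the `now` argument are hypothetical names used for
    illustration; only the 30 second default comes from the text above.
    """
    if ingested:
        # Already ingested files are never picked up again.
        return False
    if now is None:
        now = time.time()
    mod_time = os.stat(path).st_mtime
    # The current time must be strictly past modification time plus the wait.
    return now > mod_time + file_mod_wait_time
```

Passing `now` explicitly keeps the check deterministic when it is tested; in normal polling it would simply default to the current time.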

Configuration

The harvester configuration specifies the directory the harvester searches for new files in the 'directory' field, and the file pattern in the 'pattern' field. The pattern uses text matching, not regular expressions, where * matches any sequence of characters (letters, numbers, or symbols). For example, all files with the extension .dat can be located with the pattern '*.dat'. There are also two optional configuration parameters, frequency and file_mod_wait_time, which default to 1 and 30 seconds respectively. The frequency is the number of seconds to wait between polls for files. The file_mod_wait_time is how long after the file modification time to wait before a file is considered to no longer be changing, at which point it is added to the list of files.

There was previously a plan to copy files to a storage directory when they are about to be ingested, but this could not be implemented and is not currently used. If this behavior is used in the future, the directory will be stored in the 'storage_directory' field. This way, if a file is ingested and then changes in the original directory, the ingested copy still exists and can be compared with the changed version. The changed version of the file will not overwrite the original ingested version. If this is used in the future, the storage directory for each dataset driver must be unique: if one file contains data from multiple instruments, the ingested version of the file must be stored separately for each driver, since the drivers may have ingested it in different states.
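To illustrate the wildcard matching, here is a small sketch that assumes shell-style matching as provided by Python's standard fnmatch module. The text does not say which mechanism the harvester uses, so treat this as one plausible reading; `matching_files` is a hypothetical helper name.

```python
import fnmatch


def matching_files(file_names, pattern):
    """Return the names matching a wildcard pattern, sorted alphabetically.

    Assumption: shell-style wildcards via fnmatch. Note that fnmatch also
    gives special meaning to '?' and '[...]', which the text does not mention.
    """
    return sorted(
        name for name in file_names if fnmatch.fnmatchcase(name, pattern)
    )


# '*.dat' selects only names ending in .dat
print(matching_files(["b.dat", "a.dat", "notes.txt"], "*.dat"))
```

fnmatchcase is used rather than fnmatch so that matching stays case sensitive on every platform.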

Example:
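A configuration along these lines might look like the sketch below. The four field names and the two defaults come from the description above; the directory path, the `DEFAULTS` constant, and the `with_defaults` helper are hypothetical and only illustrate how the optional fields could be filled in.

```python
# Defaults for the optional fields, as stated in the text above.
DEFAULTS = {"frequency": 1, "file_mod_wait_time": 30}


def with_defaults(config):
    """Return a copy of the config with optional fields filled in.

    Hypothetical helper: raises KeyError if a required field is missing.
    """
    for key in ("directory", "pattern"):
        if key not in config:
            raise KeyError("missing required harvester field: %s" % key)
    merged = dict(DEFAULTS)
    merged.update(config)
    return merged


harvester_config = {
    "directory": "/tmp/dsatest",  # hypothetical path
    "pattern": "*.dat",
}

full_config = with_defaults(harvester_config)
```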

State

We need to maintain state about all the files detected in the directory and parser state for each of those files.  The intent of the parser state is to ensure we can recover from a parser failure and partial ingestion. While the harvester state will likely be wrapped by class objects in the code, ultimately it will be stored as a serialized dictionary in OOIN via the agent persistence mechanism.  The state will be stored in a layered dictionary, with each file having a dictionary within the top level dictionary.  The key for each file sub-dictionary will be the file name.  There will also be a version field in the top dictionary, which can be used for backwards compatibility if there are changes to the harvester state in the future. 

Elements:

parameter        description
file_name        name of the file (not the full path, just the name), this will be the key for each file sub-dictionary
file_size        size in bytes reported by stat
file_mod_date    unix time in epoch seconds of file modification time
file_checksum    calculated file checksum using md5 in python hashlib
ingested         Boolean if ingestion of this file is complete
parser_state     object specific to each parser representing parser state
modified_state   (optional) if a file is modified after it has been ingested, the modified state will be stored here, and have fields: file_size, file_mod_date, file_checksum (same as described above)
version          a top level version for this state, which can be used if the harvester state is modified in the future to handle backwards compatibility

Example:
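A state dictionary following the elements listed above might look like this sketch. The file names, sizes, dates, version value, and the contents of parser_state are all made up for illustration, and JSON is used only as a stand-in because the text does not name the serialization format.

```python
import hashlib
import json

# Checksums are computed from made-up contents so the values are real md5 digests.
original_checksum = hashlib.md5(b"original contents").hexdigest()
changed_checksum = hashlib.md5(b"changed contents").hexdigest()

harvester_state = {
    "version": 1,  # hypothetical version value
    "file_1.dat": {
        "file_size": 17,
        "file_mod_date": 1400000000,
        "file_checksum": original_checksum,
        "ingested": True,
        "parser_state": {"position": 17},  # parser specific, hypothetical
        # Present only because this file changed after it was ingested.
        "modified_state": {
            "file_size": 16,
            "file_mod_date": 1400000500,
            "file_checksum": changed_checksum,
        },
    },
    "file_2.dat": {
        "file_size": 2048,
        "file_mod_date": 1400000100,
        "file_checksum": original_checksum,
        "ingested": False,
        "parser_state": {"position": 512},
    },
}

# The layered dictionary survives a serialization round trip unchanged.
restored = json.loads(json.dumps(harvester_state))
```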

Notifications

If parsing a file fails, the user will be notified of the error and the driver will try to recover. Once the driver recovers, another notification indicating its recovery will be sent. Once the file has been parsed, successfully or unsuccessfully, it will be marked as ingested. The parsing failure must be handled manually.

If a file is modified after it has been ingested, the 'modified_state' parameter will be added to the state dictionary, containing the modified file size, modification date, and checksum. To determine if a file has changed, the file size and file modification date are examined first. If either has changed, the checksum is calculated and compared with the stored checksum; if the checksums differ, the file is considered modified. When this occurs, a notification is triggered to alert the user that an ingested file has been modified. If the file is modified another time, the information in the 'modified_state' parameter will be updated and a new notification sent. This way the user is notified just once each time the file is modified. Files that have been ingested and not modified will not have the 'modified_state' parameter.
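The modification check can be sketched as below. The function names are hypothetical, and one detail is an assumption: later checks compare against the most recent 'modified_state' when one exists, which is one way to get a single notification per modification.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Compute the md5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_ingested_file(path, file_state):
    """Update 'modified_state' if needed; return True if a notification is due.

    Hypothetical helper. Assumption: compare against the last recorded
    version (modified_state if present, else the ingested values).
    """
    stat = os.stat(path)
    last = file_state.get("modified_state", file_state)
    if (stat.st_size == last["file_size"]
            and stat.st_mtime == last["file_mod_date"]):
        return False  # cheap check passed, no checksum needed
    checksum = md5_of_file(path)
    if checksum == last["file_checksum"]:
        return False  # size or date changed, but the contents did not
    file_state["modified_state"] = {
        "file_size": stat.st_size,
        "file_mod_date": stat.st_mtime,
        "file_checksum": checksum,
    }
    return True
```

The checksum is only computed when the size or modification date differs, so unchanged files cost a single stat call per poll.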

Single File Harvester

The single file harvester monitors a single file for changes to that file, meaning that there will only be a single harvester item in the harvester state list.  

To determine if the file has changed, the file size and file modification date are examined first. If the file size, or both the file size and the modification date, have changed, then the file has changed. If only the modification date has changed, the checksum is also calculated and compared with the stored checksum.
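As a sketch of this decision, the function below takes the stored state, the current size and modification date, and a callable that computes the checksum on demand. The name `single_file_changed` and the callable argument are assumptions made so the checksum is only calculated when it is actually needed.

```python
def single_file_changed(stored, size, mod_date, compute_checksum):
    """Decide whether the monitored file has changed since the stored state.

    Hypothetical helper; `compute_checksum` is a zero-argument callable
    returning the current md5 checksum of the file.
    """
    if size != stored["file_size"]:
        # A size change alone (with or without a new date) means changed.
        return True
    if mod_date != stored["file_mod_date"]:
        # Only the date changed: fall back to comparing checksums.
        return compute_checksum() != stored["file_checksum"]
    return False
```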

Configuration

The harvester configuration specifies the directory containing the file in the 'directory' field, and the fixed file name in the 'pattern' field. There are also two optional configuration parameters, frequency and file_mod_wait_time, which default to 1 and 30 seconds respectively. The frequency is the number of seconds to wait between polls for files. The file_mod_wait_time is how long after the file modification time to wait before a file is considered to no longer be changing, at which point it is added to the list of files.

Previously, in order to preserve the version of the file that has been ingested, the file was to be copied to a storage directory, specified in the 'storage_directory' field. Each time the file is ingested, the file in the storage directory would be overwritten with the most recently ingested version. This could not be implemented and is not currently used, but it may be used in the future. If it is, the storage directory for each dataset driver must be unique: if one file contains data from multiple instruments, the ingested version of the file must be stored separately for each driver, since the drivers may have ingested it in different states.

Example:
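A single file configuration might look like the following sketch, where 'pattern' holds a fixed file name rather than a wildcard. The path and file name are made up, and `monitored_path` is a hypothetical helper showing how the two fields combine.

```python
import os

single_file_config = {
    "directory": "/tmp/dsatest",       # hypothetical path
    "pattern": "instrument_data.dat",  # hypothetical fixed file name
    "frequency": 1,                    # seconds between polls (default)
    "file_mod_wait_time": 30,          # seconds unmodified (default)
}


def monitored_path(config):
    """Return the full path of the single file being monitored."""
    return os.path.join(config["directory"], config["pattern"])
```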

State

We need to maintain state about the file and the parser state for that file.  The intent of the parser state is to ensure we can recover from a parser failure and partial ingestion. While the harvester state will likely be wrapped by class objects in the code, ultimately it will be stored as a serialized dictionary in OOIN via the agent persistence mechanism. The key for the file sub-dictionary will be the file name.

Elements:

parameter        description
file_name        name of the file (not the full path, just the name), this will be the key for each file sub-dictionary
file_size        size in bytes reported by stat
file_mod_date    unix time in epoch seconds of file modification time
file_checksum    calculated file checksum using md5 in python hashlib
parser_state     object specific to each parser representing parser state

Example:
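One way such a state could be built from the file on disk is sketched below. `build_single_file_state` is a hypothetical name, and the shape of parser_state is left to the caller because it is specific to each parser.

```python
import hashlib
import os


def build_single_file_state(path, parser_state=None):
    """Build a single file harvester state keyed by the bare file name.

    Hypothetical helper; reads the whole file to compute its md5 checksum.
    """
    stat = os.stat(path)
    with open(path, "rb") as handle:
        checksum = hashlib.md5(handle.read()).hexdigest()
    return {
        os.path.basename(path): {
            "file_size": stat.st_size,        # bytes, as reported by stat
            "file_mod_date": stat.st_mtime,   # epoch seconds
            "file_checksum": checksum,
            "parser_state": parser_state,
        }
    }
```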

Notifications

This harvester will not provide a warning when the file has been modified, because this is expected behavior for this harvester.

A check will be performed to confirm that the filename in the state matches the filename in the configuration.  If these file names do not match an exception will be raised.
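A minimal sketch of this check follows. The exception class and function name are hypothetical, since the text does not say which exception is raised; the sketch also assumes the state is keyed by file name as described in the State section.

```python
class HarvesterStateError(Exception):
    """Hypothetical exception type for a state / configuration mismatch."""


def validate_state_file_name(state, config):
    """Raise if any file name in the state differs from the configured name."""
    expected = config["pattern"]
    for file_name in state:
        if file_name != expected:
            raise HarvesterStateError(
                "state file name %r does not match configured name %r"
                % (file_name, expected)
            )
```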

Multiple Harvester / Parser Dataset Drivers

There may be some instances where multiple harvesters / parsers are needed for a single driver.  In this case a different harvester configuration is needed to keep track of the different harvesters.  Each harvester / parser pair is given a data source ID that links the harvester to its corresponding parser. The configuration is keyed by data source ID, and each sub-dictionary identified by an ID has the same form as the single directory harvester or parser configuration.  The harvester and parser configs are each indexed separately by the data source key because they come from different columns in the pre-load spreadsheet.

The driver state will be the same as the single directory harvester for now, with the addition of a data source key to index into it. 

The naming for the data source IDs is the name of the corresponding parser followed by '_telemetered' or '_recovered' for telemetered and recovered parsers, respectively.  If the same parser is to be used for both telemetered and recovered, this differentiates between the names.

For example the parser dosta_abcdjm_sio is used for both telemetered and recovered, so the two harvester IDs in the configuration would be: 'dosta_abcdjm_sio_telemetered' and 'dosta_abcdjm_sio_recovered'.  

This also applies to drivers with multiple different parsers.  For instance, the cg_stc_eng driver combines the cg_stc_eng_stc, mopak_o_dcl, and rte_o_dcl parsers into one driver.  The harvester keys would then be:

cg_stc_eng_stc parser: 'cg_stc_eng_stc_telemetered', 'cg_stc_eng_stc_recovered',

mopak_o_dcl parser:  'mopak_o_dcl_telemetered', 'mopak_o_dcl_recovered',

rte_o_dcl parser: 'rte_o_dcl_telemetered', 'rte_o_dcl_recovered'
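The naming rule can be expressed as a short sketch. `data_source_ids` is a hypothetical helper; the suffixes and parser names are taken from the text above.

```python
SUFFIXES = ("telemetered", "recovered")


def data_source_ids(parser_names):
    """Return the data source IDs for each parser, telemetered first."""
    return [
        "%s_%s" % (name, suffix)
        for name in parser_names
        for suffix in SUFFIXES
    ]


cg_stc_eng_ids = data_source_ids(
    ["cg_stc_eng_stc", "mopak_o_dcl", "rte_o_dcl"]
)
```

This assumes every parser has both a telemetered and a recovered variant; a driver that uses only one of them would list just that ID.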

State

We need to maintain state about all the files detected in the directory and parser state for each of those files.  The intent of the parser state is to ensure we can recover from a parser failure and partial ingestion. While the harvester state will likely be wrapped by class objects in the code, ultimately it will be stored as a serialized dictionary in OOIN via the agent persistence mechanism.  The state will be stored in a layered dictionary, with each file having a dictionary within the top level dictionary.  The key for each file sub-dictionary will be the file name.  There will also be a version field in the top dictionary, which can be used for backwards compatibility if there are changes to the harvester state in the future. 

Elements:

parameter        description
file_name        name of the file (not the full path, just the name), this will be the key for each file sub-dictionary
file_size        size in bytes reported by stat
file_mod_date    unix time in epoch seconds of file modification time
file_checksum    calculated file checksum using md5 in python hashlib
ingested         Boolean if ingestion of this file is complete
parser_state     object specific to each parser representing parser state
modified_state   (optional) if a file is modified after it has been ingested, the modified state will be stored here, and have fields: file_size, file_mod_date, file_checksum (same as described above)
version          a top level version for this state, which can be used if the harvester state is modified in the future to handle backwards compatibility

Example:
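With a data source key added as the outer index, the driver state might be laid out as in this sketch. The file names, values, and placeholder checksum are made up, and placing 'version' inside each data source entry is an assumption; the text does not say where it lives in this layout. `get_file_state` is a hypothetical accessor.

```python
PLACEHOLDER_CHECKSUM = "0" * 32  # stands in for a real md5 hex digest

driver_state = {
    "dosta_abcdjm_sio_telemetered": {
        "version": 1,  # assumed location of the version field
        "telemetered_1.dat": {
            "file_size": 1024,
            "file_mod_date": 1400000000,
            "file_checksum": PLACEHOLDER_CHECKSUM,
            "ingested": True,
            "parser_state": {"position": 1024},  # parser specific
        },
    },
    "dosta_abcdjm_sio_recovered": {
        "version": 1,
        "recovered_1.dat": {
            "file_size": 4096,
            "file_mod_date": 1400000200,
            "file_checksum": PLACEHOLDER_CHECKSUM,
            "ingested": False,
            "parser_state": {"position": 2048},
        },
    },
}


def get_file_state(state, data_source_key, file_name):
    """Return the state for one file under one data source, or None."""
    return state.get(data_source_key, {}).get(file_name)
```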
