Overview

For background, see CEI R1C2 Wrapup

For information that was only relevant during the planning process (February 14th - 25th, 2011), see CEI R1C3 Potential Tasks

Result: CEI R1C3 Wrapup

Iteration Tasks

JIRA

The official task list and statuses are in JIRA. You must be logged in to JIRA for the following link to work (note that JIRA now uses a separate login from Confluence):

Packaging and Images

| Task | Description | URL | JIRA |
| --- | --- | --- | --- |
| Packaging | Create proper Python packages and release mechanisms for all code as well as the dt-data repository. |  | CIDEVCEI-155 |
| Package Separation | Work with COI to be first set of services to "detach" from lcaarch into a separate package (A work-remains from last iteration) | Code Structure, Packages, Assemblies, and Repositories | CIDEVCEI-128 |
| Packaging integration | Create recipes to handle package/assembly installation and not use git checkouts. This includes any "tac" file integration that will be necessary to use ION as an installed library only. | CEI R1C3 - Subtopic - System Integration | CIDEVCEI-156 |
| Move to base image | Adjust recipes to work with new CentOS base image (A work-remains from last iteration) | CEI R1C3 - Subtopic - System Integration | CIDEVCEI-133 |
| Test with base image on OOI Nimbus install | The whole system must actually run on the specific OOI IaaS installation (we've run the system against Nimbus before, this task is about validating the specific image and the specific IaaS platform) | CEI R1C3 - Subtopic - System Integration | CIDEVCEI-148 |
| Ensure libcloud enhancements are propagated upstream | We have enhanced libcloud to be compatible with contextualization/Nimbus, the task is to clean this up for good and submit the changes upstream to be included in the next release. This will allow us to keep updating libcloud versions (instead of getting new library updates and reapplying the patches). | LIBCLOUD-75 | CIDEVCEI-151 |

Monitoring

| Task | Description | URL | JIRA |
| --- | --- | --- | --- |
| OU Agent | Include a default process launch on worker instances that will deliver a heartbeat to the EPU controller. This will be treated as a new sensor (beyond IaaS status reports) that can be used to gauge whether the worker has failed or not. This will also be coordinated with the "respond to container exit status" work, which will be able to report specific failure errors if necessary (and/or deliberately stop the heartbeat pulse). | CEI R1C3 - OU Agent | CIDEVCEI-160 |
| OU Agent Sensor | Create a new sensor for the OU Agent and alter the decision engine implementations to take the heartbeat into account (and compensate for it). | CEI R1C3 - OU Agent Sensor | CIDEVCEI-161 |
| Respond to container exit status | When the Python CC exits with a non-zero exit code, this should be considered a failure, and that fact should be communicated via the heartbeater (and logged as a critical cloudyvent). | CEI R1C3 - Respond to container exit status | CIDEVCEI-162 |
| epumgmt status | Gather and output a lot of status information about the system. Provide hooks and different output options. |  | CIDEVCEI-170 |
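
The heartbeat and exit-status tasks above describe a worker-health signal that goes beyond IaaS status reports. The following is a minimal sketch of how a decision engine might combine the three signals. All names here (`WorkerState`, `record_heartbeat`, `is_failed`, `HEARTBEAT_TIMEOUT`) and the 30-second threshold are assumptions made for illustration, not part of the CEI code base.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: seconds of silence before a worker counts as failed.
HEARTBEAT_TIMEOUT = 30.0


@dataclass
class WorkerState:
    """What the controller knows about one worker (illustrative only)."""
    node_id: str
    iaas_state: str = "RUNNING"             # from the IaaS status reports
    last_heartbeat: Optional[float] = None  # timestamp of the last heartbeat
    exit_code: Optional[int] = None         # container exit code, if reported


def record_heartbeat(worker: WorkerState, now: float,
                     exit_code: Optional[int] = None) -> None:
    """Store a heartbeat and any container exit code it carries."""
    worker.last_heartbeat = now
    if exit_code is not None:
        worker.exit_code = exit_code


def is_failed(worker: WorkerState, now: float,
              timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """Combine IaaS state, exit code, and heartbeat age into one verdict."""
    if worker.iaas_state != "RUNNING":
        return True
    # A non-zero container exit code is treated as a failure.
    if worker.exit_code is not None and worker.exit_code != 0:
        return True
    # Never heard from: not judged here. A real engine would presumably
    # need a boot grace period for this case.
    if worker.last_heartbeat is None:
        return False
    return now - worker.last_heartbeat > timeout
```

Passing `now` explicitly, rather than calling the clock inside `is_failed`, keeps the check deterministic and easy to exercise in evaluations.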

Reliability & Correctness

| Task | Description | URL | JIRA |
| --- | --- | --- | --- |
| Rabbit security | Adjust launch plans and recipes to coordinate the password on the server and all clients. |  | CIDEVCEI-149 |
| Provisioner dynamic configuration | The coordinates for IaaS services and context brokers are currently hardcoded in the provisioner (the launch plan may only select from a predefined list of options). The task is to make this fully dynamic: the list of options should be stored with the launch plan and populated at provisioner boot time. | CEI R1C3 - Provisioner dynamic configuration | CIDEVCEI-154 |
| EPU Controller dynamic configuration | The service names of the EPU controllers need to be differentiated so that multiple instances can run at once. | CEI R1C3 - EPU Controller dynamic configuration | CIDEVCEI-152 |
| Repair option for base system | When a node in the base system itself fails, there should be a mode where its supervisor restarts it automatically. This could possibly mean restarting every bootlevel above it if the clients of this service cannot handle transient availability issues (and this may need a new configuration that says when to do this or not). | CEI R1C3 - Repair option for base system | CIDEVCEI-165 |
| Multiple provisioner processes (for better performance) | The container currently processes only one message at a time. Some provisioner code blocks on IaaS and context broker callouts. Instead of addressing this with bolt-on concurrency management, add provisioner processes. | CEI R1C3 - Multiple provisioner processes | CIDEVCEI-153 |
| Java apps | The recipes currently focus on launching things based on the Python container; adjustments are necessary for ioncore-java, Grails, and THREDDS. | CEI R1C3 - Subtopic - System Integration |  |
| System restart | Ensure that a pre-existing Cassandra instance can be used with a launch plan many times; this allows a full system restart. It is mostly the responsibility of the COI resource registry/Cassandra to get the schema creation and re-creation (or "no op") correct, and to make sure systems that start afterward are capable of handling pre-existing data. |  | CIDEVCEI-163 |
| EPU Controller Persistence | Ensure EPU controller can fail and be restarted (including on another machine) (A work-remains from last iteration) | CEI R1C2 - Subtopic - Persistence | CIDEVCEI-122 |
| Adjust errors to use the new fail-fast exit code | Adjust the CEI services to use the coming functionality that allows expression of "unrecoverable error". | CEI R1C3 - Respond to container exit status | CIDEVCEI-150 |
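
The "Repair option for base system" task above mentions that restarting a failed node may also require restarting every bootlevel above it, controlled by a new configuration setting. A minimal sketch of that decision follows. The function name and the `restart_dependents` flag are hypothetical, chosen only to illustrate the idea.

```python
def dependent_levels_to_restart(failed_level, highest_level, restart_dependents):
    """Return the bootlevels above a failed node that should also be restarted.

    failed_level       -- bootlevel (1-based) of the node that failed
    highest_level      -- the last bootlevel in the launch plan
    restart_dependents -- hypothetical configuration flag: True when clients
                          of the failed service cannot tolerate transient
                          unavailability
    """
    if not 1 <= failed_level <= highest_level:
        raise ValueError("failed_level must be within the launch plan")
    if not restart_dependents:
        # Only the failed node itself is restarted by its supervisor.
        return []
    # Every level above the failed one is restarted, lowest first.
    return list(range(failed_level + 1, highest_level + 1))
```

For example, with four bootlevels and a failure in level 2, the flag decides between restarting nothing else and restarting levels 3 and 4 in order.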

Evaluations & Testing

| Task | Description | URL | JIRA |
| --- | --- | --- | --- |
| Evaluations | CEI stress testing and robustness evaluations. Fix problems that occur. |  |  |
| Enhance evaluations to examine EPU components themselves dying | The previous tests do not examine EPU components dying. Fix problems that occur. |  | CIDEVCEI-172 |

Support, Usability & Documentation

| Task | Description | URL | JIRA |
| --- | --- | --- | --- |
| Assist with deployable type and launch plan creation | Other people ultimately need to create the specific DTs and launch plans, but CEI needs to provide direction and ongoing help. | CEI R1C3 - Subtopic - System Integration | CIDEVCEI-169 |
| Launch initial VMs simultaneously | Add a flag to launch base images simultaneously; this will make the developer/test cycle orders of magnitude shorter. The launch plans kick things off in bootlevels by starting VMs in lockstep (as the particular plan dictates), checking on each level before proceeding. Our system uses a common base image with a contextualization process that is kicked off after boot. The majority of the time spent in each level is bringing up the base image (without contextualization) at the IaaS service. The task is to separate instance starting from the contextualization (an optional flag). Going through the contextualization after the instances start represents the bootlevels in this case. | CEI R1C3 - Launch VMs simultaneously | CIDEVCEI-164 |
| Assist ITV with verification test creation | During the requirements meeting on 2/24/11 it was indicated that some ongoing work this iteration will be to assist with fleshing out exact details of requirement verification tests (which will be executed during the Transition phase). This will involve phone calls, Jabber, and possibly some script writing or (slight) alterations to code to make it easier to test. |  | CIDEVCEI-171 |
| Launch-plan validation ("dry run") | Instead of running into problems along the way, introduce a dry-run feature that will allow someone to author a launch plan more easily. This will help avoid problems that are known a priori to cause a launch to fail (for example, a dependency is referenced from a previous bootlevel but does not actually exist anywhere in the configuration). | CEI R1C3 - Add validation step to cloudinitd | CIDEVCEI-166 |
| Launch-plan organization ("DRY" principle) | Currently some configurations in the launch plan samples are repeated over and over (especially because the same image is used for everything). Add environment variable capability to some files. Add variable substitution capabilities wherever they are required. | CEI R1C3 - Make launch plan config easier to create | CIDEVCEI-167 |
| Per-service logfiles for cloudinit.d | Essential for making sure a launch can be debugged (especially during its authoring). | CEI R1C3 - Per-service logfiles for cloudinit.d | CIDEVCEI-168 |
| Work with Operations on Nimbus IaaS installation | (A work-remains from last iteration) | http://www.nimbusproject.org/docs/current/admin/index.html | CIDEVCEI-147 |
| Update the CEI pages in the CI architecture | Refine the design of the work package and document directly in the CI CEI architecture pages on Confluence (A work-remains from last iteration) | CIAD CEI Common Execution Infrastructure | CIDEVCEI-135 |
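
The launch-plan "dry run" task above gives one concrete example of a check: a dependency referenced from a previous bootlevel that does not exist anywhere in the configuration. A minimal sketch of such a check is shown below. The plan representation (a list of bootlevels, each a list of dicts with `name` and `deps` keys) and the function name `validate_plan` are assumptions for illustration only; this is not the cloudinit.d file format or API.

```python
def validate_plan(bootlevels):
    """Return a list of error strings; an empty list means the plan passed.

    bootlevels -- list of levels, each a list of service dicts such as
                  {"name": "provisioner", "deps": ["rabbit"]}
                  (a hypothetical in-memory form of a launch plan)
    """
    errors = []
    defined_earlier = set()
    for level_number, services in enumerate(bootlevels, start=1):
        for service in services:
            for dep in service.get("deps", []):
                if dep not in defined_earlier:
                    errors.append(
                        "bootlevel %d: service %r depends on %r, which is "
                        "not defined in any earlier bootlevel"
                        % (level_number, service["name"], dep)
                    )
        # Services become visible to later levels only after their own level.
        defined_earlier.update(service["name"] for service in services)
    return errors


# Example: the second plan references a service that was never defined.
good_plan = [
    [{"name": "rabbit"}],
    [{"name": "provisioner", "deps": ["rabbit"]}],
]
bad_plan = [
    [{"name": "rabbit"}],
    [{"name": "provisioner", "deps": ["rabbit", "cassandra"]}],
]
```

Collecting all errors instead of stopping at the first one lets a plan author fix several problems in a single pass, which fits the goal of making launch plans easier to write.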

Subsystem Dependencies

This section describes the agreements CEI has in place with other subsystems for this iteration, i.e., the work we expect from each other and by when.

Expected from CEI

| Task/Issue | Who drives it | Who else is involved | Due date | What this task blocks | Comments |
| --- | --- | --- | --- | --- | --- |
| Operations to receive direct advice/tools for monitoring Nimbus IaaS | DavidL | TimF, BrianD | 3/4 | Full operations setup | Shava |
| Introducing EPU layer to first integration test | TimF | TimF, JamieC | ? |  | Need to understand multiple datastore situation. Also, the packaging situation might be more of an emergency here since the "app" framework is what is being used for priming scripts |
| Merge context broker to sample launch plans | DavidL | TimF | ? | Many people outside CEI will start using launch plans this iteration, this means there is no need for getting public credentials for a ctx broker |  |
| ITV Verification Tests | RogerU | DavidL, TimF, JohnB, JamieC, Alan/Alex |  | Blocks ITV | During requirements meeting 2/24/11 it was indicated that some ongoing work this iteration will be to assist with fleshing out exact details of requirement verification tests (which will be executed during Transition phase). This will involve phone calls, Jabber, and possibly some script writing or (slight) alterations to code to make it easier to test. |
| Java Apps | TimF | ? |  | Dynamic launch of an ioncore-java program | Grails/THREDDS were taken off the table as something dynamically launched, but the integration tests will still launch things with ioncore-java, so this needs to be dynamically configured |
| CEI support for operations running IaaS | DavidL | TimF, JohnB, Operations | Ongoing |  |  |
| CEI support for people running whole system | TimF | * | Ongoing |  |  |

Expected by CEI

| Task/Issue | Who drives it | Who else is involved | Due date | What this task blocks | Comments |
| --- | --- | --- | --- | --- | --- |
| Container exit codes, the "fail fast" flag | Dorian | TimF, DavidL | 3/11 | This task blocks an end to end integration of failure management. Currently CEI monitoring tools cannot know if a process/container has gone into an unrecoverable error state. |  |
| Base image releases | AdamB | TimF | (may need to be several iterations of the image) | Blocks ability to lock in dependencies/platform | should consider ctx-agent pre-install ("vm-bootstrap" script... it does nothing in the cases where it should do nothing) |
| Messaging Stress | AdamS | TimF, PaulM | 3/11? | Should go into CEI evaluations with information about lower layers |  |
| Making services Cassandra aware | DavidS | TimF |  | Some services need a direct connection to Cassandra, the mechanism for this needs to be coordinated | This is like getting Rabbit info into everything |
| System Restart | DavidS | DavidL, TimF, MattR? |  | Ensure that a pre-existing Cassandra instance can be used with a launch plan many times, this allows a full system restart. Mostly COI resource registry's/Cassandra's responsibility to get the schema creation and re-creation (or "no op") correct. And to make sure systems that start afterward are capable of handling pre-existing data. |  |
 