There are three major CEI goals for this iteration.
- Integration - we will integrate the process dispatching functionality into the launch plan and use it to support transform processes dispatched by DM. To facilitate integration we will use a lightweight process dispatching approach that works on a single host. The end result will be a launch plan that bootstraps all OOI services locally and supports further processes being started by the Transform Management Service. Unfortunately, CEI does not have all of the functionality needed for this integration ready at the beginning of the iteration: for example, the lightweight launch, the pyon bridge to the Process Dispatcher, and a number of small features are still missing. These tasks must be completed before the real integration effort can begin.
- CEI service multiprocess support - all of the CEI services are currently singletons, but they are designed to be spread across processes and VMs with the help of Apache ZooKeeper. We will implement this support for the Provisioner, EPU Management Service, and Process Dispatcher. Note, however, that these ZooKeeper-backed services will not be used for the integration efforts.
- Scalability/Stress testing - We will run a new series of scale tests against the CEI services. These will explore the HA, scalability, and other behaviors of our components. While we do want to produce graphs of these experiments, the primary goal is to flush out and fix bugs and architectural problems in the system.
The tasks are detailed in the Google Document.
CEI depends on COI to:
- Support "fail-fast" for any process failure in the pyon container. This will enable CEI to detect failures by monitoring a unix process directly.
- In general support questions and bugs related to pyon process launching.
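The fail-fast dependency can be illustrated with a minimal sketch. It assumes the container runs as an ordinary Unix child process and signals any internal failure by exiting with a non-zero status. The names `monitor_process` and `is_failure` are hypothetical and are not part of pyon or the CEI code base.

```python
import subprocess
import sys
import time


def monitor_process(argv, poll_interval=0.1):
    """Start argv as a child process and block until it exits.

    Returns the exit code reported by the operating system.
    """
    proc = subprocess.Popen(argv)
    while True:
        code = proc.poll()  # None while the process is still running
        if code is not None:
            return code
        time.sleep(poll_interval)


def is_failure(exit_code):
    # Assumption: a fail-fast container exits non-zero on any process failure.
    # On Unix, a negative code means the child was killed by a signal.
    return exit_code != 0


if __name__ == "__main__":
    # Hypothetical stand-in for a container that fails fast with exit code 3.
    code = monitor_process([sys.executable, "-c", "import sys; sys.exit(3)"])
    print("failed" if is_failure(code) else "exited cleanly", code)
```

With fail-fast in place, a supervisor only needs the exit status of the container process. It does not have to inspect the state of individual processes inside the container.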
CEI depends on Integration to:
- Drive the launch plan integration effort once CEI delivers the needed components
DM depends on CEI to:
- provide and support the needed integration components for transform process launches
Integration depends on CEI to:
- provide and support needed integration components for the process-based launch plan
- Integration is time consuming and unpredictable. It could take much longer than expected.
- Many of our tasks are dependent on each other. Delays in one (for example the lightweight launch) could have a ripple effect and block a lot of other efforts.
- CIAD CEI SV Lightweight CEI Launch
- CIAD CEI SV R2 Elaboration Process Launch Strategy
- EPU Command Line Tools
- Launch Plan, cloudinitd, and EPU CLI Configuration Interaction
- HA Example Service for Process Dispatcher
Overall, this iteration was successful. Some tasks went over their initial estimates, but thanks to the extension we were able to accomplish nearly everything. Along the way we found and fixed many issues and improved the overall code quality. Some highlights from the iteration:
- Ran many scale tests on EC2. The goal was to push the VM management layers of the Provisioner and EPU Management Service to the point where we could see performance and scalability bottlenecks. At peak we started 1850 VMs simultaneously. Along the way we found many bugs and bottlenecks. We fixed some of these and are working on others during LCA and design weeks. Other, more complicated ones will become Construction tasks. We also learned a lot about EC2 behaviors at scale. (Results)
- Developed a new launch plan that works both locally ("lightweight") and on real VMs. The lightweight mode is backed by the EPU Harness, a new utility that runs a Process Dispatcher and EEAgent standalone. The lightweight mode was integrated with the r2deploy launch (though the VM mode has not been). (Design)
- Integrated the Provisioner service with ZooKeeper, which is used for distributed synchronization and (for now) persistence. The goal was to also integrate the Process Dispatcher and EPU Management Service, but this was not completed. It ended up being more difficult than expected, and the assigned developer (David) spent a lot of time dealing with integration and Senior Developer responsibilities. However, the model for these components will be the same as it is for the Provisioner, so we still consider the approach proven. Note that the large EC2 scale tests were backed by ZooKeeper Provisioners (10 of them).
- Integrated the pieces of the process management system together and with pyon. Worked closely with Jamie to integrate the full system launch.
- Performed preliminary integration of PD and EPUM, so that PD can "register needs" for more or fewer resources. Ran simple experiments on EC2 (Results).
- Numerous smaller features and fixes.