|This page describes the plan for a subsystem iteration including the high level goals for the iteration. It should also contain dependencies on other subsystems, plans for integrating functionality and a list of iteration tasks or a link to the subsystem google doc containing the iteration task details. This page should also contain references to the associated architecture pages, construction plan pages, and use cases.|
The overall goal is to maintain and enhance CEI system stability. We will also add small features to round out the functionality.
- Improve system launch capabilities by allowing multiple CEI service workers, and allowing different instance types to be used in the same launch.
- Enhance Process Dispatcher scaling logic to boot new VMs before they are needed
- Introduce backoff to process restarts, to reduce system churn for processes that fail consistently
- Bring service autoscaling into production, and improve CLI tools for managing it
- Improve Process Dispatcher detection of stalled containers and processes
- Prepare for R3 by eliminating the mirroring in the Process Dispatcher between ZooKeeper and the Resource Registry
- Consolidate Agent base classes for HA Agent and EE Agent
- We depend on COI to provide process health events in the container
- We depend on SA to refactor the Agent subsystem
- We did not allow much time in the iteration for general integration activities, bug fixes, etc. In the past we've found these activities take 25-35% of our total time.
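The restart backoff goal listed above can be illustrated with a minimal sketch. It assumes an exponential schedule keyed on the number of consecutive failures; the function name `restart_delay` and the `base`, `factor`, and `cap` values are hypothetical and are not taken from the Process Dispatcher code.

```python
def restart_delay(consecutive_failures, base=1.0, factor=2.0, cap=300.0):
    """Return seconds to wait before the next restart attempt.

    Hypothetical helper: all parameter names and defaults are assumptions
    made for illustration only.
    """
    if consecutive_failures <= 0:
        # A process that has not failed yet can be started immediately.
        return 0.0
    delay = base
    # Grow the delay once per failure after the first, stopping at the cap
    # so that a consistently failing process settles at a fixed retry rate.
    for _ in range(consecutive_failures - 1):
        delay *= factor
        if delay >= cap:
            return cap
    return min(delay, cap)


if __name__ == "__main__":
    for failures in range(0, 12):
        print(failures, restart_delay(failures))
```

Capping the delay keeps a consistently failing process from churning the system while still giving it a chance to recover once the underlying problem is fixed.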
We accomplished a lot this iteration but did drop some tasks. The high points of the iteration are:
- Service autoscaling was tested, improved, and enabled in the nightly launch for the visualization service. We use the process saturation metric emitted by the container, which is a percentage measure of the time each process spends processing messages versus idling. We roll these up as an average over all service workers. There may be some room to improve the metrics.
- CLI tools for managing HA Agents are greatly improved. Since CEI Agents will not be represented in the UI in R2, CLI tools are important. It is now much easier to interrogate and manage HA Agents. It is also now possible to change the scaling policy (between sensor-scaling and simple HA policies) for a running service, to change parameters on existing policies, and to query a list of all HA services in the system.
- The Process Dispatcher now uses heuristics to detect containers that have stalled or died. These containers are not assigned any more processes and eventually their existing processes are rescheduled to other containers (if possible).
- The Process Dispatcher now scales more intelligently by requesting new VMs in advance of actual need.
- Several other small features were added. We also worked to support the scale testing and provided multiple fixes.
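The saturation-based autoscaling described above, where per-process saturation is averaged over all service workers, might be sketched as follows. This is only an illustration: the function names, the thresholds, and the one-worker-at-a-time step are assumptions, not the actual HA Agent policy.

```python
def average_saturation(worker_saturations):
    """Average the per-worker saturation values (percent busy, 0 to 100)."""
    if not worker_saturations:
        return None
    return sum(worker_saturations) / len(worker_saturations)


def desired_worker_count(worker_saturations, scale_up_at=80.0,
                         scale_down_at=20.0, min_workers=1, max_workers=10):
    """Suggest a worker count from the averaged saturation metric.

    Hypothetical policy: thresholds and limits are illustrative assumptions.
    """
    current = len(worker_saturations)
    avg = average_saturation(worker_saturations)
    if avg is None:
        # No workers are reporting, so fall back to the configured minimum.
        return min_workers
    if avg > scale_up_at:
        target = current + 1   # workers are mostly busy: add one
    elif avg < scale_down_at:
        target = current - 1   # workers are mostly idle: remove one
    else:
        target = current       # within the comfortable band: no change
    # Never leave the allowed range, whatever the metric says.
    return max(min_workers, min(max_workers, target))


if __name__ == "__main__":
    print(desired_worker_count([90.0, 85.0]))
    print(desired_worker_count([10.0, 5.0, 15.0]))
```

A plain average hides a single overloaded worker among idle ones, which is one of the ways the metric could be improved.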
The biggest boondoggle was the attempt to eliminate Process Dispatcher mirroring. At least two weeks were spent working on it and trying multiple approaches. The root issue was the different integrity constraints provided by ZooKeeper and the Resource Registry, in particular with regard to associations. We now have a (hopefully) better plan that we can pick back up in R3. For R2 we expect little difficulty as a result of dropping this task.
We also did not attempt to port the CEI agents away from the simple agent base class. SA did not deliver needed refactors in time.
We expect to stay busy during transition. On our plate:
- Support system scale testing. Jonathan has been finding occasional issues but we haven't had much time to investigate.
- Robust EPU Management Service -> Provisioner messaging. A longstanding bug (OOIION-754) is that a well-timed failure can leave the EPUM service in an inconsistent state. We plan to fix this by improving messaging retry behavior.
- Robust Process Dispatcher -> EE Agent messaging. This is very similar to the issue above. Jonathan encountered a race condition in his scale tests that would have been alleviated by better messaging behavior in the PD Matchmaker.
- General integration support and bugfixes. We have a few other OOIION bugs to look at, and others will certainly emerge.
- Test code coverage. Our coverage is pretty decent, but we will look for any obvious improvements.
- Documentation. Specifically, we need to improve our operator-facing documentation.
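The messaging retry behavior mentioned in the two messaging items above could take roughly the following shape. This is a sketch under stated assumptions, not the planned fix: `send_with_retry`, `DedupingReceiver`, the `request_id` field, and the retry parameters are all hypothetical. The idea is that the sender repeats a request until it is acknowledged, and the receiver ignores repeats so that a retry after a lost acknowledgement does not apply the request twice.

```python
import time


class DeliveryFailed(Exception):
    """Raised when no acknowledgement arrives within the allowed attempts."""


def send_with_retry(send, message, attempts=5, initial_delay=0.5,
                    sleep=time.sleep):
    """Call send(message) until it returns True; return the attempt number."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            if send(message):
                return attempt
        except OSError:
            pass  # treat a transport error like a missing acknowledgement
        if attempt < attempts:
            sleep(delay)
            delay *= 2  # wait longer before each further retry
    raise DeliveryFailed("no acknowledgement after %d attempts" % attempts)


class DedupingReceiver(object):
    """Applies each request once, keyed on a sender-chosen request_id."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, message):
        request_id = message["request_id"]
        if request_id not in self.seen:
            self.seen.add(request_id)
            self.applied.append(message["payload"])  # stand-in for real work
        return True  # acknowledge both first deliveries and repeats


if __name__ == "__main__":
    receiver = DedupingReceiver()
    msg = {"request_id": "r1", "payload": "launch"}
    print(send_with_retry(receiver.handle, msg))
```

Reusing the same `request_id` on every retry is what makes repeated delivery safe; a fresh identifier per attempt would defeat the deduplication.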