|This page reflects partially outdated design strategy during the development of Release 1, in particular towards R1 LCA.|
The authoritative design for the CEI subsystem components, and in particular Elastic Computing is in the CI Architecture.
Do NOT modify this page anymore. Please make any modifications directly in the CEI architecture pages.
This document resides here in Confluence but there are important versions of it that are snapshotted and "released" with specific names on specific dates. Links to those "official" versions are here:
- Version 0.2: CEI-Design-v0.2-2010-08-20.pdf (Confluence version 22)
- Version 0.1: CEI-Design-v0.1-2010-04-20.pdf (Confluence version 15)
This document contains the definitive design of CEI software for the LCA review (August 2010) and several months following it. Some things that may be required in Release 1 may be called out inline with other design elements or they may be listed in the final "Possible Deficiencies" section of the document.
- 1. Introduction
- 2. Terms
- 2.1 Deployable type
- 2.2 Deployable unit
- 2.3 Operational unit
- 2.4 Contextualization
- 2.5 Service
- 2.6 Reliable Service / HA Service
- 2.7 Processes
- 3. Assumptions
- 3.1 Messaging Service
- 3.2 Capability Container
- 3.3 Exchange Space
- 3.4 Exchange Point
- 3.5 Reliable Data Storage
- 3.6 Stateful vs. Stateless Services
- 4. Design Overview
- 5. CEI Components
- 5.1 Provisioner
- 5.2 Deployable Type Registry Service
- 5.3 Sensor Aggregator
- 5.4 EPU Controller
- 5.4.1 EPU Controller Decision Engine
- 5.5 EPU Worker
- 5.6 OOI bootstrap commandline
- 6. Bootstrapping
- 7. High Availability
- 7.1 Definition of HA
- 7.2 Failure Matrix
- 8. Possible Deficiencies
CEI software is responsible for making services highly available.
An exchange point is a component (authored by COI) that allows an entity to address messages to just one endpoint in the greater distributed messaging system for any service type that it interacts with. When a service is made highly available by CEI software, entities never address a service instance directly; they use an exchange point instead.
The CEI software design is centered around a set of components collectively called an EPU (Elastic Processing Unit) or "EPU infrastructure." This EPU infrastructure will make sure the actual service instances that end up processing the messages directed to the exchange point are always available, never fail, and scale elastically to meet demand.
The main instruments used to make this happen are virtual machine instances launched via "infrastructure-on-demand" services like Nimbus and EC2, typically referred to as "IaaS" (Infrastructure-as-a-Service).
An abstract description/template/recipe of an environment in terms of software packages, OS, etc. When instantiated, it will perform a specific task in the OOI network. Deployable types are registered and made available for automatic instantiation or modification by users based on a configuration of software deployment packages. Deployable types are independent of any specific execution site format. Each particular type will have its own unique identifier in the Deployable Type Registry Service (a CEI component defined later).
- Example: VM template image that contains a particular OS (say, Ubuntu 9.04) with particular libraries (say, Python 2.6.4) that all run a specific version of the COI capability container and a specific set of services ("Transformer service v.0.4" etc.). A new permutation of software (even at the version level) necessitates a new unique identifier in the Deployable Type Registry Service.
A specific rendition of a deployable type, e.g., a VM image registered for use at the Amazon EC2 service or Nimbus repository. There could be simple and complex deployable units. Complex deployable units represent virtual clusters (a collection of VMs that share a security and configuration context) or e.g. a set of units representing a workflow platform. In practice, a "seed" deployable unit will be the actual image "bits" in the repository (see below).
- Example: In practice this will not be a particular AMI instance or Nimbus repository image, it will be a seed deployable unit coupled at boot time with whatever we need to contextualize on it to make an operational unit that represents the desired deployable type.
An instantiated (i.e., deployed) deployable unit, which by inheritance also means an instantiated deployable type. An operational unit is created at deploy time through the process of contextualization.
- Example: a contextualized, running instance of the desired deployable type. For example an Ubuntu 9.04 instance with Python 2.6.4 installed running a specific version of the COI capability container and a specific set of services ("Transformer service v.0.4" etc.) that were brought up during the VM's instantiation and contextualization process.
The process executed immediately after instantiation of a deployable unit before it becomes an operational unit. In practice this will be used in different phases of bringing an operational unit into existence.
- Security/enrollment bootstrap into the OOI network (bootstrapping the Capability Container)
- Other higher level registrations
- Turning Seed Deployable Unit into the required Deployable Unit
- A seed deployable unit is an optimization for launching operational units that the Provisioner (defined later in this document) will take. It saves a lot of human time to have a slim virtual machine registered for use at specific sites which is transformed into the desired instance of the deployable type that was requested by the system. It is deployment-type-specific whether or not this is a good strategy. Any time the strategy is used, it is an encapsulation behind "deployable type --> operational unit" which is the mapping that really matters to anything using the Provisioner service. Entities ask for deployable types and operational units are brought into being.
- Example: There are many ways we will use contextualization (bootstrapping etc.); some of the higher level scenarios are called out here.
An entity in the system that can be found and addressed by name and realizes a specific purpose. Nothing is known about the location of the service or its internal structure/implementation. It is registered in the COI service registry. Provided by a deployed software component package, integrated through a capability container.
- Consult this page for more about services.
- In the context of that explanation of services: an important idea for the EPU architecture is that there is an out-of-band inspection of the messages between Requesting Service and Providing Service.
- Picture: Service-Integration-Invisible-Hand.png (TODO)
- Example: "Transform service" "Process Registry Service" "Data Stream Registry"
A service that is backed by EPU infrastructure. It is addressable by a unique name that an entity can direct a request to. The messages are queued up at this exchange point but actually processed by unique service instances (the mechanics of this are explained in detail later in this document).
- Example: "HA-Transform service" (this document uses this fake example)
A service is a process that runs for the entire life of an operational unit. There is a notion of "process-process" (things started, and potentially cancelled, independently of operational units) but that is out of scope of this document.
The CEI software assumes that the following components and functionality exist already.
A Messaging Service provides a flexible, asynchronous way to deliver messages from an entity in the OOI system to any other entity (subject to policy). The current implementation relies heavily on RabbitMQ AMQP brokers.
A Capability Container is a container that runs service code and provides it (directly or via proxy) with any infrastructure service it needs. It is responsible for initializing service code and keeping it alive with the local system resources necessary. It allows application/service code to easily adapt to the Messaging Service. It subjects all service activity to the configured policies.
An Exchange Space is realized by a collection of Messaging Service instances that have a mutual security/namespace agreement that allows entities to address one another (subject to configured policies). A client or service is "enrolled" in the exchange space and from then on is a member of the system (this works much like a VPN that realizes an overlay network). The mechanics and details of this are outside the scope of this document.
An Exchange Point receives and manages messages, manages and fulfills subscriptions, has an identity, has a message persistence strategy, is reified across multiple brokers and is a finite state machine (FSM).
A data store will exist that allows a service to store information that will persist beyond crashes.
It must be transactional
- the CEI service writing information must know that a set of writes has completed (so that it can e.g. correctly move on with an internal state change).
It must be consistent
- the moment a CEI service writes something, a subsequent read by the service (e.g. if it is in recovery mode after a crash) should result in that written data being returned, not a previous value.
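These two requirements can be illustrated with a minimal sketch. This is not a real CEI or COI storage API; the class and method names here are invented for illustration. The point is the contract: a batch of writes either completes entirely before the call returns (so the caller can safely advance its internal state), and any read issued after that return must see the new value, not a stale one.

```python
# Illustrative sketch of the Reliable Data Storage contract assumed by
# CEI services. Names are hypothetical, not a real CEI API.

class ReliableStore:
    def __init__(self):
        self._data = {}

    def write_batch(self, updates):
        """Transactional: apply all writes as a unit, then return.

        The caller knows the set of writes has completed when this
        returns, so it may safely move on with an internal state change.
        """
        staged = dict(self._data)
        staged.update(updates)   # all-or-nothing: stage, then swap
        self._data = staged
        return True

    def read(self, key):
        """Consistent: a read after write_batch() returns sees the new value."""
        return self._data.get(key)

store = ReliableStore()
ok = store.write_batch({"controller/HA-Transform/instances": 3})
assert ok and store.read("controller/HA-Transform/instances") == 3
```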
There are two types of CEI services that will be written.
- A "stateful" service is configured by an external entity with information at boot time. The information is used during runtime and is not changed. Or if it were changed it is not of consequence to the high-availability scheme: i.e., the service can die and be restarted by an external entity without any participation of an up-to-date data storage read.
- A "stateless" service may only be minimally configured by an external entity with information at boot time. But it reads the information it needs from a reliable data storage service (see the last assumption "Reliable Data Storage) to recover from crashes. During its runtime, it is constantly updating this data storage system with any information it would need stored in order to recover gracefully.
Some OOI context is presumed.
Consult the following diagram:
The diagram only contains components in one exchange space that can address each other by name.
The "Transform-v2 service" is required to be highly available.
An instance of a component called the EPU Controller is started in order to make the Transform service highly available. It declares that an "HA-Transform" exchange point be created in the messaging fabric. This instance will be called "the EPU Controller for HA-Transform-v2."
Assume a highly available Provisioner service has been brought online in an exchange space, it is named "HA-Provisioner" and it will be explained later in the document how it itself is made highly available. Right now we are only discussing how non-CEI services are made highly available.
The provisioner is responsible for adapting to IaaS sites, enabling other CEI components to request contextualized VM instances and track their status (both in terms of VM lifecycle and contextualization status).
The EPU Controller for HA-Transform makes a request to the HA-Provisioner endpoint that an instance of a specific deployable type be started. This deployable type is known via configuration to start Transform service instances.
The provisioner launches a VM that, through contextualization, runs a capability container that runs an instance of the Transform service which we will call "Transform-0".
An onboard agent in the same capability container called an EPU worker will be configured to retrieve the next work message from the HA-Transform exchange point.
Now when a client (requesting service) of the HA-Transform service sends a message there will be an instance of the Transform service to accept the message.
A CEI component called the Sensor Aggregator was started at the same time this particular instance of the EPU Controller was started. It is specific to this service and we will call it "the Sensor Aggregator for HA-Transform."
It subscribes to information about the exchange point, the specific instances launched via the provisioner, and obtains any other data relevant to the EPU Controller's decisions.
Using information obtained from the Sensor Aggregator for HA-Transform, the EPU controller for HA-Transform is constantly evaluating the current state of the HA-Transform exchange point and the service instances. In response to certain situations, it can start/stop the appropriate numbers of compensatory instances to handle the current Transform service load. This decision is informed by the sensor aggregator's data as well as policies specific to the sites, situation, money, clients, etc. that are relevant.
Contains adapter logic needed for any context broker and IaaS implementation. Keeps track of the state of any VM instance or context that has been launched.
There is one provisioner in each exchange space; it is itself run as a high-availability, EPU-ified service. It is written to be stateless: an instance of it can be instantiated and use the data store to know what internal tasks it should launch to recover.
- Launch and contextualize a specific number of a specific deployable type at a specific site, identify the launched entities with unique client-provided identifier(s) (implementation choice is UUID(s)).
- Destroy a given operational unit(s) (client gives UUID(s)) that were launched
- Subscribe to the state of given operational unit's UUID(s) that were launched
Entities can subscribe to receive status updates about anything launched.
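The three operations above can be sketched as message payloads. The field names here are invented for illustration (the real wire format is not specified in this document); the shape shows the essential inputs: a deployable type, a site, client-provided UUIDs, and a subscriber for status updates.

```python
# Hypothetical sketch of the three Provisioner operations as messages.
import uuid

def launch_request(deployable_type, site, count):
    """Launch N instances of a deployable type at a site, identified by
    client-provided UUIDs."""
    ids = [str(uuid.uuid4()) for _ in range(count)]
    return {"op": "launch", "deployable_type": deployable_type,
            "site": site, "node_ids": ids}

def destroy_request(node_ids):
    """Destroy previously launched operational unit(s) by UUID."""
    return {"op": "destroy", "node_ids": list(node_ids)}

def subscribe_request(node_ids, subscriber):
    """Subscribe an entity to state updates for the given UUIDs."""
    return {"op": "subscribe", "node_ids": list(node_ids),
            "subscriber": subscriber}

req = launch_request("transform-v2-dt", "ec2-east", 2)
assert req["op"] == "launch" and len(req["node_ids"]) == 2
```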
In order to work as an IaaS and context broker client, it must be brought online with the necessary credentials. A root-owned file stores these secrets; they are written to it during contextualization, and the file only ever lives on operational units, never on deployable units or seed deployable units.
Relies on the Deployable Type Registry Service to look up "how to get from the needed type to a running instance"; see the subsection below, "Deployable Type Registry Service".
The Provisioner needs to look up in this registry service what a deployable type actually "means" in terms of what it needs to launch.
The Deployable Type Registry Service is essentially a key/value store that maps needed types ("Transformer Service v3") into most of the needed information for a launch.
The deployable unit that, along with contextualization, will allow the deployable type to be realized as an operational unit is assumed to have been deployed to the site in question.
- Deployable type
- Site to run it
- contextualization document template. This is described in detail later but it contains any necessary information to bring about the desired instance of the deployable unit (including the EC2 AMI identifier or Nimbus image name/location). This will also include whatever contextualization data it takes to do on-the-fly conversion of a seed deployable unit into the desired type of deployable unit as the operational unit is instantiated.
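The registry's key/value shape can be sketched as below. All keys and field values here are illustrative stand-ins (the AMI identifier and template contents are placeholders, not real data): a deployable type identifier maps, per site, to the seed image and contextualization document template needed for a launch.

```python
# Sketch of the Deployable Type Registry Service as a key/value lookup.
# Keys and values are invented placeholders for illustration.

DTRS = {
    "transform-v2-dt": {
        "ec2-east": {
            "image": "ami-placeholder",       # seed deployable unit at this site
            "ctx_template": "<ctx document template for Transform-v2>",
        },
        "nimbus-site": {
            "image": "transform-v2-seed.gz",
            "ctx_template": "<ctx document template for Transform-v2>",
        },
    },
}

def lookup(deployable_type, site):
    """Resolve what the Provisioner must launch for a type at a site;
    raises KeyError if the type or site is unknown."""
    return DTRS[deployable_type][site]

entry = lookup("transform-v2-dt", "nimbus-site")
assert entry["image"] == "transform-v2-seed.gz"
```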
The sensor aggregator obtains data about exchange points and EPU workers among other metrics like operational unit data/statuses. It uses various mechanisms to obtain this data and presents it all via subscription to the EPU controller.
Some examples of things it could monitor:
- Queue draining rate
- Available disk space of operational units
- CPU load of operational units
- Network load?
- Operational unit status 
 - The sensor aggregator is what subscribes to the Provisioner for state changes. Technically the EPU-Controller will make "create" calls to the Provisioner and cause the Sensor Aggregator to be subscribed (TODO: how can that happen exactly with the available mechanisms, the alternative is to have the EPU Controller call the SA and tell it to subscribe to a specific UUID which is not as compact/atomic of a procedure. It would be great to simply have a way to get a particular Sensor Aggregator to subscribe to any single instance that was started by a particular EPU Controller).
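The aggregator's role of folding heterogeneous readings into one view for its subscriber can be sketched as follows. The class and metric names are invented for illustration; the real mechanism would be message-driven over the exchange space rather than in-process callbacks.

```python
# Illustrative sketch of a Sensor Aggregator: collects readings from
# multiple sources and fans out the merged view to its subscriber
# (the EPU Controller).

class SensorAggregator:
    def __init__(self):
        self.view = {}
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def report(self, source, data):
        """Fold one sensor reading into the merged view and notify."""
        self.view[source] = data
        for cb in self.subscribers:
            cb(dict(self.view))

seen = []
agg = SensorAggregator()
agg.subscribe(seen.append)            # the controller's update callback
agg.report("queue", {"drain_rate": 12.5})
agg.report("node/uuid-1", {"state": "RUNNING", "cpu_load": 0.7})
assert seen[-1]["queue"]["drain_rate"] == 12.5
```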
Each unique reliable service has a unique EPU controller instance.
The main responsibility of the controller is to evaluate data from the sensor aggregator (see Sensor Aggregator section below) against policies and cause the correct compensation actions to occur if necessary.
All compensation actions will be attempted via messages to the HA-Provisioner.
Basic examples of actions: create one instance of deployable type XYZ, destroy one instance of deployable type XYZ
Each controller instance has to be monitorable: able to die unexpectedly, be brought back up by a fault compensation supervisor instance, and be able to continue its work where it left off.
Each controller instance is itself running in an operational unit.
It is bootstrapped (during the instantiation of the operational unit) with information:
- one exchange point to create and monitor (i.e., one HA service to provide)
- one deployable type it can launch as compensation
- policies/heuristics about the particular interaction patterns: metrics need context in order to make the right compensation decisions
- policies/heuristics about deployable type sizing
Each unique reliable service has a unique Sensor Aggregator instance and the controller subscribes to updates from this in order to get information about the running system.
The EPU Controller instance contains a stateless decision engine that is constantly evaluating the following inputs:
- sensor data
- given policies/heuristics
 - Live reconfiguration of policies/heuristics to use is out of scope of this document, the policies are presumed to be configured just once when the service is instantiated.
This decision engine makes the decision about what compensatory units to deploy, terminate or cleanup for the current situation given a set of constraints (in various dimensions).
The EPU Controller then is able to task the Provisioner with accomplishing its goals.
An error in provisioning (e.g. the IaaS site simply rejects the request) will result in new sensor data that needs to be evaluated during the next "pass" of the decision engine (see below).
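The stateless nature of the decision engine can be sketched as a pure function: each pass takes the latest sensor data and the configured policy and returns compensation actions for the Provisioner. The thresholds, field names, and action tuples below are invented for illustration; they are not the real policy language.

```python
# Hypothetical sketch of one decision engine pass: pure function of
# (sensor data, policy) -> list of compensation actions.

def decide(sensor, policy):
    actions = []
    healthy = sensor["instances"] - sensor["failed"]
    # Replace failed instances and respect the configured minimum.
    deficit = max(policy["min_instances"] - healthy, 0)
    # Scale up if the exchange point queue is backing up.
    if sensor["queue_depth"] > policy["queue_high_water"]:
        deficit = max(deficit, 1)
    if deficit:
        actions.append(("launch", policy["deployable_type"], deficit))
    elif healthy > policy["min_instances"] and sensor["queue_depth"] == 0:
        actions.append(("terminate", 1))   # gentle scale-down
    return actions

policy = {"min_instances": 2, "queue_high_water": 100,
          "deployable_type": "transform-v2-dt"}
# One instance has failed: the engine asks for a replacement.
assert decide({"instances": 2, "failed": 1, "queue_depth": 0}, policy) == \
       [("launch", "transform-v2-dt", 1)]
```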
How resources are represented will need to be elaborated in the future. It is not just an internal representation since policy writers need to be able to express their heuristics about how "much" resources and what kinds of resources will (likely) cause certain desired compensating behavior. This will either be hardcoded or very simple for the current scope.
TODO: define different states in which compensatory units can find themselves as viewed from the EPU controller instance.
Any message directed to the HA service address will be enqueued at one specific Exchange Point. The EPU controller monitors this Exchange Point but does not draw messages from it. Instead, via the Provisioner, it launches EPU workers that are contextualized to subscribe to this one Exchange Point.
Each message addressed to the reliable service is handled by a service instance in a particular operational unit. The applicable operational unit that can handle the messages in question runs an EPU worker agent.
An EPU worker agent requests the next work message from the exchange point. It is configured to either draw "any" next message or a message with a specific conversation ID (session). Conversation IDs are out of scope of this document.
The message is delivered and passed to the service instance running in the same COI capability container.
The consumption rate for each worker will be based on an "on-board" policy about how many messages in this interaction pattern it can be processing at once. The EPU worker agent is configured during the instantiation of the operational unit where it is running.
TODO: clarification of consumption rates
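The on-board consumption policy can be sketched as a bound on in-flight work. This is an illustrative stand-in (a local deque plays the role of the Exchange Point, and the limit name is ours): the worker draws at most N messages at a time, where N is configured during the instantiation of its operational unit.

```python
# Sketch of an EPU worker agent with an "on-board" concurrency policy.
from collections import deque

class EpuWorker:
    def __init__(self, exchange_point, max_in_flight):
        self.xp = exchange_point
        self.max_in_flight = max_in_flight   # on-board policy, set at boot
        self.in_flight = []

    def pull(self):
        """Request next work messages, up to the configured limit."""
        while len(self.in_flight) < self.max_in_flight and self.xp:
            self.in_flight.append(self.xp.popleft())

    def finish_one(self):
        """Hand one message's result back; frees capacity for the next pull."""
        return self.in_flight.pop(0)

xp = deque(["msg-1", "msg-2", "msg-3"])
worker = EpuWorker(xp, max_in_flight=2)
worker.pull()
assert worker.in_flight == ["msg-1", "msg-2"] and list(xp) == ["msg-3"]
```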
NOTICE: This component cannot be developed without an Exchange Point implementation. In lieu of this, a service is "just a service" and it can draw messages from a named queue as normal.
Described in next section "Bootstrapping"
This section describes how the working system comes into being. By understanding the bootstrap it will also be clear what starts/monitors the CEI software. The previous sections of this document explained how other software is made available using the CEI software but that assumed certain CEI software was running in the first place (like e.g. the HA-Provisioner).
The operator launches a program called epucontrol that will carry out each of the following steps.
Follow along with the following picture; it has a time axis from top to bottom. Each of the following steps happens in lockstep. There are no parallel steps: each service/VM is launched, then verified, and then the process moves forward.
Consult the following diagram. The circled numbers correspond to section numbers in the subsequent content of this document.
Before it starts, epucontrol has access to the following information:
- Security information required for each launch
- All current information for Deployable Type Registry Service
- TODO: enumerate everything else needed
It brings just one context broker online in bootstrap mode. This launch can only use the IaaS provided contextualization mechanism (i.e., Amazon EC2's user-data or equivalent) because there is no context broker available yet but the context broker itself needs to be contextualized.
It waits for the context broker to come online and makes a test call (sanity check). We want it to fail-fast, there should be no "debugging" later to find out that the root problem was that the context broker failed. This is a general principle for the whole bootstrap process.
It brings one Messaging Service operational unit online (we will move in mid-term to starting two as a virtual cluster using the context broker).
It waits for the Messaging Service to come online and runs a COI provided suite of sanity checks and initial configurations.
It brings several core COI services (Data Store, Registries of various kinds) online in bootstrap mode. In the same "batch" of services is the Deployable Type Registry Service (DTRS).
- Knowledge of the DTRS data already allowed the program to complete the previous steps (no component needed to consult a DTRS in order to know what exact thing to launch).
- This part of the bootstrap needs to be coordinated tightly with COI subsystem
It waits for the services to come online and runs a COI provided suite of sanity checks and initial configurations.
It also seeds DTRS with data (and runs sanity checks).
It launches one "Base instance" that has an EPU-Controller, a Sensor Aggregator, and a Provisioner-Provisioner for the HA-Provisioner service, seeding it with IaaS credentials.
- The EPU-Controller for the HA-Provisioner has a Provisioner-Provisioner in it that has IaaS credentials and the deployable type needed for Provisioner instances
- That Provisioner-Provisioner is always used to start Provisioner instances, not the HA-Provisioner that the EPU-Controller is making highly available.
The HA-Provisioner has to be bootstrapped and built differently than other EPU-ified services because the EPU-Controller for the HA-Provisioner cannot rely on the HA-Provisioner service to do work that it needs. Instead it relies on an on-board Provisioner.
Consult the following diagram:
It waits for the base CEI instance to come online and runs tests.
Any other high availability services are now brought up with their specific EPU Controller and Sensor Aggregator instances. The HA-Provisioner is used normally as discussed in this document. Any EPU controller and sensor aggregator can be run and they will interact with the HA-Provisioner that has been brought online.
Now is when redundant Messaging Service and COI Core Service instances would also be brought online.
Finally the epucontrol program daemonizes itself and serves as a supervisor for all of the nodes that it launched. This is called the bootstrap supervisor and is a catch-all fault monitor. In the future, supervisors will be operational units themselves and the "root" of watching the entire system will be the responsibility of a staffed operations team; the epucontrol program will then register those supervisors with the appropriate mechanisms and exit instead of daemonizing.
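The lockstep, fail-fast bootstrap above can be sketched as a simple loop: each step launches one piece of infrastructure and verifies it before the next step begins, so a broken context broker or messaging service is caught immediately rather than debugged later. The step names below are paraphrased from the text; the launch/verify hooks are hypothetical.

```python
# Sketch of epucontrol's strictly sequential, fail-fast bootstrap.

def bootstrap(launch, verify):
    steps = [
        "context-broker",            # IaaS user-data contextualization only
        "messaging-service",
        "coi-core-services+dtrs",    # includes seeding DTRS with data
        "base-cei-instance",         # EPU Controller + SA + Prov-Prov
        "other-ha-services",
    ]
    for step in steps:               # no parallel steps
        launch(step)
        if not verify(step):         # sanity checks; fail fast
            raise RuntimeError("bootstrap failed at step: %s" % step)
    return steps

log = []
done = bootstrap(log.append, lambda step: True)
assert done[0] == "context-broker" and len(log) == 5
```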
One of the primary requirements of a reliable service (an "EPU-ified service") is that it does not ever go down, it must "always" be available.
- Strawman definition of "always"
- At most 0.001% unanticipated downtime (i.e., five nines of availability) for user observable services; for an entire month's deployment this is only about 26 seconds!
- Strawman definition of "user observable"
- A module's service interaction messages are the user observable services that the EPU "fulfills."
- Strawman definition of "downtime"
- All EPU workers need to pick up messages within a certain time period
- The current idea of what that time period is: one or two seconds
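The strawman arithmetic checks out: five nines of availability leaves 0.001% of the month as allowed downtime, which over a 30-day month is roughly 26 seconds.

```python
# Verifying the "26 seconds" figure for five nines over one month.
month_seconds = 30 * 24 * 60 * 60            # 2,592,000 s in a 30-day month
allowed_downtime = month_seconds * 0.00001   # 0.001% expressed as a fraction
assert round(allowed_downtime) == 26         # ~26 seconds of downtime allowed
```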
This table explains what happens when any operational unit in the system is corrupted.
|Operational Unit||Services on it||Failure notes|
|Your laptop||epucontrol program||This is the root supervisor, it is serving as the "last resort" supervision code for the timebeing. If it itself goes down, there is nothing left but a human to notice. This code will expand in the future: it will register supervisors with the operations team and exit.|
|#1||Context Broker #0||epucontrol monitors and restarts this instance|
|#2||MessagingService #0 (includes all Exchange Point instances)||EPU Controller & Sensor Aggregator for HA-MessagingService (and HA-Provisioner) ensure there are always at least two MessagingService instances (or N if policy/situation requires it, or one if in development mode)|
|#3-N||MessagingService #1-N (includes all Exchange Point instances)||EPU Controller & Sensor Aggregator for HA-MessagingService (and HA-Provisioner) ensure there are always at least two MessagingService instances (or N if policy/situation requires it, or one if in development mode)|
|#4||CoreServices+DTRS #0||EPU Controller & Sensor Aggregator for HA-CoreServices (and HA-Provisioner) ensure there are always at least two CoreServices+DTRS instances (or N if policy/situation requires it, or one if in development mode)|
|#5-N||CoreServices+DTRS #1-N||EPU Controller & Sensor Aggregator for HA-CoreServices (and HA-Provisioner) ensure there are always at least two CoreServices+DTRS instances (or N if policy/situation requires it, or one if in development mode)|
|#6||Provisioner #0||EPU Controller & Sensor Aggregator & Provisioner-Provisioner for HA-Provisioner monitors and restarts this instance|
|#7-N||Provisioner #1-N||EPU Controller & Sensor Aggregator & Provisioner-Provisioner for HA-Provisioner monitors and restarts this instance if it exists (or N if policy/situation requires it)|
|#8 (Base CEI Instance)||EPU Controller & Sensor Aggregator & Provisioner-Provisioner for HA-Provisioner||epucontrol monitors and restarts this instance, there is only a need for one of them|
|#9||EPU Controller & Sensor Aggregator for HA-MessagingService||epucontrol monitors for failure and restarts|
|#10||An EPU Controller & Sensor Aggregator instance for each HA core service||epucontrol monitors for failure and restarts|
|#N||ServiceX #0||EPU Controller & Sensor Aggregator for HA-ServiceX (and HA-Provisioner) monitors and restarts this instance if necessary|
|#N||ServiceX #1||EPU Controller & Sensor Aggregator for HA-ServiceX (and HA-Provisioner) monitors and restarts this instance if necessary|
|#N||ServiceY #0||EPU Controller & Sensor Aggregator for HA-ServiceY (and HA-Provisioner) monitors and restarts this instance if necessary|
|#N||ServiceZ #0||EPU Controller & Sensor Aggregator for HA-ServiceZ (and HA-Provisioner) monitors and restarts this instance if necessary|
Without two-level scheduling (direct management of scheduling/compensating service instances inside the first layer of scheduling/compensating operational units), there seems to be too much opportunity for underused VM instances in a scale-down situation.
A compromise strategy might be to have an EPU Controller that declares sets of Exchange Points, say a bundle of 5-10 very common services. One deployable type is provisioned that causes one new operational unit to exist which has these 5-10 very common services.
Context broker does not currently get AMQP messages and so the HA strategy for it will need to be discussed. We think it can currently handle thousands of concurrent instances so keeping one instance alive should be "ok" for now if it can recover from failures gracefully. The contextualization process includes a "keep trying" semantic in the VM ctx-agents, so if the broker is not around for a minute the agents will simply retry until it is available again.
If "one or two seconds" is all that a service can be absent, the strategy in this document will not work for HA services that only have one operational unit servicing messages from an exchange point.
Considering it could take upwards of a few minutes to launch a VM instance to replace a corrupted operational unit, there would have to be two always running, probably in different "availability zones" (essentially: different data centers with different networks and power).