The focus of CEI-R2 is to extend and utilize the CEI-R1 functionality by delivering the following services:
Elastic Computing Services—extends the elastic processing unit (EPU) infrastructure.
Execution Engine Catalog & Repository Services—provides the models and capabilities to describe execution engines for processes.
Resource Management Services—establishes standard models for the operational management (monitoring and control) of stateful and taskable resources.
Process Management Services—provides the scheduling and management services for policy-driven process execution at specified execution sites. The service supports the coupling of the dynamic data distribution service with the process and its triggering. Provenance and citation annotations are registered, associating the input and output products with the execution process and its operating context.
Process Catalog and Repository Services—maintains process definitions and references to registered process engine configurations and execution sites.
Integration with National Computing Infrastructure—provides the capability to deploy OOI processing, in particular data stream processing, onto the national/commercial computing infrastructure, with a focus on Amazon Web Services (EC2).
Subsystem Service Groups
This release of this subsystem is composed of the following subsystem service groups:
Scheduling, provisioning, and monitoring services to maintain a balanced deployment of Operational Units (virtual compute nodes) to available computational resources (servers), considering node failures and changing demand. Provide a high-level planner of EPU needs (e.g., this type of EPU of this size in this timeframe).
The purpose of elastic services is to match demand while ensuring basic reliability and quality properties. This service component will implement a Decision Engine (a.k.a. a Planner Service) capable of planning elastic computing needs so that operational units are deployed based on applying various policies to resource availability. More specifically, the engine will have the following properties:
• Sophisticated policy and constraint inputs, e.g., “X VMs of type Y will need to be deployed on resources of type Z in N hours”
• State input via sensors, e.g., resource availability now and in the future
• Output is a plan to be carried out by the provisioner, e.g., “deploy X VMs of type Y on resources Z”
The Decision Engine will be implemented as a service to allow for dynamic policy and state adjustment and it will communicate via OOI-specific interfaces (and thus support integration with e.g., the governance framework). The scope of the Decision Engine is central/global to an administrative domain (i.e., operates on state that is global to an administrative domain) but may be implemented as a composition of global and local components. Process management services will provide sensor input to the Decision Engine for resource management.
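As a minimal sketch of how such a Decision Engine could map policy inputs and sensor state to a plan for the provisioner (the names `Policy`, `Plan`, and `decide` are illustrative assumptions, not OOI interfaces):

```python
from dataclasses import dataclass

@dataclass
class Policy:
    """Hypothetical constraint: deploy vm_count VMs of vm_type on resource_type within deadline_hours."""
    vm_type: str
    vm_count: int
    resource_type: str
    deadline_hours: int

@dataclass
class Plan:
    """Output handed to the provisioner: 'deploy X VMs of type Y on resources Z'."""
    vm_type: str
    vm_count: int
    target_resource: str

def decide(policies, availability):
    """Produce provisioning plans for each policy whose target resource has capacity now.

    `availability` maps resource type -> free VM slots (the sensor state input).
    Policies that cannot be satisfied yet are deferred and returned as pending.
    """
    plans, pending = [], []
    for p in policies:
        if availability.get(p.resource_type, 0) >= p.vm_count:
            plans.append(Plan(p.vm_type, p.vm_count, p.resource_type))
            availability[p.resource_type] -= p.vm_count
        else:
            pending.append(p)
    return plans, pending
```

In the real service the policy and availability inputs would arrive over OOI messaging interfaces and could be adjusted dynamically, per the paragraph above.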
Services to support the contextualization of deployable units in registered execution environments. Contextualization sets the basic parameters for an instantiated process, such as network configuration and IP address, so that it can play its part in a network of processes. Includes the requirement to adapt the process instance to the resources provided by the execution environment, such as memory, computing power, storage, etc.
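A contextualization step of this kind could be sketched as follows, assuming a hypothetical launch template that names its required parameters and the execution site reports its environment as a mapping (none of these key names are from the OOI specification):

```python
def contextualize(template, environment):
    """Fill a deployable unit's launch template with site-specific parameters.

    `template["required"]` lists parameter keys the process needs (e.g. a broker
    host and IP address); `environment` is what the execution site reports.
    Raises ValueError if a required parameter is unavailable at the site.
    """
    missing = [k for k in template["required"] if k not in environment]
    if missing:
        raise ValueError(f"cannot contextualize, missing: {missing}")
    context = {k: environment[k] for k in template["required"]}
    # Adapt resource limits to what the site actually provides (the
    # "adapt the process instance" requirement above).
    context["memory_mb"] = min(template.get("memory_mb", 512),
                               environment.get("memory_mb", 512))
    return context
```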
Capability as a service to request, instantiate, configure, and control EPUs on demand, for instance for user processes such as instrument agents, data transformations, and visualizations. This extends EPUs from supporting services to supporting processes.
Multi-site EPU Management
Capability to manage EPUs with Operational Units in multiple sites. Includes leader election in the case of network partitioning, and atomic decision making.
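One common way to make leader election safe under network partitioning is a quorum rule: a site only acts as leader if its partition holds a strict majority. The sketch below assumes this approach (the function and its deterministic lowest-id tiebreak are illustrative, not the mandated CEI design):

```python
def elect_leader(site_id, reachable_sites, all_sites):
    """Quorum-based leader check for a partitioned multi-site deployment.

    A site may take leadership only if its partition contains a strict
    majority of all sites; within the quorum the leader is the lowest
    site id, so every member reaches the same decision deterministically.
    Minority partitions get no leader and must refuse to make decisions.
    """
    partition = {site_id} | set(reachable_sites)
    if len(partition) * 2 <= len(all_sites):
        return None  # minority (or exact half): no quorum, no leader
    return min(partition)
```

The majority requirement guarantees at most one partition can elect a leader at a time, which is what makes cross-site decisions effectively atomic.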
Provide the operator interfaces and capabilities to operate and monitor the system and underlying compute infrastructure. This includes user interfaces, network monitoring and state of health monitoring integration, system statistics, troubleshooting.
Working with operations and ITV, develop small tools to upload and sync the different deployable type representations adapted to each site. After syncing, the deployable types will be registered with DTRS and any execution engines (a type of deployable type) will be registered with the execution engine registry service.
Capability to register and inspect the execution engines that are available to a particular user. Based on the COI resource registry [can this do authorization filtering?]. Examples of execution engines include the different capability containers, Matlab, the SQLStream processing engine, Grails, etc.
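Whether or not the COI resource registry performs the filtering itself (the open question above), the per-user view could be produced by a role-overlap filter like this sketch (the registry entries and role names are invented for illustration):

```python
# Hypothetical registry entries; in practice these would come from the
# COI resource registry, not a module-level list.
ENGINES = [
    {"name": "cc-python", "allowed_roles": {"operator", "researcher"}},
    {"name": "matlab",    "allowed_roles": {"researcher"}},
    {"name": "sqlstream", "allowed_roles": {"operator"}},
]

def engines_for(user_roles, registry=ENGINES):
    """Return the execution engines visible to a user: those whose
    allowed roles intersect the user's roles."""
    return [e["name"] for e in registry if e["allowed_roles"] & set(user_roles)]
```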
Execution engine agent
Apply the resource agent framework for an execution engine agent. This agent represents and manages one execution engine instance and is the contact point for dispatching work to this execution engine.
Define the data model, as resources and associations, for taskable resources: stateful resources with behavior. Examples of taskable resources are EPUs, Operational Units, Agents, Processes, and Services. Define the life cycle model and metadata as related to the life cycle. We will assume that taskable resources become available after bootstrap.
Taskable resource registration
A resource registry for taskable resources, based on the generic resource registry framework.
Taskable resource management
Services accessible with a messaging interface for the management of generic taskable resources. Service implementations will integrate service-specific behavior. The services will include:
(1) Generic Provisioner: Provision new generic taskable resources on request (as in a factory pattern). Retire and remove taskable resources on request and when no longer needed (this can potentially happen either via the provisioner or by interacting with the resource itself).
(2) Planner: Accept requests for need of generic taskable resources and plan their provisioning according to policy. Planning leads to resource provisioning and control commands. The scope of this functionality for release 2 is preliminary and does not include co-scheduling.
(3) Controller: Control and monitor taskable resources. Control is provided in the form of making requests to a resource agent as specified by the taskable resource model (e.g., start/stop). Monitor the resource and react to abnormal conditions as recommended by the planner.
(4) Fault Monitor: Observes resource conditions coming directly from the resource via the agent and detects abnormal situations.
Apply the taskable resource model to SLA management for generic taskable resources, i.e. integrate the taskable resource model with COI Governance framework and negotiation between agents (similar to other taskable resource integration described below).
Develop a template for resource agents based on governance agents. These agents manage taskable resources and should be able to control, monitor, represent, and govern resources in general, as well as specific resources. Generalize the work done by the IPA team.
Integrate with other subsystem teams on the management of specific subclasses of taskable resources. This includes initially:
(1) Marine observatory resources: Collaborate with the IPAA and S&A teams about instruments and platforms, which are special kinds of taskable resources, fronted by resource agents. These agents represent taskable resources in the marine observatories, which will be managed with CEI resource management services.
(2) Services: Collaborate with the COI team on the management of services and bringing them through their life cycle.
(3) Data management and storage resources: Collaborate with the DM and EOI teams about taskable resources around data management, their uniform management, and their specializations. This does not include information resources, but other taskable resources, such as external data sources, storage sites, archive providers, etc.
The integration will rely on revising the general resource management model with the teams; integration of specific functionality into the model will be performed by domain experts.
Services to manage the scheduling of processes on virtual compute nodes. Includes the requirement to notify the initiating actor of estimated turnaround and to incorporate the initiating actor's constraints on execution resource type into the scheduling process. Processes can be scheduled as always-on, time-scheduled, or on-demand.
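The three scheduling modes can be captured in a single dispatch decision; this sketch assumes a process record with a `mode` field and, for time-scheduled processes, a `(start, end)` window of datetimes (the field names are illustrative):

```python
from datetime import datetime

def should_run(process, now, demand_signal=False):
    """Decide whether a process is due to run under the three scheduling modes.

    'always_on' runs continuously; 'time_scheduled' runs inside a
    (start, end) window of datetimes; 'on_demand' runs only when an
    explicit demand signal arrives.
    """
    mode = process["mode"]
    if mode == "always_on":
        return True
    if mode == "time_scheduled":
        start, end = process["window"]
        return start <= now <= end
    if mode == "on_demand":
        return demand_signal
    raise ValueError(f"unknown scheduling mode: {mode}")
```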
Provide the capability to deploy OOI processing, both data stream processing and ocean models, onto the national computing infrastructure. For this release, the focus is on Amazon's cloud services.
Framework for integration of external cloud providers
Provide an extensible general-purpose framework for the integration of external cloud providers, such that the Elastic Computing services can be extended to provision execution engines in these clouds, connected to and accessible by the remaining system network. The only implementation supported in this release is Amazon EC2, as the commercial cloud provider.
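Such an extension point is commonly expressed as an abstract provider interface with one binding per cloud; the sketch below assumes this pattern, with `EC2Provider` as an illustrative stand-in that makes no real API calls (the class and method names are not from the OOI codebase):

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """Extension point: each external cloud implements this interface."""
    @abstractmethod
    def launch(self, image, count):
        """Start `count` instances of `image`; return their instance ids."""

class EC2Provider(CloudProvider):
    """Illustrative stand-in for the Amazon EC2 binding (no real API calls)."""
    def __init__(self):
        self._next = 0
        self.instances = {}  # instance id -> image

    def launch(self, image, count):
        ids = []
        for _ in range(count):
            self._next += 1
            iid = f"i-{self._next:08x}"
            self.instances[iid] = image
            ids.append(iid)
        return ids

def provision_engines(provider: CloudProvider, image, count):
    """Elastic Computing entry point: provider-agnostic provisioning, so a
    new cloud is supported by adding a CloudProvider subclass only."""
    return provider.launch(image, count)
```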