Respond instantly to network issues affecting system
|Actors||Integrated Observatory Operator|
|References||UC.R2.46 Operate Integrated Observatory Network, UC.R1.29 Monitor System|
|Uses||UC.R2.55 Manage Help Ticket|
|Is Used By|
|Is Extended By|
|In Acceptance Scenarios||None|
|Technical Notes||This use case focuses on the network part of the Integrated Observatory, not the entire system.|
|Primary Service||Resource Management Services|
|UC Status||Mapped + Ready|
|UX Exposure||ONC, MNC|
This information summarizes the Use Case functionality.
Network monitoring software is operated by the Operations team (the Integrated Observatory Operators), and the network is also monitored by network and observatory monitoring software operating within the Integrated or Marine Observatories. A network failure is detected through one or more of those paths. Regardless of the source, any automated detection of a network outage should immediately trigger the following actions: (a) automated annunciation of the detection to the Integrated Observatory and Marine Operator displays, and possibly to the helpdesk ticketing system; (b) automated notification of all Operations personnel on duty; (c) automated update of Observatory web pages and Operations displays; and (d) where feasible and safe, automated steps to recover from the network failure. Resolution of the issues is reported as described in the UC.R2.55 Manage Help Ticket use case.
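The fan-out from a single detection to actions (a) through (d) can be illustrated with a minimal sketch. This is not the Integrated Observatory implementation: the `OutageEvent` and `OutageDispatcher` names, the handler functions, and the `wan-link-1` component are all hypothetical, and each handler is a placeholder that only returns a description of what a real action would do.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class OutageEvent:
    """A detected network outage (all field names are illustrative)."""
    source: str      # monitoring system that raised the alert
    component: str   # affected network component
    detail: str      # human-readable description


Handler = Callable[[OutageEvent], str]


@dataclass
class OutageDispatcher:
    """Runs every registered action for each detected outage."""
    handlers: List[Handler] = field(default_factory=list)

    def register(self, handler: Handler) -> Handler:
        self.handlers.append(handler)
        return handler

    def dispatch(self, event: OutageEvent) -> List[str]:
        results = []
        for handler in self.handlers:
            try:
                results.append(handler(event))
            except Exception as exc:
                # One failing action must not block the remaining actions.
                results.append(f"{handler.__name__} failed: {exc}")
        return results


dispatcher = OutageDispatcher()


@dispatcher.register
def annunciate_to_displays(event: OutageEvent) -> str:
    # (a) placeholder: push the alert to operator displays and ticketing
    return f"annunciated {event.component} outage from {event.source}"


@dispatcher.register
def notify_on_duty_staff(event: OutageEvent) -> str:
    # (b) placeholder: notify all Operations personnel on duty
    return f"notified on-duty staff about {event.component}"


@dispatcher.register
def update_status_pages(event: OutageEvent) -> str:
    # (c) placeholder: refresh web pages and Operations displays
    return f"status pages updated for {event.component}"


@dispatcher.register
def attempt_recovery(event: OutageEvent) -> str:
    # (d) placeholder: recovery is attempted only where feasible and safe
    return f"recovery for {event.component} left to network design"


if __name__ == "__main__":
    event = OutageEvent("Intermapper", "wan-link-1", "link down")
    for line in dispatcher.dispatch(event):
        print(line)
```

Running the actions independently, so that a failure in one is recorded rather than propagated, reflects the requirement that every detection triggers all four responses regardless of source.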
- The same monitoring tools will be used for San Diego networks and the WAN.
- The CI team is not assuming responsibility for Marine IO networks, except where the networks connect.
- Marine IOs will notify Integrated Observatory Operators if there is an outage on one of the marine observatory segments.
- The WAN provider will provide advance notification of any planned outages, and will notify OOI of any unplanned outages as soon as they occur.
- CI tools can automatically generate JIRA tickets (possibly through sending email).
- CI system operators will need to configure monitoring tools for appropriate frequency of alert notifications.
- San Diego will have a switch/router in place to interface to the Integrated Observatory network.
- The Integrated Observatory will have multiple systems to monitor the network. All of these systems automate the reporting of problems. (See list in Comments.)
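Two of the assumptions above, ticket creation by email and a configurable alert frequency, can be sketched together. The following is a minimal illustration only: `AlertThrottle`, `build_ticket_email`, and `maybe_alert` are hypothetical names, the `example.org` addresses are placeholders, and whether the JIRA installation actually accepts tickets by email is an open assumption in this use case. The sketch builds the message but does not send it.

```python
from email.message import EmailMessage
from typing import Dict, Optional


class AlertThrottle:
    """Allows at most one notification per key every `min_interval_s` seconds."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self._last_sent: Dict[str, float] = {}

    def allow(self, key: str, now: float) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval_s:
            return False  # too soon after the previous notification
        self._last_sent[key] = now
        return True


def build_ticket_email(component: str, detail: str) -> EmailMessage:
    """Builds (but does not send) an alert email for the ticketing system."""
    msg = EmailMessage()
    msg["From"] = "monitor@example.org"    # hypothetical sender address
    msg["To"] = "helpdesk@example.org"     # hypothetical ticket-intake mailbox
    msg["Subject"] = f"[network] {component}: {detail}"
    msg.set_content(f"Automated alert for {component}.\n{detail}\n")
    return msg


def maybe_alert(
    throttle: AlertThrottle, component: str, detail: str, now: float
) -> Optional[EmailMessage]:
    """Returns an email to send, or None if the alert is throttled."""
    if not throttle.allow(component, now):
        return None
    return build_ticket_email(component, detail)
```

With `AlertThrottle(300)`, a second alert for the same component 100 seconds after the first is suppressed, while an alert for a different component is still allowed. The interval is the kind of setting operators would tune per monitoring tool.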
ION Network is operational.
- When problems on the network occur (layers 1-4), a message is sent from the detecting monitoring systems to the Integrated Observatory and/or Help Desk systems.
- Likely to be an email in most cases.
- Help desk software can more naturally process email, and possibly can notify the Integrated Observatory.
- Operations will determine if duplicate alerts are covering the same issue.
- JIRA system notifies the Help Desk operations team of the problem.
- Operations will notify the Marine Observatory Operators of any network outages.
- Marine Observatory Operators will be defined as watchers on any appropriate JIRA tickets.
- <3> Integrated Observatory displays for Integrated and Marine Observatory Operators are automatically updated.
- Must decide whether displays are updated directly from initial error detection, or from Help Desk; a likely scenario (because it may be simple to implement) is that a display of network-component-related Jira tickets shows a new open ticket of highest urgency.
- For more detailed review, operators will look at dedicated application outputs.
- Intermapper should be used for the majority of staff who need visibility into network status. This includes the Marine Observatory Operators.
- Developer access to Solarwinds also makes sense.
- <3> Outages are automatically reflected on publicly visible status web pages.
- A single page should consolidate the most important status reports.
- In the event no Integrated Observatory system can present a status page, an external system should be able to put up a fail page on behalf of ION.
- Automated network recovery occurs per the design of the network.
- The network recovery can take many forms: channel bonding (at layer 2), route redirection (at layer 3), DNS resiliency (hidden primary with many secondaries), DHCP resiliency (failover configured), A10 resiliency (multiple devices), also RSA, TACACS+, and switch resiliency (cross-connected hypervisors).
- When systems auto-recover, updates to Integrated and Marine Observatory Operators, Jira, and public status pages are desirable, but may not be possible in Release 2.
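Because several monitoring systems may report the same failure, Operations must decide which alerts cover the same issue. A tool could assist that decision by grouping alerts for the same component that arrive close together. The sketch below assumes a hypothetical `Alert` record and a hypothetical `window_s` setting; it proposes candidate groups only and does not replace the operator's judgment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    source: str       # monitoring system that sent the alert
    component: str    # affected network component
    timestamp: float  # arrival time in seconds


def group_probable_duplicates(
    alerts: List[Alert], window_s: float
) -> List[List[Alert]]:
    """Groups alerts on the same component whose gaps are within window_s."""
    groups: List[List[Alert]] = []
    open_group: Dict[str, List[Alert]] = {}  # latest group per component
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_group.get(alert.component)
        if current is not None and alert.timestamp - current[-1].timestamp <= window_s:
            current.append(alert)       # likely the same underlying issue
        else:
            current = [alert]           # start a new candidate issue
            open_group[alert.component] = current
            groups.append(current)
    return groups
```

Each resulting group could map to a single ticket, with later alerts in the group added as comments rather than opening new tickets.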
Displays and web pages accurately reflect network status during and after any outage.
These comments provide additional context (usually quite technical) for editors of the use case.
Operations must discuss scenarios, management systems, and network interfaces with CGSN and RSN.
An assumption was that "Both audio and visual signals will alert the Operations team to problems." It is not clear how the Release 2 system will present an audio signal, so this assumption has been removed.
The CI team is using the following automated monitoring systems:
- Intermapper (SNMP alerts and high-level holistic view)
- Solarwinds (detailed network performance and hypervisor/VM performance tool, application performance monitoring and reporting)
- NetMRI (configuration backup / version control, configuration management)
- Traffic Sentinel (sFlow collection and flow reporting)
- Lancope (NetFlow collection, application performance monitoring, and security)
- Statseeker (generates detailed network interface statistics)
- DSView3 with power manager (provides power levels/use and cycling)
- APC UPS (provides UPS issues)