Peer worker processes gracefully recover a failed process and its state.
|Actors||Integrated Observatory Operator|
|References||UC.R1.25 Assure Reliability|
|Is Used By|
|Is Extended By|
|In Acceptance Scenarios||AS.R2.04B Curate Data Products|
|Technical Notes||This use case advances the R1 use case cited in the References.
The use case is an automated (internal) use case; the operator is involved only to receive notifications.
|Primary Service||Capability Container & Distributed Service Infrastructure Part 2|
|UC Status||Mapped + Ready|
This information summarizes the Use Case functionality.
Peer worker processes in a service take over for a failed process, recover its conversation state, and roll back or compensate any side effects of incomplete processing. Use a distributed state repository to automatically recover failed processing. Compensate for non-idempotent actions (actions that cause side effects).
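One way to picture compensation for non-idempotent actions is a journal kept in the state repository: each side effect is recorded together with an undo step, and a peer that takes over replays the undo steps for any message that never completed. The sketch below is illustrative only. `StateRepository`, `JournalEntry`, and `compensate_incomplete` are hypothetical names, and an in-memory dictionary stands in for the distributed state repository.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class JournalEntry:
    """One recorded side effect and the action that undoes it (hypothetical)."""
    description: str
    compensate: Callable[[], None]


@dataclass
class StateRepository:
    """In-memory stand-in for the distributed state repository (assumption)."""
    journals: Dict[str, List[JournalEntry]] = field(default_factory=dict)
    completed: Set[str] = field(default_factory=set)

    def record(self, message_id: str, entry: JournalEntry) -> None:
        # Append a side effect to the journal of the message that caused it.
        self.journals.setdefault(message_id, []).append(entry)

    def mark_complete(self, message_id: str) -> None:
        # A completed message needs no compensation.
        self.completed.add(message_id)

    def incomplete(self) -> List[str]:
        # Messages with journaled side effects that never completed.
        return [m for m in self.journals if m not in self.completed]


def compensate_incomplete(repo: StateRepository) -> List[str]:
    """Undo side effects of incomplete messages, newest effect first."""
    undone = []
    for message_id in repo.incomplete():
        for entry in reversed(repo.journals[message_id]):
            entry.compensate()
            undone.append(entry.description)
        del repo.journals[message_id]  # journal is no longer needed
    return undone
```

A peer process would call `compensate_incomplete(repo)` after taking over and before re-queuing the affected messages. Undoing in reverse order is a design choice borrowed from transaction rollback; the use case itself does not prescribe an order.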
A process has failed in a way that monitoring services can identify.
- Monitoring service detects that a process is non-performant.
- Process may have gone away, locked up (become non-responsive), or just stopped processing messages mid-stream.
- Monitoring service sends a message to the recovery service reporting that the process has failed.
- This message should produce an alarm-level event (that is, an alarm-level notification should probably be configured for it).
- <1> Recovery service analyzes operational state of failed process to determine necessary corrective action(s).
- Was the process actively handling one or more messages? If not, no recovery is required beyond a restart.
- Identify side effects to the extent possible. Relevant distributed state repository contents can be examined for changes since the last message was successfully processed.
- Have the in-process message(s) already been re-routed to other queues and handled? Handle any known side effects as well as possible given that status, then re-queue messages.
- <2> Restart process if necessary.
- Other processes may already be taking up the slack, or the system may have already restarted the process when its absence was noted.
- <2> Notify Integrated Observatory Operator of identified impacts and corrective actions.
- Again, the generated event(s) should result in alert(s).
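The steps above can be sketched as a detection check plus a planning function that turns the state of the failed process into an ordered list of actions. This is a minimal sketch under stated assumptions, not the platform's implementation: heartbeat-based detection, the 30-second timeout, the `ProcessStatus` fields, and the action names are all hypothetical. It also assumes that messages a peer has already handled need no compensation or re-queuing, which is one reading of the step on re-routed messages.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

HEARTBEAT_TIMEOUT_S = 30.0  # assumed threshold, not taken from the use case


@dataclass
class ProcessStatus:
    """Hypothetical snapshot of a worker as seen by the monitoring service."""
    process_id: str
    last_heartbeat: float                                        # seconds
    in_flight: List[str] = field(default_factory=list)           # message ids being processed
    handled_elsewhere: List[str] = field(default_factory=list)   # already handled by peers
    restarted: bool = False                                      # platform already restarted it


def has_failed(status: ProcessStatus, now: float,
               timeout: float = HEARTBEAT_TIMEOUT_S) -> bool:
    """Treat a process as failed when its heartbeat is older than the timeout."""
    return now - status.last_heartbeat > timeout


def plan_recovery(status: ProcessStatus) -> List[Tuple[str, str]]:
    """Return the ordered (action, target) pairs for one failed process."""
    actions = [("raise_alarm", status.process_id)]
    for msg in status.in_flight:
        if msg in status.handled_elsewhere:
            continue  # a peer already took up the slack for this message
        actions.append(("compensate", msg))  # undo known side effects first
        actions.append(("requeue", msg))     # then hand the message to a peer
    if not status.restarted:
        actions.append(("restart", status.process_id))
    actions.append(("notify_operator", status.process_id))
    return actions
```

For a process that was idle and has already been restarted, the plan reduces to raising the alarm and notifying the operator, which matches the case where no recovery is required beyond restarting.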
Process has been recovered to the maximum extent possible.
These comments provide additional context (usually quite technical) for editors of the use case.