Overview of "Recover Failed Process" Use Case

Peer worker processes gracefully recover a failed process and its state.

Tip: Key Points
UC Priority = 4 or 5: Critical, is in R2
Only boldface steps are required
<#> before a step -> lower priority
(optional) -> run-time option


Refer to the Product Description and Product Description Release 2 pages for metadata definitions.

Actors: Integrated Observatory Operator
References: UC.R1.25 Assure Reliability
Is Used By:
Is Extended By:
In Acceptance Scenarios: AS.R2.04B Curate Data Products
Technical Notes: This use case advances the R1 use case cited in the References. The use case is an automated (internal) use case; the operator is involved only to receive notifications.
Lead Team: COI
Primary Service: Capability Container & Distributed Service Infrastructure Part 2
Version: 1.6
UC Priority: 3
UC Status: Mapped + Ready
UX Exposure: SYS


This information summarizes the Use Case functionality.

Peer worker processes in a service take over for a failed process, recover its conversation state, and roll back or compensate any side effects of incomplete processing. Use a distributed state repository to automatically recover failed processing. Compensate for non-idempotent actions (actions that cause side effects).
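The takeover described above can be illustrated with a minimal sketch. It assumes that each worker persists its conversation state, including an ordered list of the side effects it has applied, to the distributed state repository before acting. All names here (ConversationState, StateRepository, take_over, the compensators mapping) are hypothetical and are not part of the use case; the in-memory repository merely stands in for the real one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ConversationState:
    """Hypothetical record a worker persists while handling one message."""
    message_id: str
    completed: bool = False
    # Names of side effects already applied, in the order they happened.
    applied_effects: List[str] = field(default_factory=list)


class StateRepository:
    """In-memory stand-in for the distributed state repository (assumption)."""

    def __init__(self) -> None:
        self._by_process: Dict[str, ConversationState] = {}

    def save(self, process_id: str, state: ConversationState) -> None:
        self._by_process[process_id] = state

    def load(self, process_id: str) -> Optional[ConversationState]:
        return self._by_process.get(process_id)


def take_over(
    repo: StateRepository,
    failed_process_id: str,
    compensators: Dict[str, Callable[[], None]],
) -> Optional[str]:
    """Undo the failed process's incomplete side effects, newest first.

    Returns the id of the message a peer should reprocess, or None when
    nothing was in flight at the time of failure.
    """
    state = repo.load(failed_process_id)
    if state is None or state.completed:
        return None
    # Compensate in reverse order so later effects are undone before earlier ones.
    for effect in reversed(state.applied_effects):
        compensators[effect]()
    state.applied_effects.clear()
    repo.save(failed_process_id, state)
    return state.message_id
```

A peer worker would call take_over with the failed process's id and then reprocess the returned message. Undoing effects in reverse order is one common choice for compensation; the use case itself does not prescribe an order.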


Initial State

Process has failed in a way that can be identified by monitoring services.

Scenario for "Recover Failed Process" Use Case

  1. Monitoring service detects that a process is non-performant.
    1. Process may have gone away, locked up (become non-responsive), or just stopped processing messages mid-stream.
  2. Monitoring service issues message to recovery service that process has failed.
    1. This message should produce an event of alarm-worthy significance (that is, an alarm-level notification should probably be set for this event).
  3. <1> Recovery service analyzes operational state of failed process to determine necessary corrective action(s).
    1. Was process actively processing one or more messages? If not, no recovery is required beyond restarting.
    2. Identify side effects to the extent possible. Relevant distributed state repository contents can be examined for changes since the last message was successfully processed.
    3. Have in-process message(s) already been re-routed to other queues and handled? Handle any known side effects as well as possible given this current status, then re-queue messages.
  4. <2> Restart process if necessary.
    1. Other processes may already be taking up the slack, or the system may have already restarted the process when its absence was noted.
  5. <2> Notify Integrated Observatory Operator of identified impacts and corrective actions.
    1. Again, the generated event(s) should result in alert(s).
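The scenario steps can be sketched as a small planning function. This is an illustration under stated assumptions, not the actual monitoring or recovery service: the ProcessSnapshot fields, the timeout values, and the names detect_failure and plan_recovery are all hypothetical. Side-effect compensation (step 3.2) is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Failure(Enum):
    GONE = "gone"                  # process has gone away
    UNRESPONSIVE = "unresponsive"  # process has locked up
    STALLED = "stalled"            # process stopped handling messages mid-stream


@dataclass
class ProcessSnapshot:
    """Hypothetical view of a process as seen by the monitoring service."""
    process_id: str
    alive: bool
    seconds_since_heartbeat: float
    seconds_since_progress: float = 0.0
    in_flight: List[str] = field(default_factory=list)        # taken but unfinished
    already_handled: List[str] = field(default_factory=list)  # re-routed and done elsewhere


@dataclass
class RecoveryPlan:
    alarms: List[str] = field(default_factory=list)
    requeue: List[str] = field(default_factory=list)
    restart: bool = False
    operator_notice: str = ""


def detect_failure(snap: ProcessSnapshot,
                   heartbeat_timeout: float = 30.0,
                   progress_timeout: float = 120.0) -> Optional[Failure]:
    """Step 1: classify the failure; the timeouts are illustrative values."""
    if not snap.alive:
        return Failure.GONE
    if snap.seconds_since_heartbeat > heartbeat_timeout:
        return Failure.UNRESPONSIVE
    if snap.in_flight and snap.seconds_since_progress > progress_timeout:
        return Failure.STALLED
    return None


def plan_recovery(snap: ProcessSnapshot,
                  peers_covering: bool = False,
                  already_restarted: bool = False) -> Optional[RecoveryPlan]:
    failure = detect_failure(snap)
    if failure is None:
        return None
    plan = RecoveryPlan()
    # Step 2: the failure message should raise an alarm-level event.
    plan.alarms.append(f"process {snap.process_id} failed: {failure.value}")
    # Step 3: re-queue only in-flight messages not already handled elsewhere.
    handled = set(snap.already_handled)
    plan.requeue = [m for m in snap.in_flight if m not in handled]
    # Step 4: restart only if nothing else has taken up the slack.
    plan.restart = not (peers_covering or already_restarted)
    # Step 5: summarize impacts and corrective actions for the operator.
    plan.operator_notice = (
        f"{snap.process_id}: {failure.value}; "
        f"re-queued {len(plan.requeue)} message(s); restart={plan.restart}"
    )
    return plan
```

Returning a plan instead of acting directly keeps the decision logic testable in isolation; a real recovery service would then carry out the plan against its queues and notification channels.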

Final State

Process has been recovered to the maximum extent possible.


These comments provide additional context (usually quite technical) for editors of the use case.


