Recovery Action Design Pattern

Intent

Decouple the management of individual failure recovery actions from their implementation by allowing recovery actions to be manipulated as abstract entities independent of the specific recovery actions they implement.

Based On

This pattern is derived from the failure recovery design pattern proposed by the AOCS Framework (see also: A. Pasetti, Embedded Control Systems and Software Frameworks, Springer-Verlag, 2002).

Motivation

Many OBS are capable of performing a certain number of failure detection checks. The detection of a failure or suspected failure may lead to the execution of a recovery action. Thus, on-board systems often specify a number of failure detection checks and associate with each check one or more recovery actions to be executed when the check fails.

The type of actions to be executed in response to a given failure obviously varies across applications, but the way these actions are managed presents some similarities. In most cases, there is a requirement that it be possible to disable and enable individual recovery actions; that it be possible to react only to consecutive occurrences of the same failure; that execution of the recovery action be recorded as an event; etc.

This design pattern allows these commonalities to be factored out by encapsulating recovery actions in objects whose classes are derived from a base class that implements the invariant recovery action operations.

Dictionary Entries

The following abstractions or domain-wide concepts are defined to support the implementation of this design pattern:

Recovery Action

Structure

The recovery action design pattern represents the recovery action abstraction as an abstract interface RecoveryAction that defines the generic operations that can be performed on a generic recovery action. Concrete recovery actions are implemented as instances of classes that implement RecoveryAction. Recovery actions are therefore plug-in components: the components that must execute them see them only as instances of the abstract type RecoveryAction.

It is often necessary to execute several recovery actions in response to the same failure. For this reason, the design pattern allows recovery actions to be linked together to form a chain. Clients only see one end of the chain and the operations they perform on it are automatically propagated to all members in the chain. Clients are thus unaware of whether they deal with a single recovery action or with a chain of linked recovery actions.
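The chaining mechanism can be sketched as follows. This is a minimal illustration under stated assumptions, not the OBS Framework implementation: the names setNextRecoveryAction, doAction and CountingRecoveryAction are hypothetical, and the sketch moves the action-specific behaviour into a protected doAction hook so that the base class can propagate doRecovery down the chain.

```cpp
#include <cassert>

// Sketch only: setNextRecoveryAction, doAction and CountingRecoveryAction
// are hypothetical names, not taken from the OBS Framework.
class RecoveryAction {
public:
    virtual ~RecoveryAction() {}

    // Link another recovery action behind this one.
    void setNextRecoveryAction(RecoveryAction* next) { nextAction = next; }

    // Execute this action, then propagate the call down the chain.
    void doRecovery() {
        doAction();
        if (nextAction != nullptr)
            nextAction->doRecovery();
    }

protected:
    // Concrete recovery actions implement their specific behaviour here.
    virtual void doAction() = 0;

private:
    RecoveryAction* nextAction = nullptr;
};

// Stand-in concrete action: it only counts how often it was executed.
class CountingRecoveryAction : public RecoveryAction {
public:
    explicit CountingRecoveryAction(int* counter) : counter(counter) {}

protected:
    void doAction() override { ++(*counter); }

private:
    int* counter;
};

// The client holds only the head of the chain and calls doRecovery once.
int demoChain() {
    int executed = 0;
    CountingRecoveryAction first(&executed);
    CountingRecoveryAction second(&executed);
    first.setNextRecoveryAction(&second);

    RecoveryAction* head = &first;
    head->doRecovery();
    assert(executed == 2);  // both members of the chain were executed
    return executed;
}
```

The client code is identical whether head refers to a single recovery action or to the first element of a chain.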

Participants

Client: The component that executes the recovery action or performs housekeeping operations (e.g. disabling and enabling) on it.
RecoveryAction: The abstract interface or base abstract class that defines the basic operations that can be performed on generic recovery actions.
ConcreteRecoveryAction: Component implementing (or derived from) RecoveryAction that represents a specific and concrete recovery action. At a minimum, it must provide an implementation for the doRecovery operation. Other base operations could in principle be inherited from an abstract RecoveryAction base class.

Collaborations

Typical operational scenarios for this design pattern are:

A component that may need to execute a recovery action is loaded with the recovery action component that implements it (as an instance of type RecoveryAction) and, when the conditions for the execution occur, calls its doRecovery method.
A component that executes a telecommand to disable or enable a recovery action holds a reference to the recovery action (which it sees as an instance of type RecoveryAction) and, when the telecommand must be executed, calls the disable or enable method on it.

Consequences

Clients are decoupled from the implementation of recovery actions: they only see abstract recovery actions and only interact with them through the same interface. Changing the concrete recovery action that is associated with a component has no impact on that component.
Functionalities that are common to all recovery actions (e.g. the management of the enable/disable status) can be placed in the base RecoveryAction class and can be coded only once.
Linked lists of recovery actions can be treated as if they were one single recovery action: the client is not - and need not be - aware of whether it is executing one single or several recovery actions.
It is possible to build a library of commonly recurring recovery actions and to use them within an application as ready-made components.
It is necessary to have a dedicated class for each concrete recovery action required by an application. This may lead to a proliferation of small classes.

Applicability

This design pattern is useful when:

components in an application need to execute and handle recovery actions
it is necessary to be able to vary the implementation of the recovery actions without affecting the components that execute or handle them

Implementation Issues

Conceptually, RecoveryAction is an abstract interface but instantiation of the pattern will often implement it as a base abstract class that provides concrete implementations for its housekeeping operations and leaves only doRecovery as an abstract operation to be defined in concrete subclasses.
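A minimal sketch of such a base abstract class is given below. It assumes that the housekeeping operations are enable, disable and isEnabled, and that each concrete class checks its own enable status; the class SwitchToRedundantUnit and its counter are hypothetical stand-ins for a real recovery action.

```cpp
#include <cassert>

// Sketch only: the base class implements the housekeeping operations
// once and leaves doRecovery abstract.
class RecoveryAction {
public:
    virtual ~RecoveryAction() {}

    void enable() { enabled = true; }
    void disable() { enabled = false; }
    bool isEnabled() const { return enabled; }

    // The only operation that concrete subclasses must define.
    virtual void doRecovery() = 0;

private:
    bool enabled = true;
};

// Hypothetical concrete action; the counter stands in for the real
// switch-over logic.
class SwitchToRedundantUnit : public RecoveryAction {
public:
    void doRecovery() override {
        if (!isEnabled())
            return;          // a disabled recovery action does nothing
        ++switchCount;
    }

    int getSwitchCount() const { return switchCount; }

private:
    int switchCount = 0;
};

int demoEnableDisable() {
    SwitchToRedundantUnit action;
    action.doRecovery();                    // enabled by default: executed
    assert(action.getSwitchCount() == 1);
    action.disable();
    action.doRecovery();                    // disabled: skipped
    assert(action.getSwitchCount() == 1);
    action.enable();
    action.doRecovery();                    // enabled again: executed
    return action.getSwitchCount();
}
```

An alternative would be to perform the enable check in the base class and delegate to a protected abstract method, so that subclasses cannot forget the check.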

Which operations should be defined at the level of RecoveryAction? The class diagram of the pattern considers only three types of operation but one might conceivably want to implement more (or fewer). For instance, in some on-board applications, failures that are detected only once (or only a small number of times) are treated differently from failures that recur in several consecutive operating cycles. The corresponding logic could be placed in a RecoveryAction base class. Similarly, execution of a recovery action should sometimes be recorded as an Event. The logic to create the event report could again be placed in a RecoveryAction base class.

Since they are encapsulated in objects, recovery actions can have memory. Thus, it is possible to make the execution of a recovery action conditional upon past executions. Consider for instance the case where recovery should only be performed if a certain failure condition persists for two consecutive cycles (this is often done to avoid triggering recovery actions in response to the detection of spurious failures). In such a case, the simplest mechanism is to have a recovery action that returns without performing any action the first time it is called and that only executes some concrete action after it is called twice in a row (indicating that the failure is persistent).
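One way to realize this mechanism is sketched below. The sketch assumes that the recovery action can read a cycle counter maintained elsewhere in the application (passed here as a pointer); this counter, the class name PersistentFaultRecoveryAction and the recoveryCount stand-in are all hypothetical and not part of the pattern definition.

```cpp
#include <cassert>

class RecoveryAction {
public:
    virtual ~RecoveryAction() {}
    virtual void doRecovery() = 0;
};

// Sketch only: performs the concrete action only when doRecovery is
// called in two consecutive cycles.
class PersistentFaultRecoveryAction : public RecoveryAction {
public:
    // cycleCounter is assumed to be incremented once per operating
    // cycle by some other component (hypothetical).
    explicit PersistentFaultRecoveryAction(const unsigned* cycleCounter)
        : cycle(cycleCounter) {}

    void doRecovery() override {
        bool calledInPreviousCycle =
            calledBefore && (*cycle == lastCallCycle + 1);
        calledBefore = true;
        lastCallCycle = *cycle;
        if (calledInPreviousCycle)
            ++recoveryCount;   // stand-in for the concrete recovery
    }

    int getRecoveryCount() const { return recoveryCount; }

private:
    const unsigned* cycle;
    unsigned lastCallCycle = 0;
    bool calledBefore = false;
    int recoveryCount = 0;
};

int demoPersistentFault() {
    unsigned cycle = 1;
    PersistentFaultRecoveryAction action(&cycle);

    action.doRecovery();                 // cycle 1: first detection, ignored
    assert(action.getRecoveryCount() == 0);
    cycle = 2;
    action.doRecovery();                 // cycle 2: second in a row, recover
    assert(action.getRecoveryCount() == 1);
    cycle = 4;
    action.doRecovery();                 // cycle 3 was fault-free: ignored
    assert(action.getRecoveryCount() == 1);
    cycle = 5;
    action.doRecovery();                 // cycles 4 and 5 in a row: recover
    return action.getRecoveryCount();
}
```

The client is unaware of this filtering: it calls doRecovery whenever the check fails, and the decision whether to act is internal to the recovery action.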

The sample definition of interface RecoveryAction given in the class diagram of the design pattern foresees methods to enable and disable individual recovery actions. There is sometimes a need to disable or enable all recovery actions. This type of requirement can be implemented by having static enable/disable methods.
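The static variant could look as follows. This is a sketch under assumptions: the method names enableAll and disableAll and the class DummyRecoveryAction are hypothetical, and the sketch assumes that a recovery action is active only when both its own flag and the global flag are set.

```cpp
#include <cassert>

// Sketch only: a static flag shared by all recovery actions is combined
// with the per-object enable flag.
class RecoveryAction {
public:
    virtual ~RecoveryAction() {}

    void enable() { enabled = true; }
    void disable() { enabled = false; }

    // Hypothetical static operations acting on all recovery actions.
    static void enableAll() { allEnabled = true; }
    static void disableAll() { allEnabled = false; }

    bool isEnabled() const { return allEnabled && enabled; }

    virtual void doRecovery() = 0;

private:
    bool enabled = true;
    static bool allEnabled;
};

bool RecoveryAction::allEnabled = true;

// Stand-in concrete action that counts its executions.
class DummyRecoveryAction : public RecoveryAction {
public:
    void doRecovery() override {
        if (isEnabled())
            ++count;
    }
    int getCount() const { return count; }

private:
    int count = 0;
};

int demoGlobalEnable() {
    DummyRecoveryAction a;
    DummyRecoveryAction b;

    RecoveryAction::disableAll();
    a.doRecovery();
    b.doRecovery();
    assert(a.getCount() == 0 && b.getCount() == 0);  // all suppressed

    RecoveryAction::enableAll();
    a.disable();                 // individual flag still applies
    a.doRecovery();
    b.doRecovery();
    assert(a.getCount() == 0);
    return b.getCount();
}
```

Keeping the global flag separate from the individual flags means that enableAll restores the previous individual settings rather than overriding them.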

In the concept proposed here, a recovery action is a punctual action that is executed in one shot immediately after the fault has been detected. In some cases, however, the response to a fault must consist of a sequence of actions that may extend over several activation cycles. In such a case, the sequence of actions should be encapsulated in a manoeuvre and the recovery action will consist of loading the manoeuvre into the manoeuvre manager.

OBS Framework Mapping

The implementation of this design pattern in the OBS Framework is supported by the following classes:

RecoveryAction abstract interface --> RecoveryAction

Sample Code

Consider a recovery action associated with the detection of a transmission bus fault. It specifies that, when the fault is detected, there should be a switchover to the redundant bus if the fault is sporadic or a fall-back to SBY mode if the fault is permanent. The fault is defined to be permanent if it has occurred more than once. Use of the recovery action design pattern implies that a dedicated class be defined to encapsulate this recovery action. A tentative implementation of the doRecovery method for this class could be as follows:

	class BusFaultRecoveryAction : public RecoveryAction {
	  bool alreadyTried=false;	// true once the switch-over has been attempted
	  . . .
	public:
	  void doRecovery() {
	    if (!alreadyTried) 	// sporadic fault
	    {	 . . .		// do switch over to redundant bus
		 alreadyTried = true;
	    }
	    else		// permanent fault
	    {	 . . .		// command fall-back to SBY
	    }
	  }
	};

The component that implements the bus fault check would then look like this:

	class BusManager {
		RecoveryAction* busFaultRecoveryAction;
		. . .

	public:
		// Method to load recovery action as plug-in component
		void loadBusFaultRecoveryAction(RecoveryAction* ra) {
			busFaultRecoveryAction = ra;
		}

		// Method to perform the bus fault check
		void doBusFaultCheck() {
			if (bus fault detected)		// pseudocode condition
				busFaultRecoveryAction->doRecovery();
		}
		. . .
	};

This component sees the recovery action as a plug-in component that is loaded when the component is configured during the initialization phase. Consequently, its code is independent of which specific recovery action is executed in response to the bus fault. It is also independent of whether the fault is sporadic or permanent. The management of the sporadic/permanent status is done internally to the recovery action.

The configuration code for such a component could be as follows:

	BusFaultRecoveryAction* busFaultRecoveryAction = new BusFaultRecoveryAction();
	. . .
	BusManager* busManager = new BusManager();
	busManager->loadBusFaultRecoveryAction(busFaultRecoveryAction);

Note that the recovery action is created as an instance of a specific recovery action class but is loaded into the client component as an instance of the generic abstract class RecoveryAction. As mentioned earlier, the management of the sporadic/permanent status could be done at the level of an abstract base class RecoveryAction of the following kind:

	class RecoveryAction {
		int limit;
		int limitCounter=0;	// number of calls since the last reset
		bool sporadic=true;	// false once the limit has been exceeded
		. . .

	public:
		void setLimit(int l) {
			limit=l;
		}

		bool isSporadic() {
			return sporadic;
		}

		virtual void doRecovery() {
			limitCounter++;
			if (limitCounter>limit)
			{	sporadic=false;
				limitCounter=0;
			}
		}
		. . .
	};

The implementation of the doRecovery method in a derived class would then be as follows:

    class ConcreteRecoveryAction : public RecoveryAction {
        . . .

    public:
        void doRecovery() {
            RecoveryAction::doRecovery();	// update sporadic/permanent status
            if (isSporadic())
            . . .	// execute sporadic part of recovery action
            else
            . . .	// execute permanent part of recovery action
        }
    };

As a final example, consider a recovery action that requires execution of a complex manoeuvre extending over a prolonged period of time. Such a recovery action cannot be executed directly by a RecoveryAction object which, by definition, is activated only once and can only perform punctual actions within the same operational cycle. This type of situation can be handled by defining a Manoeuvre that is responsible for performing the recovery procedure and by having the recovery action load the manoeuvre into the manoeuvre manager. The corresponding RecoveryAction class can then be defined as follows:

    class ComplexRecoveryAction : public RecoveryAction {

        Manoeuvre* recoveryProcedure;
        ManoeuvreManager* manoeuvreManager;
        . . .

    public:
        void doRecovery() {
            . . .
            // hand the multi-cycle procedure over to the manoeuvre manager
            manoeuvreManager->add(recoveryProcedure);
        }
    };

The loading of the manoeuvre in a sense allows the recovery action to extend the range of its action beyond the cycle where the fault was detected.

Note that the recoveryProcedure manoeuvre can also be used in contexts other than the execution of the recovery action.

Remarks

None

Author

A. Pasetti (P&P Software)

Last Modified

2003-05-19

OBS Framework

Implementation - Design Patterns