Fault Management Based on Quality of Service Criteria

Douglas Wells
The Open Group
July 2001

This paper summarizes some thoughts on managing faults within large, complex, distributed object systems and describes initial results of an implementation experiment targeting real-time systems. The proposed method incorporates fault management within a QoS-based resource management architecture that allows trade-offs among multiple QoS dimensions, including timeliness, integrity, and reliability. Initial results of this process include a "fast failure detector," which can reliably detect host node failures in real-time systems with sub-second time constraints.

------------------------------------

Introduction

In recent years the DARPA Quorum program has sponsored research in managing the end-to-end quality-of-service (QoS) characteristics of mission-critical applications. This effort has resulted in the development of technology that permits configuration decisions to be delayed until execution time. This late binding allows execution-time factors to be incorporated into the decision process, resulting in more informed assignment of resources and, overall, more effective defense systems. The focus so far has been on guaranteeing essential services, often utilizing redundant components when possible, or preempting resources from other functions when necessary. The next step is to extend these capabilities to large, complex systems, where not all capabilities can be guaranteed, but where the objective is to optimize the overall system -- to utilize the resources available to produce the most valuable results. This is a goal common to both commercial and military systems.

A characteristic of large, complex, distributed systems is that one cannot expect all of their components to be operational simultaneously. At almost any point in time, some portions will be down due to planned maintenance or component failures. It is highly unlikely that these outages will occur in such a manner as to maintain system optimality. Traditionally, the issue of fault recovery has been addressed during the system design phase: system engineers would select fault-tolerant components for particularly critical subsystems, build in fall-back strategies for less important functions, and allow other functions to simply fail. The result is that run-time configuration choices are limited, and certain resources may be reserved and therefore unavailable for more valuable activities.

Our goal for resource management should be to select the overall best configuration based on the current situation and environment. This requires system-wide analysis of objectives and resources, including dynamic consideration of faulty components. We propose to incorporate fault management into these systems by utilizing QoS characteristics such as integrity, availability, and reliability within a hierarchical resource management architecture.

Overview

QoS-based fault management, the systematic handling of faults and failures within a system, necessarily incorporates the traditional concepts of failure detection, identification, analysis, and prediction. In addition, it must include functions for analyzing component dependencies and for distributing failure and fault data to interested parties, including resource managers and information visualizers.
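For concreteness, the sketch below shows one form such a fault report and its distribution hook might take. It is an illustration only: the types, fields, and names are hypothetical, drawn neither from CORBA interfaces nor from the prototype discussed later in this paper.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical sketch of a fault report and its distribution interface.
    enum class FaultStatus { Suspected, Confirmed, Predicted };

    // A fault report couples the identity of a suspect component with the
    // information a resource manager needs in order to re-plan: what failed,
    // how certain the detector is, and which components depend on it.
    struct FaultReport {
        std::string component;                // name of the suspect component
        FaultStatus status;                   // detected, confirmed, or predicted
        std::uint64_t detection_time_ns;      // timestamp from the detector's clock
        std::vector<std::string> dependents;  // components that rely on it
    };

    // Interested parties (resource managers, information visualizers)
    // subscribe to receive reports as they are produced.
    class FaultListener {
    public:
        virtual ~FaultListener() = default;
        virtual void on_fault(const FaultReport& report) = 0;
    };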
Large, complex systems comprise multiple applications with disparate time scales, and QoS-based fault handling must accommodate these differences. Real-time subsystems must respond under tight time constraints: resources needed for recovery must be available immediately and will often need to be pre-reserved. Other activities might allow time in which to dynamically consider alternative strategies. A run-time resource manager must balance these requirements by determining which resources are best held in reserve for the time-constrained subsystems.

Relationship to CORBA and ACE/TAO

CORBA -- and ACE/TAO -- provides an effective framework in which to address the problems of constructing large, complex, distributed systems. It provides a common typing system and a standard mechanism for invoking services. There is a common application namespace, and the Common Facilities include many necessary general capabilities. In addition, its applicability to real-time systems has been demonstrated.

At the same time, CORBA's strength, object orientation (OO), interferes with the effective provision of end-to-end properties. OO enforces opaqueness in order to encourage reuse: a component is obligated to do only what its documentation says it must do. Thus, a component that performs a function is good; a fault-tolerant version of that component is better; but there is no insight into the internal behavior of a commodity component. Quorum fostered the development of translucent layers, which expose selected aspects of a component's internal behavior without abandoning its reusable interface. In order to make effective use of component reliability information in run-time resource management, we must identify fault-related QoS metrics and characterize reusable, translucent object components. In much the same way that BBN's Quality Objects (QuO) project identified performance characteristics and incorporated them into wrapper objects, we must make fault-related information available to resource management.

An Initial Experiment

We have been working with the Naval Surface Warfare Center (NSWC) on a prototype battle defense system built from distributed components. A fundamental problem with such a distributed system is that, in order for the overall application to meet its time constraints in the presence of failures, its subtasks must operate under much shorter deadlines. In particular, the use of group communication techniques for scalability and fault tolerance requires detection of failed members an order of magnitude faster than the end-to-end application specifications. Rather than require that the entire application be written to these requirements, we have isolated the node failure detector into a separate component that can satisfy the more stringent time constraints. Written using real-time programming techniques and utilizing reserved CPU resources, this "fast failure detector" (FFD) employs a heartbeat-with-deadline function (sketched below) to reliably detect host failures in sub-second time frames, even in the presence of competing CPU loads. The FFD notifies applications of node failures and provides regular reports on host status to the resource manager. The FFD can also provide additional metrics, such as whether a host is in danger of missing its heartbeat deadline, which would cause a "false positive" failure indication.

Note that the FFD does not actually detect group member failures; that would require that the special real-time programming techniques be applied to the overall application. Instead, each FFD detects failures of the FFD components on other nodes. Use of the dependency tree information mentioned earlier then allows us to reason about the effect on the overall system. In this case, failure of an FFD is highly correlated with failure of the node on which it operates. The mission-critical application will have been extensively reviewed and tested, so the most likely cause of its failure is failure of the underlying node, which in the most relevant context is most likely battle damage: the intrusion of a foreign object into the host hardware. Finally, we retain the original failure detection capability of the group communication system, which continues to detect failures of group members -- now due to less likely causes and with a much lower probability of occurrence.
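To make the heartbeat-with-deadline mechanism concrete, the following is a minimal sketch of a peer monitor. It is illustrative only: the period and margin values, the stubbed transport, and all names are invented, and the sketch is not drawn from the NSWC prototype.

    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <random>
    #include <thread>

    using Clock  = std::chrono::steady_clock;
    using Millis = std::chrono::milliseconds;

    // Waits for the next heartbeat; returns true if one arrived before the
    // deadline. Injected as a parameter so the sketch can run with a stub.
    using HeartbeatWait = std::function<bool(Clock::time_point deadline)>;

    // Declare the peer failed after one missed deadline; warn when a
    // heartbeat arrives with less slack than 'margin' before its deadline.
    void monitor_peer(Millis period, Millis margin, const HeartbeatWait& wait) {
        for (;;) {
            const auto deadline = Clock::now() + period;
            if (!wait(deadline)) {
                std::puts("peer FFD missed its deadline: declaring node failed");
                return;  // here: notify applications and the resource manager
            }
            if (deadline - Clock::now() < margin)
                std::puts("warning: host in danger of missing heartbeat deadline");
        }
    }

    int main() {
        // Stub transport: heartbeats arrive with random jitter, and a long
        // delay models node failure. A real FFD would read from the network.
        std::mt19937 rng{42};
        std::uniform_int_distribution<int> jitter(10, 140);
        auto stub = [&](Clock::time_point deadline) {
            std::this_thread::sleep_for(Millis(jitter(rng)));
            return Clock::now() < deadline;
        };
        monitor_peer(Millis(100), Millis(20), stub);
    }

In this sketch a single missed deadline is treated as failure, which is what makes detection fast; a production detector might require stronger evidence before declaring failure, and it is the reserved CPU resources and real-time programming described above that make an aggressive single-miss policy tolerable.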
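Once a peer FFD is declared failed, the dependency tree information mentioned in the Overview can be used to infer the system-wide effect. The sketch below walks such a tree; the table contents and component names are invented for illustration and assume an acyclic dependency graph.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical dependency table: component -> components that depend
    // on it. In practice this would come from dependency analysis.
    using DepTree = std::map<std::string, std::vector<std::string>>;

    // Report every component transitively affected by a failure,
    // walking the dependency tree depth-first.
    void report_affected(const DepTree& deps, const std::string& failed) {
        auto it = deps.find(failed);
        if (it == deps.end()) return;
        for (const auto& d : it->second) {
            std::printf("affected: %s (depends on %s)\n",
                        d.c_str(), failed.c_str());
            report_affected(deps, d);
        }
    }

    int main() {
        DepTree deps = {
            {"ffd@node7", {"node7"}},  // FFD failure implies node failure
            {"node7",     {"tracker", "display"}},
            {"tracker",   {"weapons-control"}},
        };
        report_affected(deps, "ffd@node7");
    }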
Conclusion

Our initial investigation, including the FFD experiment, has supported our belief that knowledge about faults can be effectively incorporated into resource allocation decisions, and that the use of this information can improve coordination among applications with respect to resource sharing. We are in the process of building a real-time group communication product that uses the FFD, and we hope to extend these concepts through further research on applying hierarchical resource management to large, complex systems built using CORBA (and ACE/TAO) technology.