An evaluation of software fault tolerance in a practical system

The recovery block system is also complicated by the fact that it requires the ability to roll back the state of the system before trying an alternate. This may be accomplished in a variety of ways, including hardware support for these operations. This try-and-rollback ability makes the software behave much like a transactional system, in which a result is committed only after it has been accepted.

There are advantages to a system built in this transactional style, the largest being that it is difficult to drive such a system into an incorrect or unstable state. This property, in combination with checkpointing and recovery, may also aid in constructing a distributed, hardware fault-tolerant system.
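
As a concrete illustration, the following is a minimal sketch of a recovery block in Python; the primary, alternate, and acceptance test are illustrative stand-ins rather than parts of any real system.

    import copy

    def recovery_block(state, alternates, acceptance_test):
        # Checkpoint the state so every alternate starts from the same recovery point.
        checkpoint = copy.deepcopy(state)
        for alternate in alternates:
            candidate = copy.deepcopy(checkpoint)   # roll back before each try
            try:
                result = alternate(candidate)
            except Exception:
                continue                            # a crashing alternate is simply rejected
            if acceptance_test(result):             # the adjudicator accepts or rejects
                return result                       # only now is the result "committed"
        raise RuntimeError("all alternates failed the acceptance test")

    # Illustrative use: a primary routine and a deliberately different alternate.
    primary   = lambda xs: sorted(xs)
    alternate = lambda xs: sorted(xs, reverse=True)[::-1]   # stands in for a diverse design
    accept    = lambda ys: all(a <= b for a, b in zip(ys, ys[1:]))
    print(recovery_block([3, 1, 2], [primary, alternate], accept))   # -> [1, 2, 3]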

The N-version software concept attempts to parallel the traditional hardware fault-tolerance concept of N-way redundant hardware. In an N-version software system, each module is implemented in up to N different versions. Each variant accomplishes the same task, but ideally in a different way.

Each version then submits its answer to a voter or decider, which determines the correct answer (ideally, all versions agree and are correct) and returns it as the result of the module. Such a system can overcome the design faults present in most software by relying on the concept of design diversity. An important distinction of N-version software is that the system may run the multiple software versions on multiple types of hardware; the goal is to increase diversity in order to avoid common-mode failures.
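
A minimal sketch of an N-version module with a majority-vote decider is shown below, assuming three independently written versions of the same computation; the trivial versions here merely stand in for genuinely diverse implementations.

    from collections import Counter

    def n_version(inputs, versions):
        # Run every version on the same inputs and collect the answers they submit.
        answers = [version(*inputs) for version in versions]
        # The decider: a simple majority vote over the submitted answers.
        winner, votes = Counter(answers).most_common(1)[0]
        if votes > len(versions) // 2:
            return winner
        raise RuntimeError("no majority: the decider cannot adjudicate")

    # Trivial stand-ins for three diversely implemented versions of one computation.
    v1 = lambda a, b: a + b
    v2 = lambda a, b: b + a
    v3 = lambda a, b: sum((a, b))
    print(n_version((2, 3), [v1, v2, v3]))   # -> 5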

With N-version software, it is encouraged that each version be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different development environments.

The various development groups must have as little programming-related interaction with one another as possible. The dependence of N-version software, and of recovery blocks, on an appropriate specification cannot be stressed enough. The delicate balance the N-version method requires is that the specification be precise enough that the versions are fully interoperable, so that a software decider may choose equally among them, yet not so limiting that the programmers lack the freedom to create diverse designs.

Writing a specification flexible enough to encourage design diversity while maintaining compatibility between versions is a difficult task; nevertheless, most current software fault-tolerance methods rely on this delicate balance in the specification.

The N-version method presents the possibility of various faults being generated but successfully masked and ignored within the system. It is important, however, to detect and correct these faults before they become errors. First, a classification of the faults that arise in the N-version method: if only a single version in an N-version system is faulty, the fault is classified as a simplex fault.

If M versions within an N-version system have faults, the fault is declared to be an M-plex fault. M-plex faults are further classified into two classes: related faults and independent faults. Detecting, classifying, and correcting faults is an important task in any fault-tolerant system if it is to operate correctly over the long term.
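
Under the simplifying assumption that the majority answer is the correct one, a decider might count and classify disagreeing versions as in the sketch below; the function and its labels are illustrative only.

    from collections import Counter

    def classify_round(answers):
        # Heuristic: take the majority answer as correct and count dissenters.
        majority, _ = Counter(answers).most_common(1)[0]
        m = sum(1 for a in answers if a != majority)
        if m == 0:
            return "no fault observed"
        if m == 1:
            return "simplex fault"
        # Whether an M-plex fault is related or independent needs further diagnosis.
        return f"{m}-plex fault"

    print(classify_round([5, 5, 5]))   # no fault observed
    print(classify_round([5, 6, 5]))   # simplex fault
    print(classify_round([5, 6, 7]))   # 2-plex fault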

The differences between the recovery block method and the N-version method are not too numerous, but they are important. In traditional recovery blocks, each alternative would be executed serially until an acceptable solution is found as determined by the adjudicator. The recovery block method has been extended to include concurrent execution of the various alternatives.

The N-version method has always been designed to be implemented on N-way hardware running concurrently. In a serial retry system, the time cost of trying multiple alternatives may be too expensive, especially for a real-time system.

Conversely, concurrent systems require the expense of N-way hardware and a communications network to connect them.
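
As a rough sketch of the concurrent approach, the fragment below runs the versions in parallel threads before voting; the thread pool merely stands in for the N-way hardware and interconnect described above.

    from concurrent.futures import ThreadPoolExecutor
    from collections import Counter

    def concurrent_vote(inputs, versions):
        # Each version runs in its own thread, standing in for separate hardware.
        with ThreadPoolExecutor(max_workers=len(versions)) as pool:
            answers = list(pool.map(lambda v: v(*inputs), versions))
        winner, votes = Counter(answers).most_common(1)[0]
        if votes > len(versions) // 2:
            return winner
        raise RuntimeError("no majority among the concurrently executed versions")

    versions = [lambda a, b: a + b, lambda a, b: b + a, lambda a, b: sum((a, b))]
    print(concurrent_vote((2, 3), versions))   # -> 5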

Another important difference between the two methods is the difference between an adjudicator and a decider. The recovery block method requires that a specific adjudicator be built for each module; in the N-version method, a single decider may be used. The recovery block method, assuming that the programmer can create a sufficiently simple adjudicator, will create a system that is difficult to drive into an incorrect state. The engineering trade-offs, especially monetary costs, involved in developing either type of system have their advantages and disadvantages, and it is important for the engineer to explore the space and decide which solution is best for the project at hand.

Self-checking software is not a rigorously described method in the literature, but rather an ad hoc technique used in some important systems. Self-checking software refers to the extra checks, often including some amount of checkpointing and rollback recovery, added into fault-tolerant or safety-critical systems. Other techniques include separate tasks that "walk" the heap, finding and correcting data defects, and the option of falling back to degraded-performance algorithms.
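
A toy sketch of this style of self-checking is given below, assuming a hypothetical store that keeps a checksum with each record so a background task can walk it and repair silently corrupted data.

    import zlib

    class CheckedStore:
        """Hypothetical store in which every record carries a checksum."""
        def __init__(self):
            self._data = {}                      # key -> (value bytes, checksum)

        def put(self, key, value):
            self._data[key] = (value, zlib.crc32(value))

        def get(self, key):
            return self._data[key][0]

        def walk_and_repair(self, backing_copy):
            # The "walker": visit every record and restore any whose checksum fails.
            repaired = 0
            for key, (value, crc) in self._data.items():
                if zlib.crc32(value) != crc:              # data defect detected
                    self.put(key, backing_copy[key])      # repair from a known-good copy
                    repaired += 1
            return repaired

    backing = {"config": b"mode=safe"}
    store = CheckedStore()
    store.put("config", backing["config"])
    store._data["config"] = (b"garbage", store._data["config"][1])   # simulate silent corruption
    print(store.walk_and_repair(backing))   # -> 1 record repaired
    print(store.get("config"))              # -> b'mode=safe'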

While self-checking may not be a rigorous methodology, it has been shown to be surprisingly effective. The obvious problem with self-checking software is its lack of rigor.

The coverage of these checks in a fault-tolerant system is unknown. Furthermore, just how reliable is a system built with self-checking software? Without proper rigor and experiments, comparing and improving self-checking software cannot be done effectively.

Without software fault tolerance, it is generally not possible to build a truly fault-tolerant system. This means that a greater focus on software reliability and fault tolerance is necessary in order to ensure a fault-tolerant system; an ultra-fault-tolerant system needs software fault tolerance in order to be ultra-reliable.

Such systems are essential for missions in which the system may not be accessible; space missions and very deep undersea communications systems, for example, are not easily reached. These missions require systems whose reliability ensures that they will operate throughout their mission life.

Current software fault tolerance is based on traditional hardware fault tolerance, for better or worse. Both hardware and software fault tolerance are beginning to face a new class of problems: dealing with design faults. Hardware designers will soon face the question of how to create a microprocessor that effectively uses one billion transistors; as part of that daunting task, making the microprocessor correct becomes ever more challenging.

In the future, hardware and software may cooperate more in achieving fault tolerance for the system as a whole. Software methodology may be one of the best ways to build in software fault tolerance: building correct software would make large strides in system dependability. A system that is mostly correct, combined with some simpler fault-tolerance techniques, may be the best solution in the future. The view that software has to have bugs will have to be conquered.

If software cannot be made at least relatively bug-free, then the next generation of safety-critical systems will be deeply flawed. Reliable computing systems, often used as transaction servers and made by companies like Tandem, Stratus, and IBM, have shown that reliable computers can be built today; however, they have also demonstrated that the cost is significant.

Currently, the technologies used in these systems do not appear to scale well to the embedded marketplace. The answer may be that a networked world is indeed a better solution: reliable systems with humans watching over them, reached through ubiquitous networking, may ultimately resolve the embedded fault-tolerance issue.

Another possible remedy is the evolving application of degraded performance. While degraded performance may not be the ultimate solution, or acceptable in all cases, by limiting the amount of complexity necessary it may go a long way toward making correct and fault-tolerant software achievable. In the end, a solution that is cost-effective enough to be applied to the embedded world of computing systems is in dire need.
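
As a toy illustration of degraded performance, the sketch below prefers a more elaborate routine but falls back to a simpler, always-available one when the preferred path cannot be used; both routines are hypothetical.

    def robust_mean(samples):
        # Preferred algorithm: a trimmed mean that discards the extreme values.
        # Degraded mode: a plain average, simpler and always computable.
        try:
            if len(samples) < 3:
                raise ValueError("too few samples for the preferred algorithm")
            trimmed = sorted(samples)[1:-1]
            return sum(trimmed) / len(trimmed)
        except (ValueError, ZeroDivisionError):
            return sum(samples) / len(samples)   # degraded, but still a usable answer

    print(robust_mean([1.0, 2.0, 100.0]))   # preferred path -> 2.0
    print(robust_mean([1.0, 2.0]))          # degraded path  -> 1.5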

As today's common appliances, including automobiles, become increasingly computer-automated and relied upon by society, software fault tolerance becomes more necessary.

Complex safety-critical systems currently being designed and built are often difficult, multi-disciplinary undertakings. Advanced driver assistance systems (ADAS) are one example: they are inherently safety-critical and must tolerate failures in any subsystem. However, fault tolerance in safety-critical systems has traditionally been supported by hardware replication, which is prohibitively expensive in terms of cost, weight, and size for the automotive market.

Recent work has studied the use of software-based fault-tolerance techniques that utilize task-level hot and cold standbys to tolerate fail-stop processor and task failures. The benefit of using standbys is maximal when a task and any of its standbys obey the placement constraint of not being co-located on the same processor.

We introduce a task allocation algorithm that, for the first time to our knowledge, leverages the run-time attributes of cold standbys. Our empirical study finds that, in most cases, our heuristic uses no more than one additional processor relative to an optimal allocation that we construct for evaluation purposes.
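
The authors' algorithm is not reproduced here; purely to illustrate the placement constraint, the sketch below is a first-fit allocation that never co-locates a task with its own standby, using made-up task names and utilizations, and (unlike an approach that exploits cold standbys) reserving full capacity for every standby.

    def allocate(tasks, num_procs, capacity):
        # First-fit placement of each task's primary and standby on different processors.
        load = [0.0] * num_procs
        placement = {}
        for name, util in tasks:
            used = []                                   # processors this task already occupies
            for kind in ("primary", "standby"):
                for p in range(num_procs):
                    if p in used:                       # placement constraint: never co-locate
                        continue
                    if load[p] + util <= capacity:
                        load[p] += util
                        placement[(name, kind)] = p
                        used.append(p)
                        break
                else:
                    raise RuntimeError(f"cannot place the {kind} of {name}")
        return placement

    # Made-up ADAS-style tasks with utilizations, on three processors of capacity 1.0.
    print(allocate([("brake", 0.4), ("lane", 0.3), ("cruise", 0.3)], 3, 1.0))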

We use this implementation to provide an experimental evaluation of our task-level fault-tolerance features.


