Centre for Software Reliability

Research on Diversity and Software Fault Tolerance at the CSR

Diversity - background and history


Follow these links for surveys produced at CSR about diversity for fault tolerance, concerning modelling and assessment, architectural and design issues, and ways for pursuing diversity in developing redundant software components. CSR also covers these topics in its 1-day and 5-day Continuous Professional Development (CPD) courses.


The use of diversity - doing things differently, in two or more ways, to protect against the failures of single procedures - has been ubiquitous in safety-critical industries for decades. In many of these applications, the benefits have been regarded as 'obvious', and it is only in more recent years that there have been formal models and studies of efficacy. Projects at CSR (DISCS, DOTS and the sequence of "DISPO" projects for the UK nuclear safety programme) have been at the forefront of this research work.

Early studies concerned component redundancy and diversity in hardware systems, and there is a huge literature on common mode faults, beta factors, etc. More recently (in the past 25 years) there has been considerable interest in the use of diversity in software-based systems. A driver for this research was the need for very highly reliable software, coupled with the realisation that there were severe difficulties in making a single version of a program very reliable (e.g. via reliability growth from extensive testing and debugging) (Miller, Morell et al. 1992; Littlewood and Strigini 1993). The use of multi-version software, developed independently and adjudicated at run-time, seemed a possible way out of the difficulties: early work in the field was probably motivated by an analogy with hardware redundancy.

There are some early applications of software diversity that appear to have been successful: examples include critical flight-control computers on Airbus aircraft (Briere and Traverse 1993) and various railway signalling and control systems, see e.g. (Hagelin 1988). After many years of operational use, there seem to be no reports of catastrophic failure of these systems attributable to software design faults.

In spite of these successes as judged after the fact, there have been serious difficulties in assessing the reliability and safety of fault-tolerant diverse systems before deployment - which is precisely the problem faced by a regulator in deciding whether such systems are fit for purpose.

This difficult problem of assessing and predicting the dependability of design-diverse, fault-tolerant software-based systems has been the subject of much research, both experimental and theoretical. Several experiments, for example, have established that it would be unreasonable to claim that diverse software versions fail independently (Knight and Leveson 1986; Eckhardt, Caglayan et al. 1991): you cannot expect that a 1-out-of-2 system built from channels each having a probability of failure on demand (pfd) of 10^-3 will have a pfd of 10^-6. On the other hand, these experiments did show that there was some benefit from the fault tolerance. The Knight and Leveson experiment involved developing 27 versions and subjecting them to 1,000,000 test cases against an oracle version that was presumed correct. On each test case, a vector of 27 dimensions recorded the result - correct or incorrect - of each version. The authors were thus able to calculate the hypothetical reliabilities of fault-tolerant architectures comprising different versions. For example, they examined all 2-out-of-3 systems that could be constructed, and found that the average reliability among these was an order of magnitude better than the average reliability of the 27 single versions.
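
The kind of calculation described in the preceding paragraph can be sketched in Python. This is a minimal illustration on made-up data, not the analysis code of the original experiment: the four toy failure vectors, the helper names (fails_on, failure_rate, two_out_of_three, average_rates) and the rule that a 2-out-of-3 system fails whenever at least two of its versions fail on the same test case are all assumptions made for the example.

```python
from itertools import combinations


def fails_on(cases, n_tests):
    """Build a failure vector: True where the version failed (hypothetical data)."""
    return [i in cases for i in range(n_tests)]


def failure_rate(outcomes):
    """Fraction of test cases on which a version (or system) failed."""
    return sum(outcomes) / len(outcomes)


def two_out_of_three(a, b, c):
    """Failure vector of a 2-out-of-3 system.

    Assumption: the system fails on a test case when at least two of its
    three versions fail on that case.
    """
    return [x + y + z >= 2 for x, y, z in zip(a, b, c)]


def average_rates(versions):
    """Average failure rate of the single versions and of all 2-out-of-3 systems."""
    single = sum(failure_rate(v) for v in versions) / len(versions)
    triples = list(combinations(versions, 3))
    voted = sum(failure_rate(two_out_of_three(*t)) for t in triples) / len(triples)
    return single, voted


# Toy data: 4 versions, 10 test cases. Test case 1 is "hard" for three versions,
# so their failures are clearly not independent.
N_TESTS = 10
versions = [
    fails_on({0, 1}, N_TESTS),
    fails_on({1, 2}, N_TESTS),
    fails_on({1}, N_TESTS),
    fails_on({5}, N_TESTS),
]

single, voted = average_rates(versions)
print(f"average single-version failure rate: {single:.3f}")
print(f"average 2-out-of-3 failure rate:     {voted:.3f}")
```

In this toy data set the voted systems fail less often on average than the single versions, but the gain is modest because several versions fail on the same test case. With real data, the same calculation would simply be run over the recorded result vectors for every test case.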

The issue, then, becomes an empirical one. If it is not possible to assume independence of failures, with the resulting simplified mathematics, we need in each case to assess the degree of dependence between the failures of the different versions. This turns out to be very difficult.
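
As a rough sketch of what assessing dependence means for a single pair of versions, the following Python fragment compares the observed rate of coincident failures with the rate that independence would predict (the product of the two individual failure rates). The data and the function name coincident_vs_independent are hypothetical. With highly reliable versions, coincident failures are rare events, so a direct estimate of this kind needs a very large number of test cases.

```python
def coincident_vs_independent(a, b):
    """Return (observed, predicted) rates of both versions failing together.

    a and b are failure vectors over the same test cases (True = failed).
    'predicted' is what the independence assumption would give.
    """
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    observed = sum(x and y for x, y in zip(a, b)) / n
    predicted = p_a * p_b
    return observed, predicted


# Hypothetical failure vectors for two versions over 10 test cases.
version_a = [i in {0, 1} for i in range(10)]
version_b = [i in {1, 2} for i in range(10)]

observed, predicted = coincident_vs_independent(version_a, version_b)
print(f"observed coincident failure rate: {observed:.3f}")
print(f"predicted under independence:     {predicted:.3f}")
```

An observed rate well above the predicted one, as in this toy case, indicates positively correlated failures, so a 1-out-of-2 system built from the pair would be less reliable than the independence assumption suggests.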

References
  • Miller, Morell et al. 1992 - "Estimating the probability of failure when testing reveals no failures." IEEE Trans Software Engineering 18(1).
  • Littlewood and Strigini 1993 - "Assessment of ultra-high dependability for software-based systems." CACM 36(11): 69-80.
  • Briere and Traverse 1993 - Airbus A320/A330/A340 Electrical Flight Controls - A Family Of Fault-Tolerant Systems. 23rd International Symposium on Fault-Tolerant Computing (FTCS-23), Toulouse, France, 22 - 24, IEEE Computer Society Press.
  • Hagelin 1988 - Ericsson safety system for railway control. Software Diversity in Computerised Control Systems. U. Voges. Vienna, Springer-Verlag: 11-22.
  • Knight and Leveson 1986 - "Experimental evaluation of the assumption of independence in multiversion software." IEEE Trans Software Engineering 12(1): 96-109.
  • Eckhardt, Caglayan et al. 1991 - "An experimental evaluation of software redundancy as a strategy for improving reliability." IEEE Trans Software Engineering 17(7): 692-702.

Contributions of research conducted at the CSR

CSR has made substantial contributions to the modelling, assessment and design of diversity and software fault tolerance. As mentioned above, CSR has produced surveys about diversity for fault tolerance, concerning modelling and assessment, architectural and design issues, and ways for pursuing diversity in developing redundant software components. Recent studies have covered, for instance:

Papers produced in CSR on diversity

Papers produced in various projects on diversity: DISCS, DISPO, DOTS, DIRC and INDEED

_________________________________________________________________

Previous papers by researchers of CSR at City University (in chronological order):