Research on Diversity and Software Fault Tolerance at the CSR
Diversity - background and history
Follow these links for surveys produced at CSR about diversity for fault tolerance, concerning modelling and assessment, architectural and design issues, and ways for pursuing diversity in developing redundant software components. CSR also covers these topics in its 1-day and 5-day Continuous Professional Development (CPD) courses.
The use of diversity - doing things differently, in two or more ways, to protect against the failures of single procedures - has been ubiquitous in safety-critical industries for decades. In many of these applications, the benefits have been regarded as 'obvious', and it is only in more recent years that there have been formal models and studies of efficacy. Projects at CSR (DISCS, DOTS and the sequence of "DISPO" projects for the UK nuclear safety programme) have been at the forefront of this research work.
Early studies concerned component redundancy and diversity in hardware systems, and there is a huge literature on common mode faults, beta factors, etc. More recently (in the past 25 years) there has been considerable interest in the use of diversity in software-based systems. A driver for this research was the need for very highly reliable software, coupled with the realisation that there were severe difficulties in making a single version of a program very reliable (e.g. via reliability growth from extensive testing and debugging) (Miller, Morell et al. 1992; Littlewood and Strigini 1993). The use of multi-version software, developed independently and adjudicated at run-time, seemed a possible way out of the difficulties: early work in the field was probably motivated by an analogy with hardware redundancy.
There are some early applications of software diversity that appear to have been successful: examples include critical flight-control computers on Airbus aircraft (Briere and Traverse 1993); various railway signalling and control systems, see e.g. (Hagelin 1988). After experiencing many years of operational use, there seem to be no reports of catastrophic failure of these systems attributable to software design faults.
In spite of these successes as judged after the fact, there have been serious difficulties in assessing the reliability and safety of fault-tolerant diverse systems before deployment - which is precisely the problem faced by a regulator in deciding whether such systems are fit for purpose.
This difficult problem of assessment and prediction of the dependability of design-diverse fault tolerant software-based systems has been the subject of much research - both experimental and theoretical. In several experiments, for example, it has been established that it would be unreasonable to claim that diverse software versions fail independently (Knight and Leveson 1986; Eckhardt, Caglayan et al. 1991): you cannot expect that a 1-out-of-2 system built from channels each having a probability of failure on demand (pfd) of 10^-3 will have a pfd of 10^-6. On the other hand, these experiments did show that there was some benefit from the fault tolerance. The Knight and Leveson experiment involved developing 27 versions and subjecting them to 1,000,000 test cases against an oracle version that was presumed correct. On each test case, a vector of 27 dimensions recorded the result - correct or incorrect - of each version. The authors were thus able to calculate the hypothetical reliabilities of fault tolerant architectures comprising different versions. For example, they examined all 2-out-of-3 systems that could be constructed, and found that the average reliability among these was an order of magnitude better than the average reliability of the 27 single versions.
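The kind of calculation described above can be sketched in a few lines. The following is a minimal illustration only, not the procedure used in the cited experiments: the function names and the tiny `TOY_RESULTS` table are hypothetical, and the data are invented purely to show the mechanics. Each row is one test case and each column one version, with `True` meaning that the version failed on that test case.

```python
from itertools import combinations

# Hypothetical toy data (NOT from any experiment): 10 test cases, 4 versions.
# TOY_RESULTS[t][v] is True if version v failed on test case t.
TOY_RESULTS = [
    [True, True, False, False],   # versions 0 and 1 fail on the same test case
    [False, False, True, False],  # version 2 fails alone
] + [[False, False, False, False] for _ in range(8)]


def single_version_pfds(results):
    """Observed fraction of failed test cases for each version."""
    n_tests = len(results)
    n_versions = len(results[0])
    return [sum(row[v] for row in results) / n_tests for v in range(n_versions)]


def observed_pair_pfd(results, a, b):
    """Observed pfd of a 1-out-of-2 pair: fails only when both versions fail."""
    return sum(1 for row in results if row[a] and row[b]) / len(results)


def independent_pair_pfd(pfd_a, pfd_b):
    """Pfd the pair WOULD have if the two versions failed independently."""
    return pfd_a * pfd_b


def two_out_of_three_pfd(results, triple):
    """A 2-out-of-3 system fails when at least two of its versions fail."""
    failed = sum(1 for row in results if sum(row[v] for v in triple) >= 2)
    return failed / len(results)


def average_two_out_of_three_pfd(results):
    """Average pfd over every triple of versions that can be formed."""
    triples = list(combinations(range(len(results[0])), 3))
    return sum(two_out_of_three_pfd(results, t) for t in triples) / len(triples)


if __name__ == "__main__":
    pfds = single_version_pfds(TOY_RESULTS)
    print("single-version pfds:", pfds)
    print("observed pfd of pair (0, 1):", observed_pair_pfd(TOY_RESULTS, 0, 1))
    print("pfd under independence:", independent_pair_pfd(pfds[0], pfds[1]))
    print("average 2-out-of-3 pfd:", average_two_out_of_three_pfd(TOY_RESULTS))
```

In the toy data, versions 0 and 1 always fail together, so the observed pfd of that pair is far higher than the product of their individual pfds. This is the gap between measured and assumed-independent behaviour that the experiments set out to examine.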
The issue, then, becomes an empirical one. If it is not possible to assume independence of failures, with the resulting simplified mathematics, we need in each case to assess the degree of dependence between the failures of the different versions. This turns out to be very difficult.
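One elementary way to express the "degree of dependence" is through the covariance of failure indicators. The notation here is an assumption introduced for illustration, not taken from the cited papers: let $X_A$ and $X_B$ equal 1 when versions A and B respectively fail on a randomly chosen demand, with $p_A = P(X_A = 1)$ and $p_B = P(X_B = 1)$.

```latex
\begin{align*}
P(\text{both fail}) &= E[X_A X_B] = p_A \, p_B + \operatorname{Cov}(X_A, X_B), \\
p_A \, p_B \;\le\; P(\text{both fail}) &\le \min(p_A, p_B)
  \quad \text{whenever } \operatorname{Cov}(X_A, X_B) \ge 0 .
\end{align*}
```

With $p_A = p_B = 10^{-3}$, independence (zero covariance) would give $10^{-6}$, while any positive covariance moves the pfd of the 1-out-of-2 system towards the single-version figure of $10^{-3}$. Assessing a diverse system therefore amounts to estimating this covariance term, for which direct evidence is usually scarce.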
References
- Miller, Morell et al. 1992 - "Estimating the probability of failure when testing reveals no failures." IEEE Trans Software Engineering 18(1).
- Littlewood and Strigini 1993 - "Assessment of ultra-high dependability for software-based systems." CACM 36(11): 69-80
- Briere and Traverse 1993 - Airbus A320/A330/A340 Electrical Flight Controls - A Family Of Fault-Tolerant Systems. 23rd International Symposium on Fault-Tolerant Computing (FTCS-23), Toulouse, France, 22 - 24, IEEE Computer Society Press.
- Hagelin 1988 - Ericsson safety system for railway control. Software Diversity in Computerised Control Systems. U. Voges. Vienna, Springer-Verlag: 11-22.
- Knight and Leveson 1986 - "Experimental evaluation of the assumption of independence in multiversion software." IEEE Trans Software Engineering 12(1): 96-109.
- Eckhardt, Caglayan et al. 1991 - "An experimental evaluation of software redundancy as a strategy for improving reliability." IEEE Trans Software Eng 17(7): 692-702.
Contributions of research conducted at the CSR
CSR has made substantial contributions to the modelling, assessment and design of diversity and software fault tolerance. The surveys mentioned above cover modelling and assessment, architectural and design issues, and ways of pursuing diversity in developing redundant software components. Recent studies have covered, for instance:
- diversity in human-machine systems, with a case study on the dependability of Computer Aided Detection (CAD) in mammography;
- the impact of different choices of programming languages for software versions;
- the use of diversity in arguments to support dependability claims;
- the impact of diverse Anti-Virus engines in malware detection;
- diversity among complex off-the-shelf packages (database servers) and its use to improve dependability and performance;
- the use of SQL rephrasing rules to improve the dependability and fault tolerance of a database server.
Papers produced in CSR on diversity
Papers produced in various projects on diversity: DISCS, DISPO, DOTS, DIRC and INDEED:
- I. Gashi, C. Leita, V. Stankovic, O. Thonnard, "An Experimental Study of Diversity with Off-The-Shelf AntiVirus Engines", in Proc. NCA'09, the 8th IEEE International Symposium on Network Computing and Applications, IEEE Computer Society Press, accepted for publication.
- I. Gashi, P. Popov, V. Stankovic, "Uncertainty Explicit Assessment of Off-The-Shelf Software: A Bayesian Approach", in Elsevier Journal of Information and Software Technology, Elsevier, 51 (2), pp. 497–511, 2009. Abstract
- M.J.P. van der Meulen, M. A. Revilla, "The Effectiveness of Software Diversity in a Large Population of Programs," IEEE Trans Software Engineering, vol. 34, no. 6, pp. 753-764, Nov./Dec. 2008. Abstract
- B. Littlewood, P. Popov, L. Strigini, N. Shryane, "Modelling the Effects of Combining Diverse Software Fault Detection Techniques", Formal Methods and Testing: An outcome of the FORTEST Network Revised Selected Papers, (Hierons, R. M., Bowen, J. P., Harman, M., Eds.), vol. LNCS 4949, pp. 345 - 366, Springer, Berlin Heidelberg, 2008. Abstract
- P. Bishop, I. Gashi, B. Littlewood, D. Wright, "Reliability Modelling of a 1-Out-Of-2 System: Research with Diverse Off-The-Shelf SQL Database Servers", in Proc. ISSRE-2007, 18th International Symposium on Software Reliability Engineering, Trollhattan, Sweden, IEEE Computer Society Press, 2007, pp. 49-58. Abstract
- I. Gashi, P. Popov, L. Strigini, "Fault tolerance via diversity for off-the-shelf products: a study with SQL database servers", IEEE Transactions on Dependable and Secure Computing, IEEE Computer Society Press, 4(4), 2007, pp. 280-294. Abstract
- I. Gashi, P. Popov, "Uncertainty Explicit Assessment of Off-the-Shelf Software: Selection of an Optimal Diverse Pair", in Proc. ICCBSS-2007, Sixth International Conference on COTS Based Software Systems, Banff, Alberta, Canada, IEEE Computer Society Press, 2007, pp. 93-102. Abstract
- K. Salako, "Bounds on the Reliability of Fault-Tolerant Software Built by Forcing Diversity", Computer Safety, Reliability and Security: 26th International Conference, Safecomp 2007, Proceedings, Lecture Notes In Computer Science, (Saglietti F., Oster N., Eds.), pp. 411-416, Springer-Verlag, Nuremberg, 2007. Abstract
- B. Littlewood, D. Wright, "The use of multi-legged arguments to increase confidence in safety claims for software-based systems: a study based on a BBN of an idealised example", IEEE Trans Software Engineering, vol 33, no 5, pp 347-365, 2007. Abstract
- I. Gashi, P. Popov, "Rephrasing Rules for Off-The-Shelf SQL Database Servers", Proc. 6th European Dependable Computing Conference, 18-20 October 2006, IEEE Computer Society, Coimbra, Portugal, pp. 139-148. Abstract
- V. Stankovic, P. Popov, "Improving DBMS Performance through Diverse Redundancy", Proc. 25th IEEE Symp. on Reliable Distributed Systems (SRDS'06), 2-4 October 2006, IEEE Computer Society, Leeds, UK, pp. 391-400. Abstract
- M.J.P. van der Meulen, L. Strigini, M. Revilla, "On the Effectiveness of Run-Time Checks", Safecomp, 28-30 September 2005, Halden, Norway, (Gran, B.A., Winther, R., Eds.), In print. Lecture Notes in Computer Science, Springer-Verlag.
- A. Gorbenko, V. Kharchenko, P. Popov, A. Romanovsky, "Dependable Composite Web Services with Components Upgraded Online", in "Architecting Dependable Systems ADS III", (R. de Lemos, C. Gacek, A.Romanovsky, Eds.), vol. LNCS 3549, pp. 96-128 Lecture Notes in Computer Science, Springer-Verlag.
- L. Strigini, "Fault Tolerance Against Design Faults", in Dependable Computing Systems: Paradigms, Performance Issues, and Applications, (Hassan Diab and Albert Zomaya, Eds.), pp. 21-241, J. Wiley & Sons, 2005. Abstract
- M.J.P. van der Meulen and M. Revilla, "The effectiveness of Choice of Programming Language as a Diversity Seeking Decision", 5th European Dependable Computing Conference (EDCC-5), Budapest, Hungary, April 2005, (Dal Cin, M., Kaaniche, M., Pataricza, A., Eds.), pp. 199-209, Lecture Notes in Computer Science, Springer-Verlag, 2005.
- P. Popov and B. Littlewood, "The effect of testing on the reliability of fault-tolerant software", Proc International Conference on Dependable Systems and Networks (DSN2004), pp. 265-274, IEEE Computer Society, 2004. Abstract
- I. Gashi, P. Popov, V. Stankovic and L. Strigini, "On Designing Dependable Services with Diverse Off-The-Shelf SQL Servers", in Architecting Dependable Systems, (R. de Lemos, C. Gacek and A. Romanovsky, Eds.), pp. 191-214, Lecture Notes in Computer Science, Springer-Verlag , 2004. Abstract
- I. Gashi, P. Popov, and L. Strigini, "Fault diversity among off-the-shelf SQL database servers", Proc. DSN 2004, International Conference on Dependable Systems and Networks, Florence, Italy, 2004, pp. 389-398. Abstract
- V. Kharchenko, P. Popov and A. Romanovsky,"On Dependability of Composite Web Services with Components Upgraded Online" In Int. Conf. on Dependable Systems and Networks (DSN '04 - Workshop supplement), pages 287-291, Florence, Italy, 2004. Abstract
- P. Popov, L. Strigini, A. Kostov, V. Mollov and D. Selensky, "Software Fault-Tolerance with Off-the-Shelf SQL Servers", 3rd International Conference on COTS-Based Software Systems (ICCBSS '04), Redondo Beach, California, U.S.A., Lecture Notes in Computer Science, Volume 2959, Springer-Verlag, 2004, pp. 117-126. Abstract
- J.G.W. Bentley, P.G. Bishop, M.J.P. van der Meulen, "An Empirical Exploration of the Difficulty Function", Safecomp, 21-24 Sep. 2004, Potsdam, Germany, pp. 60-71, Springer-Verlag.
- M.J.P. van der Meulen, P.G. Bishop, M. Revilla, "An Exploration of Software Faults and Failure Behaviour in a Large Population of Programs", ISSRE 2004, Rennes, France, pp. 101-12, 2004, Abstract
- B. Littlewood and L. Strigini, "Redundancy and diversity in security", ESORICS 2004, 9th European Symposium on Research in Computer Security, Sophia Antipolis, France, Springer-Verlag LNCS 3193, 2004, pp. 423-438. Abstract
- L. Strigini, A. Povyakalo and E. Alberdi, "Human-machine diversity in the use of computerised advisory systems: a case study", Proc. DSN 2003, International Conference on Dependable Systems and Networks, San Francisco, U.S.A., 2003, pp. 249-258. Abstract
  Other papers on human-machine diversity (medical systems): DIRC mammography case study
- R. Bloomfield and B. Littlewood, "Multi-legged arguments: the impact of diversity upon confidence in dependability arguments", Proc. DSN 2003, International Conference on Dependable Systems and Networks, San Francisco, U.S.A., IEEE Computer Society, 2003, pp. 25-34. Abstract
- P. Popov and L. Strigini, "Diversity with Off-The-Shelf Components: A Study with SQL Database Servers", DSN 2003, International Conference on Dependable Systems and Networks - Fast Abstracts supplement, San Francisco, U.S.A., 2003, pp. B84-B85. Abstract
- P. Popov, L. Strigini, J. May, and S.Kuball, "Estimating Bounds on the Reliability of Diverse Systems", IEEE Transactions on Software Engineering, vol. SE-29, no. 4, 2003, pp.345-359 Abstract.
- B. Littlewood, P. Popov, L. Strigini, "Assessing the reliability of diverse fault-tolerant software-based systems", Safety Science, vol. 40, pp. 781-796, Pergamon, 2002.
- P. Popov, "Reliability Assessment of Legacy Safety-Critical Systems Upgraded with Off-the-Shelf Components", SAFECOMP'2002, September 2002, Catania, Italy, Lecture Notes in Computer Science, Springer-Verlag. Abstract.
- D. Bosio, B. Littlewood, M.J. Newby and L. Strigini. "Advantages of open source processes for reliability: clarifying the issues", Presented at Workshop on Open Source Software Development, Newcastle upon Tyne, February 2002. Abstract
- B. Littlewood, P. Popov, L. Strigini, "Design Diversity: an Update from Research on Reliability Modelling", Proc. Safety-Critical Systems Symposium 2001, Bristol, UK, Springer-Verlag. Abstract.
- B. Littlewood, P. Popov, L. Strigini, "Modelling software design diversity - a review", ACM Computing Surveys, Vol. 33, No. 2, June 2001, pp. 177-208. ACM. Abstract.
- P. Popov, L. Strigini, S. Riddle and A. Romanovsky, "Protective Wrapping of OTS Components", 4th ICSE Workshop on Component-Based Software Engineering: Component Certification and System Prediction, Toronto, May 2001.
- P. Popov, L. Strigini, S. Riddle and A. Romanovsky, "On Systematic Design of Protectors for Employing OTS Items", 27th Euromicro Conference, Workshop on Component-Based Software Engineering, Warsaw, Poland, 4-6 September 2001, pp. 22-29.
- B. Littlewood, P. Popov, L. Strigini and Nick Shryane, "Modelling the effects of combining diverse software fault removal techniques", IEEE Transactions on Software Engineering, Vol. SE-26, No. 12, pp.1157-1167, 2000. Abstract.
- B. Littlewood, P. Popov, L. Strigini, "Assessment of the Reliability of Fault-Tolerant Software: A Bayesian Approach", SAFECOMP'2000, October 2000, Rotterdam, Holland, Lecture Notes in Computer Science, No. 1943, ISBN 3-540-41186-0, pp. 294-308, Springer-Verlag. Abstract.
- B. Littlewood, P. Popov, L. Strigini, "Assessing the Reliability of Diverse Fault-Tolerant Systems", Proc. INucE International Conference on Control and Instrumentation in Nuclear Installations, Bristol, UK, 2000. Abstract.
- B. Littlewood, "The use of proof in diversity arguments", IEEE Transactions on Software Engineering, Vol. SE-26, No.10, pp.1022-1023, 2000. Abstract.
- P. Popov, L. Strigini, B. Littlewood "Choosing between Fault-Tolerance and Increased V&V for Improving Reliability", Proc. International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'2000), pp. 535-540, ISBN 1-892512-22-x, June 26-29, 2000, Monte Carlo Resort, Las Vegas, Nevada, USA. Abstract.
- B. Littlewood, P. Popov, L. Strigini, "N-version Design Versus one Good Version", Proc. International Conference on Dependable Systems and Networks DSN'2000 (FTCS-30, DCCA-8) - Fast Abstracts, New York, USA, 2000, p. B42-B43. Abstract.
- P. Popov, L. Strigini and A. Romanovsky, "Diversity for off-the-shelf Components", Proc. International Conference on Dependable Systems and Networks DSN'2000 (FTCS-30, DCCA-8) - Fast Abstracts, New York, USA, 2000, p. B60-B61. Abstract.
- L. Strigini, B. Littlewood, "A discussion of practices for enhancing diversity in software designs", Centre for Software Reliability, Technical Report LS_DI_TR_04, 2000. Abstract.
- B. Littlewood, P. Popov, L. Strigini, "A note on modelling functional diversity", "Reliability Engineering and System Safety", 66, (1999), pp. 93-95. Abstract.
- P. Popov, L. Strigini and A. Romanovsky, "Choosing Effective Methods for Design Diversity - How to progress from Intuition to Science", 18th International Conference, SAFECOMP'99, held in Toulouse, France, September 1999. "Lecture Notes in Computer Science series", No. 1698, ISBN 3-540-66488-2, pp. 272-285, Springer-Verlag. Abstract.
- P. Popov, L. Strigini, "The Reliability of diverse systems: a contribution using modelling of the fault-creation process", Centre for Software Reliability, Technical Report, City University, 1999. Abstract.
- P. Popov and L. Strigini, "Conceptual Models for the Reliability of Diverse Systems - New Results", FTCS'28, Munich, Germany, June 1998. Abstract.
- P. Popov, L. Strigini and M. Pizza, "The efficacy of Design Diversity against Design Error: some practical considerations", 3rd International Conference on Control & Instrumentation in Nuclear Installations, Edinburgh, 12-14 May, 1998. Abstract.
- M. Pizza, L. Strigini, "Comparing the effectiveness of testing methods in improving programs: the effect of variations in program quality", Proc. 9th International Symposium on Software Reliability Engineering, ISSRE'98, Paderborn, Germany, IEEE Computer Society Press, 1998, pp. 144-153. Abstract.
_________________________________________________________________
Previous papers by researchers of CSR at City University (in chronological order):
- P.G. Bishop, "Project on Diverse Software – An Experiment in Software Reliability", in Proc. 4th IFAC Workshop SAFECOMP'85, Como, Italy, 1985, pp. 153-158.
- L. Strigini and A. Avizienis, "Software Fault-Tolerance and Design Diversity: Past Experience and Future Evolution", in Proc. 4th IFAC Workshop SAFECOMP'85, Como, Italy, 1985, pp. 167-172.
- P.G. Bishop, D.G. Esp, M. Barnes, P. Humphreys, G. Dahll, J. Lahti, "PODS-A Project on Diverse Software", IEEE Trans. Software Engineering, vol. SE-12, no. 19, 1986, pp. 929-941.
- P.G. Bishop, "The PODS Diversity Experiment", in Dependable Computing and Fault Tolerant Systems, Vol. 2, (ed. U. Voges), Springer Verlag, ISBN 0-387-82014, New York-Wien, 1987, pp 51-84.
- B. Littlewood and D. R. Miller, "A conceptual model of multi-version software", in Proc. 17th International Symposium on Fault-Tolerant Computing (FTCS-17), Pittsburgh, Pennsylvania, 1987, pp. 150-155.
- B. Littlewood and D. R. Miller, "A conceptual model of the effect of diverse methodologies on coincident failures in multi-version software", in Proc. 3rd International GI/ITG/GMA Conference on Fault-Tolerant Computing Systems, Bremerhaven, Germany, 1987, pp. 263-272.
- P.G. Bishop, F.D. Pullen, "PODS Revisited - A Study of Software Failure Behaviour", Eighteenth Fault Tolerant Computing Symposium (FTCS-18), Tokyo, June, IEEE Computer Society Press, ISBN 0-8186-0867-6, 1988, pp. 2-8.
- B. Littlewood, D.R. Miller, "Conceptual modelling of coincident failures in multi-version software", IEEE Trans Software Engineering, Vol. 15, No 12, Dec 1989, pp 1596-1614.
- P.G. Bishop, F.D. Pullen, "Failure Masking - A Source of Dependency in Multi-Version Programming", Int. Working Conference on Dependable Computing for Critical Applications (DCCA), August, Santa Barbara, USA, IEEE Computer Society Press, 1989, pp. 25-32.
- F. Di Giandomenico and L. Strigini, "Adjudicators for Diverse-Redundant Components", in Proc. 9th Symposium of Reliable Distributed Systems (SRDS-9), Huntsville, Alabama, 1990, pp. 114-123.
- L. Strigini and F. Di Giandomenico, "Flexible schemes for application-level fault tolerance", in Proc. 10-th IEEE Symposium on Reliable Distributed Systems, Pisa, Italy, 1991, pp. 86-95.
- A. Bondavalli, J. Stankovic and L. Strigini, "Adaptable Fault Tolerance for Real-Time Systems", in D. Fussell (Ed.) "Responsive Computer Systems: Toward Integration of Fault-tolerance and Real-time", Kluwer Academic Publishers, 1994.
- P.G. Bishop, "Software Fault Tolerance by Design Diversity", in Software Fault Tolerance (ed. M. Lyu), 1995, Wiley, USA, Springer, ISBN 0-471-95068-8, pp. 211-229.
- A. Bondavalli, S. Chiaradonna, F. Di Giandomenico and L. Strigini, "Dependability Models for Iterative Software Considering Correlation among Successive Inputs", in Proc. IEEE International Symposium on Computer Performance and Dependability (IPDS'95), Erlangen, Germany, 1995, pp. 13-21.
- K. B. Djambazov and P. Popov, "The effects of testing on the reliability of single version and 1-out-of-2 software", in Proc. 6th Int. Symposium on Software Reliability Engineering, ISSRE'95, Toulouse, 1995, pp. 219-228.
- B. Littlewood, "The impact of diversity upon common mode failures", Reliability Engineering and System Safety, 51, pp. 101-113, 1996.
- L. Strigini, F. Di Giandomenico and A. Romanovsky, "Coordinated backward recovery between client processes and data servers", IEE Proceedings on Software Engineering, 144, pp. 134-146, 1997.