Developing Classification Techniques from Biological Databases Using Simulated Annealing

  • B. de la Iglesia
  • J. J. Wesselink
  • V. J. Rayward-Smith
  • J. Dicks
  • I. N. Roberts
  • V. Robert
  • T. Boekhout
Part of the Applied Optimization book series (APOP, volume 86)


This paper describes new approaches to the classification and identification of biological data. The work may extend to other domains, such as medical diagnosis or fault diagnosis. Organisms are often classified according to the results of tests, each of which measures some characteristic of the organism. When selecting a suitable test set, it is important to choose one of minimum cost. Equally, when classification models are constructed for the subsequent identification of unnamed individuals, it is important to produce models that are optimal in terms of identification performance and cost. In this paper, we first describe the problem of selecting an economic test set for classification, and we develop a criterion for the differentiation of organisms which may encompass fuzzy differentiability. We then describe the problem of using batches of tests sequentially to identify unknown organisms, and we explore how to construct the best sequence of batches in terms of cost and identification performance. We discuss how metaheuristic algorithms may be used to solve these problems, and we present an application to yeast classification and identification.
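As a rough illustration of the test-set selection problem outlined above, the following is a minimal simulated annealing sketch, not the authors' implementation. It assumes a crisp differentiation criterion (two organisms are separated if any selected test gives them different results), a single-test flip neighbourhood, a geometric cooling schedule, and a penalty-weighted objective; the function names, parameters, and defaults are all hypothetical choices made for this example.

```python
import math
import random


def undifferentiated_pairs(organisms, selected):
    """Count organism pairs whose results agree on every selected test."""
    count = 0
    for i in range(len(organisms)):
        for j in range(i + 1, len(organisms)):
            if all(organisms[i][t] == organisms[j][t] for t in selected):
                count += 1
    return count


def objective(organisms, costs, mask, penalty):
    """Total cost of the selected tests plus a penalty per unseparated pair."""
    selected = [t for t, on in enumerate(mask) if on]
    cost = sum(costs[t] for t in selected)
    return cost + penalty * undifferentiated_pairs(organisms, selected)


def anneal_test_set(organisms, costs, penalty=None, t0=10.0, alpha=0.95,
                    steps=2000, seed=0):
    """Search for a cheap test set that still separates all organisms.

    organisms: list of rows, one per organism, holding one result per test.
    costs: cost of each test (same order as the columns of `organisms`).
    All tuning parameters are illustrative assumptions, not published values.
    """
    rng = random.Random(seed)
    n_tests = len(costs)
    if penalty is None:
        # One unseparated pair then costs more than running every test.
        penalty = sum(costs) + 1
    current = [True] * n_tests  # start from the full test set
    current_val = objective(organisms, costs, current, penalty)
    best, best_val = current[:], current_val
    temp = t0
    for _ in range(steps):
        # Neighbour: add or remove a single randomly chosen test.
        neighbour = current[:]
        k = rng.randrange(n_tests)
        neighbour[k] = not neighbour[k]
        val = objective(organisms, costs, neighbour, penalty)
        delta = val - current_val
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_val = neighbour, val
            if val < best_val:
                best, best_val = neighbour[:], val
        temp = max(temp * alpha, 1e-9)  # geometric cooling, kept positive
    return [t for t, on in enumerate(best) if on], best_val
```

With the default penalty, any solution that leaves a pair of organisms unseparated scores worse than the full test set, so if the full set separates every organism the returned set does too. A fuzzy notion of differentiability could be approximated by replacing the equality check in `undifferentiated_pairs` with a graded measure, although the sketch does not attempt this.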


Classification, Identification, Minimum test set (MTS), Heuristic techniques





Copyright information

© Springer Science+Business Media New York 2003

Authors and Affiliations

  • B. de la Iglesia (1)
  • J. J. Wesselink (1)
  • V. J. Rayward-Smith (1)
  • J. Dicks (2)
  • I. N. Roberts (3)
  • V. Robert (4)
  • T. Boekhout (4)

  1. School of Information Systems, University of East Anglia, Norwich, England
  2. John Innes Centre, Norwich Research Park, Colney, Norwich, England
  3. Institute of Food Research, Norwich Research Park, Colney, Norwich, England
  4. Centraalbureau voor Schimmelcultures, Utrecht, The Netherlands
