
Encoding of primary structures of biological macromolecules within a data mining perspective

Journal of Computer Science and Technology

Abstract

An encoding method has a direct effect on the quality and the representation of the knowledge discovered by data mining systems. Biological macromolecules are encoded as strings of characters, called primary structures. Because data mining systems usually use relational tables to encode data, these strings must be re-encoded and transformed into relational tables. In this paper, we compare the existing static encoding methods, which are based on biologists' know-how, with our new dynamic encoding method, which is based on the construction of Discriminant and Minimal Substrings (DMS). Different classification methods are used in this study. The experimental results show that our dynamic encoding method is more efficient than the static ones for encoding biological macromolecules within a data mining perspective.
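The re-encoding step described in the abstract, turning strings into a relational table, can be illustrated with a minimal sketch. It assumes the simplest possible scheme: one row per sequence and one 0/1 column per substring, marking whether that substring occurs in the sequence. The function name `encode_as_table`, the toy sequences, and the substring list are all hypothetical; in particular, the substrings are hand-picked for illustration and are not the output of the DMS construction, which the abstract does not detail.

```python
from typing import Dict, List


def encode_as_table(sequences: Dict[str, str],
                    substrings: List[str]) -> List[Dict[str, object]]:
    """Encode each sequence as a row of 0/1 substring-presence attributes.

    Hypothetical helper: an illustration of string-to-table encoding,
    not the encoding method of the paper.
    """
    rows = []
    for name, seq in sequences.items():
        row: Dict[str, object] = {"id": name}
        for sub in substrings:
            # 1 if the substring occurs anywhere in the sequence, else 0
            row[sub] = int(sub in seq)
        rows.append(row)
    return rows


# Toy data (assumed for illustration only)
sequences = {"s1": "ACGTAC", "s2": "TTGACA", "s3": "ACGGGT"}
substrings = ["ACG", "GAC", "TT"]

table = encode_as_table(sequences, substrings)
for row in table:
    print(row)
```

Each resulting row is a fixed-length attribute vector, which is the form that standard classifiers working on relational tables expect. The choice of substrings determines the columns, which is why the encoding method affects the quality of the discovered knowledge.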



Author information


Corresponding author

Correspondence to Mondher Maddouri.

Additional information

Mondher Maddouri received a B.S. degree in mathematics and physics in 1990, an M.S. degree in computer engineering in 1994, and a Ph.D. degree in computer science in 2000, all from the Faculty of Sciences of Tunis, Tunisia. He is currently an associate professor in the Computer Science Department at the National Institute of Applied Sciences and Technologies, Tunis, Tunisia. His research interests are machine learning, knowledge discovery and data mining, and computational molecular biology.

Mourad Elloumi received a B.S. degree in mathematics and physics in 1984 and an M.S. degree in computer engineering in 1988, both from the Faculty of Sciences of Tunis, Tunisia. He also received an M.S. degree in computer science in 1989 and a Ph.D. degree in computer science in 1994, both from the University of Aix-Marseilles III, France. He is currently an associate professor in the Computer Science Department at the Faculty of Economic Sciences and Management of Tunis, Tunisia. His research interests are computational molecular biology, algorithmics, and knowledge discovery and data mining.


About this article

Cite this article

Maddouri, M., Elloumi, M. Encoding of primary structures of biological macromolecules within a data mining perspective. J. Comput. Sci. & Technol. 19, 78–88 (2004). https://doi.org/10.1007/BF02944786


  • DOI: https://doi.org/10.1007/BF02944786
