
Mining Biomolecular Data Using Background Knowledge and Artificial Neural Networks

  • Chapter
Handbook of Massive Data Sets

Part of the book series: Massive Computing (MACO, volume 4)

Abstract

Biomolecular data mining is the task of finding significant information in protein, DNA, and RNA molecules. Such information may take the form of motifs, clusters, genes, protein signatures, or classification rules. This chapter presents one example of biomolecular data mining: the recognition of promoters in DNA. We propose a two-level ensemble of classifiers to recognize E. coli promoter sequences. The first-level classifiers include three Bayesian neural networks, each of which learns from a different feature set. The second level combines the outputs of the first-level classifiers to give the final result. To improve the recognition rate, we use background knowledge (i.e., the characteristics of the promoter sequences) and employ new techniques to extract high-level features from the sequences. We also use an expectation-maximization (EM) algorithm to locate the binding sites of the promoter sequences. An empirical study shows that the approach achieves a precision rate of 95%, indicating that it performs well on this task.
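The two-level structure described in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's method: the three first-level scorers below are simple hand-written stand-ins for the Bayesian neural networks, the consensus strings TTGACA and TATAAT are the commonly cited E. coli -35 and -10 hexamers rather than features taken from the chapter, and the weighted average, the weights, and the 0.7 threshold are all assumptions chosen for illustration. The abstract does not say how the second level combines the first-level outputs.

```python
from typing import Callable, List, Sequence


def best_match_fraction(seq: str, consensus: str) -> float:
    """Best fraction of positions matching `consensus` over all windows of `seq`."""
    k = len(consensus)
    if len(seq) < k:
        return 0.0
    best = 0
    for i in range(len(seq) - k + 1):
        matches = sum(1 for a, b in zip(seq[i:i + k], consensus) if a == b)
        best = max(best, matches)
    return best / k


def at_fraction(seq: str) -> float:
    """Fraction of bases that are A or T (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(1 for c in seq if c in "AT") / len(seq)


# First level: three scorers, each looking at a different feature of the
# sequence. These are hypothetical stand-ins, not trained networks.
FIRST_LEVEL: List[Callable[[str], float]] = [
    lambda s: best_match_fraction(s, "TTGACA"),  # similarity to -35 consensus
    lambda s: best_match_fraction(s, "TATAAT"),  # similarity to -10 consensus
    at_fraction,                                 # overall A/T content
]


def combine(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Second level: weighted average of the first-level scores (an assumption)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def classify(seq: str,
             weights: Sequence[float] = (1.0, 1.0, 1.0),
             threshold: float = 0.7) -> bool:
    """Return True if the combined score reaches the (illustrative) threshold."""
    scores = [scorer(seq.upper()) for scorer in FIRST_LEVEL]
    return combine(scores, weights) >= threshold


if __name__ == "__main__":
    promoter_like = "TTGACA" + "ATATATATATATATATA" + "TATAAT"
    gc_rich = "GCGCGCGCGCGCGCGCGCGC"
    print(classify(promoter_like), classify(gc_rich))
```

In a trained system the weights, or the second-level combiner itself, would be learned from held-out first-level outputs rather than fixed by hand.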



Copyright information

© 2002 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Ma, Q., Wang, J.T.L., Gattiker, J.R. (2002). Mining Biomolecular Data Using Background Knowledge and Artificial Neural Networks. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_30


  • DOI: https://doi.org/10.1007/978-1-4615-0005-6_30

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-4882-5

  • Online ISBN: 978-1-4615-0005-6

  • eBook Packages: Springer Book Archive
