Mining Biomolecular Data Using Background Knowledge and Artificial Neural Networks

Ma, Qicheng; Wang, Jason T. L.; Gattiker, James R.

doi:10.1007/978-1-4615-0005-6_30

Part of the book series: Massive Computing (MACO, volume 4)

Abstract

Biomolecular data mining is the activity of finding significant information in protein, DNA and RNA molecules. The significant information may refer to motifs, clusters, genes, protein signatures and classification rules. This chapter presents an example of biomolecular data mining: the recognition of promoters in DNA. We propose a two-level ensemble of classifiers to recognize E. coli promoter sequences. The first level consists of three Bayesian neural networks, each of which learns from a different feature set. The outputs of the first-level classifiers are combined in the second level to give the final result. To enhance the recognition rate, we use background knowledge (i.e., the characteristics of the promoter sequences) and employ new techniques to extract high-level features from the sequences. We also use an expectation-maximization (EM) algorithm to locate the binding sites of the promoter sequences. An empirical study shows that the approach achieves a precision rate of 95%, indicating excellent performance.
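
The two-level structure described above can be illustrated with a minimal sketch. It is not the chapter's implementation: the three first-level scorers below are toy stand-ins for the Bayesian neural networks (they score similarity to the well-known E. coli -35 and -10 consensus hexamers, TTGACA and TATAAT, plus AT content), and the second level is assumed to be a weighted average with a 0.5 threshold, since the abstract does not say how the outputs are combined. All function names, weights and the threshold are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical first-level scorer type: maps a DNA sequence to a score in [0, 1].
Scorer = Callable[[str], float]


def best_match_fraction(seq: str, motif: str) -> float:
    """Best fraction of positions matching `motif` over all windows of `seq`."""
    k = len(motif)
    if len(seq) < k:
        return 0.0
    return max(
        sum(a == b for a, b in zip(seq[i:i + k], motif)) / k
        for i in range(len(seq) - k + 1)
    )


# Toy stand-ins for the three first-level classifiers (assumptions, not the
# chapter's Bayesian neural networks or its feature sets).
def minus35_score(seq: str) -> float:
    return best_match_fraction(seq.upper(), "TTGACA")


def minus10_score(seq: str) -> float:
    return best_match_fraction(seq.upper(), "TATAAT")


def at_content(seq: str) -> float:
    s = seq.upper()
    return sum(c in "AT" for c in s) / len(s) if s else 0.0


def combine(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Second level (assumed): weighted average of the first-level outputs."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def classify(seq: str, scorers: Sequence[Scorer],
             weights: Sequence[float], threshold: float = 0.5) -> bool:
    """Return True if the combined score reaches the (assumed) threshold."""
    scores = [scorer(seq) for scorer in scorers]
    return combine(scores, weights) >= threshold


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    return tp / (tp + fp) if (tp + fp) else 0.0


SCORERS = [minus35_score, minus10_score, at_content]
WEIGHTS = [1.0, 1.0, 1.0]  # equal weights, chosen only for illustration

promoter_like = "TTGACA" + "A" * 17 + "TATAAT"  # synthetic example sequence
gc_rich = "GGGGCCCCGGGGCCCC"                     # synthetic negative example

print(classify(promoter_like, SCORERS, WEIGHTS))
print(classify(gc_rich, SCORERS, WEIGHTS))
```

In a real system the second level could equally be a trained combiner rather than a fixed average; the fixed weights here only keep the sketch self-contained.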

Author information

Authors and Affiliations

Department of Computer and Information Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
Qicheng Ma & Jason T. L. Wang
Los Alamos National Laboratory, Mail Stop E541, Los Alamos, NM, 87544, USA
James R. Gattiker

Editor information

Editors and Affiliations

AT&T Labs Research, USA
James Abello & Mauricio G. C. Resende
University of Florida, USA
Panos M. Pardalos

About this chapter

Cite this chapter

Ma, Q., Wang, J.T.L., Gattiker, J.R. (2002). Mining Biomolecular Data Using Background Knowledge and Artificial Neural Networks. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_30

DOI: https://doi.org/10.1007/978-1-4615-0005-6_30
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-4882-5
Online ISBN: 978-1-4615-0005-6
eBook Packages: Springer Book Archive
