Naïve Bayes-Based Classification for Short Microbial Genes Using Chaos Game Representation

  • Conference paper

Abstract

Recognizing short genes that encode small proteins can provide essential new biological insights. This chapter presents a novel method for predicting short genes based on chaos game representation (CGR), a graphical representation of biological sequences such as DNA and proteins. CGR represents each DNA sequence uniquely and reveals hidden patterns in it. In this study, genomic features are extracted by computing the frequency chaos game representation (FCGR) matrix. FCGR matrices of order 2, 3 and 4 are considered, which consist of 16, 64 and 256 elements, respectively. The elements of these matrices act as the feature descriptor for classification. Principal component analysis (PCA) is used as a preprocessing step to reduce the dimensionality of the feature vector and to improve classification performance. The proposed classification method combines FCGR with a state-of-the-art pattern recognition algorithm, the Naïve Bayes classifier. The experimental results reveal the potential of this representation for discriminating short genes from noncoding DNA.
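As a rough illustration of the feature extraction step described in the abstract, the sketch below computes an order-k FCGR matrix as a grid of normalized k-mer frequencies with 2^k cells per side, which gives 16, 64 and 256 elements for orders 2, 3 and 4. The function names `fcgr` and `fcgr_vector`, the corner assignment for A, C, G and T, the normalization by the total k-mer count, and the skipping of ambiguous bases are all assumptions made for this example; the abstract does not specify the chapter's implementation, and the PCA and Naïve Bayes steps are not shown.

```python
# Corner assignment (x, y) for each base. This follows one common CGR
# convention and is an assumption, not taken from the chapter.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Return the order-k FCGR matrix (2^k x 2^k) of normalized k-mer frequencies."""
    size = 2 ** k
    counts = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous symbols such as N (assumed policy).
        if any(base not in CORNERS for base in kmer):
            continue
        x = y = 0
        # In CGR the most recently plotted base selects the coarsest quadrant,
        # so the last base of the k-mer becomes the most significant bit.
        for j, base in enumerate(kmer):
            bx, by = CORNERS[base]
            x |= bx << j
            y |= by << j
        counts[y][x] += 1
        total += 1
    if total == 0:
        return [[0.0] * size for _ in range(size)]
    return [[c / total for c in row] for row in counts]


def fcgr_vector(sequence, k):
    """Flatten the FCGR matrix row by row into a feature vector of 4^k elements."""
    return [value for row in fcgr(sequence, k) for value in row]


if __name__ == "__main__":
    example = "ATGGCGTACGTTAGCCGATAA"  # made-up sequence for illustration only
    for order in (2, 3, 4):
        print(order, len(fcgr_vector(example, order)))
```

Each flattened vector could then be passed to a dimensionality reduction step and a classifier; how the chapter combines or compares the three orders is not stated in the abstract.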

Author information

Correspondence to Baharak Goli.

Copyright information

© 2012 Springer India

About this paper

Cite this paper

Goli, B., Aswathi, B.L., Nair, A.S. (2012). Naïve Bayes-Based Classification for Short Microbial Genes Using Chaos Game Representation. In: Sabu, A., Augustine, A. (eds) Prospects in Bioscience: Addressing the Issues. Springer, India. https://doi.org/10.1007/978-81-322-0810-5_5
