Abstract
Studies that focus on recognizing short genes encoding small proteins will provide essential new biological insights. This chapter presents a novel method for predicting short genes based on chaos game representation (CGR). CGR is a graphical representation of biological sequences such as DNA and proteins; it represents each DNA sequence uniquely and reveals hidden patterns in it. In this study, genomic features are extracted by computing the frequency chaos game representation (FCGR) matrix. FCGR matrices of order 2, 3 and 4 are considered, which consist of 16, 64 and 256 elements, respectively. The elements of these matrices act as the feature descriptor for classification. We used principal component analysis (PCA) as a preprocessing step to reduce the dimensionality of the feature vector and to improve classification performance. A novel classification method that combines FCGR with a state-of-the-art pattern recognition algorithm, the Naïve Bayes classifier, is proposed. The experimental results reveal the potential of this representation for discriminating short genes from noncoding DNA.
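The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fcgr` and `fcgr_features`, the corner layout (A, C, G, T placed at the corners of the unit square), the skipping of k-mers with ambiguous bases, and the normalization to relative frequencies are all assumptions made for illustration. The sketch only shows why an order-k FCGR matrix has 2^k x 2^k cells, giving 16, 64 and 256 elements for k = 2, 3 and 4.

```python
# Assumed corner layout (hypothetical choice for this sketch):
# A = (0, 0), C = (0, 1), G = (1, 1), T = (1, 0), given as (x, y).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Return the order-k FCGR as a 2^k x 2^k matrix of k-mer counts."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Assumption: k-mers containing ambiguous bases (e.g. N) are skipped.
        if any(base not in CORNERS for base in kmer):
            continue
        x = y = 0
        for pos, base in enumerate(kmer):
            bit_x, bit_y = CORNERS[base]
            # Each CGR step moves halfway toward a corner, so the most
            # recent base contributes the most significant bit of the cell.
            x |= bit_x << pos
            y |= bit_y << pos
        matrix[y][x] += 1
    return matrix


def fcgr_features(sequence, k):
    """Flatten the FCGR matrix into a vector of relative k-mer frequencies."""
    flat = [count for row in fcgr(sequence, k) for count in row]
    total = sum(flat)
    if total == 0:
        return [0.0] * len(flat)
    return [count / total for count in flat]
```

Under these assumptions, `fcgr_features(seq, 3)` yields a 64-value vector for one sequence. Vectors of this kind could then be reduced with PCA and passed to a Naïve Bayes classifier; those steps are not shown here.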
© 2012 Springer India
About this paper
Cite this paper
Goli, B., Aswathi, B.L., Nair, A.S. (2012). Naïve Bayes-Based Classification for Short Microbial Genes Using Chaos Game Representation. In: Sabu, A., Augustine, A. (eds) Prospects in Bioscience: Addressing the Issues. Springer, India. https://doi.org/10.1007/978-81-322-0810-5_5
DOI: https://doi.org/10.1007/978-81-322-0810-5_5
Publisher Name: Springer, India
Print ISBN: 978-81-322-0809-9
Online ISBN: 978-81-322-0810-5
eBook Packages: Biomedical and Life Sciences, Biomedical and Life Sciences (R0)