A Novel Approach to Find the Saturation Point of n-Gram Encoding Method for Protein Sequence Classification Involving Data Mining

Saha, Suprativ; Bhattacharya, Tanmay

doi:10.1007/978-981-13-2354-6_12

Suprativ Saha & Tanmay Bhattacharya

Conference paper
First Online: 20 November 2018

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 56)

Abstract

In biological data mining, protein sequence classification is one of the most popular research areas. To classify a protein sequence, features must be extracted from the input data. Many researchers have used the n-gram encoding method to extract feature values. Generally, to reduce computational time, the value of n is set to 2, but classification accuracy then degrades. It is therefore important to find the optimum value of n for the n-gram encoding method, at which both computational time and classification accuracy are acceptable. In this work, an experimental attempt has been made to fix the scaling limit of the n-gram encoding method over the range from 2-gram to 5-gram. The standard deviation method has been used for this purpose.
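The trade-off described in the abstract can be illustrated with a minimal sketch. It assumes the usual reading of n-gram encoding (counting overlapping windows of n consecutive residues over the 20 standard amino acids); the abstract does not say how the authors apply the standard deviation method, so the `count_spread` helper below is only an illustrative assumption, and all function names and the toy sequence are hypothetical.

```python
from collections import Counter
from statistics import pstdev

# The 20 standard amino acid one-letter codes (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_counts(sequence, n):
    """Count overlapping n-grams (windows of n consecutive residues)."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def feature_space_size(n):
    """Number of possible n-grams over the 20-letter alphabet (20 ** n)."""
    return len(AMINO_ACIDS) ** n


def count_spread(sequence, n):
    """Population standard deviation of the observed n-gram counts.

    Illustrative only: not necessarily the statistic used in the paper.
    """
    counts = list(ngram_counts(sequence, n).values())
    return pstdev(counts) if counts else 0.0


if __name__ == "__main__":
    toy = "MKVLAAGIVGLMKVLAAG"  # hypothetical toy sequence, not real data
    for n in range(2, 6):
        print(n, feature_space_size(n), len(ngram_counts(toy, n)),
              round(count_spread(toy, n), 3))
```

The feature space grows from 400 possible 2-grams to 3,200,000 possible 5-grams, which is one way to see why computational time rises sharply with n.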

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Brainware University, Barasat, Kolkata, India
Suprativ Saha
Department of Information Technology, Techno India, Salt Lake, Kolkata, India
Tanmay Bhattacharya

Corresponding author

Correspondence to Suprativ Saha.

Editor information

Editors and Affiliations

Department of Computer Application, RCC Institute of Information Technology, Kolkata, West Bengal, India
Siddhartha Bhattacharyya
Faculty of Computers and Information, Cairo University, Giza, Egypt
Aboul Ella Hassanien
Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, New Delhi, Delhi, India
Deepak Gupta
Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, New Delhi, Delhi, India
Ashish Khanna
Department of Information Technology, RCC Institute of Information Technology, Kolkata, West Bengal, India
Indrajit Pan

About this paper

Cite this paper

Saha, S., Bhattacharya, T. (2019). A Novel Approach to Find the Saturation Point of n-Gram Encoding Method for Protein Sequence Classification Involving Data Mining. In: Bhattacharyya, S., Hassanien, A., Gupta, D., Khanna, A., Pan, I. (eds) International Conference on Innovative Computing and Communications. Lecture Notes in Networks and Systems, vol 56. Springer, Singapore. https://doi.org/10.1007/978-981-13-2354-6_12

DOI: https://doi.org/10.1007/978-981-13-2354-6_12
Published: 20 November 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2353-9
Online ISBN: 978-981-13-2354-6
eBook Packages: Intelligent Technologies and Robotics (R0)
