A Novel Approach to Find the Saturation Point of n-Gram Encoding Method for Protein Sequence Classification Involving Data Mining

  • Conference paper

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 56)

Abstract

In biological data mining, protein sequence classification is one of the most popular research areas. To classify a protein sequence, features must first be extracted from the input data. Many researchers have used the n-gram encoding method to extract feature values. To reduce computational time, n is generally set to 2, but classification accuracy then degrades. It is therefore important to find the optimum value of n for the n-gram encoding method, at which both computational time and classification accuracy are acceptable. In this work, an experimental attempt has been made to fix the limit of scaling of the n-gram encoding method within the range from 2-gram to 5-gram. The standard deviation method has been used for this purpose.
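As a rough illustration of the ideas named in the abstract, the following sketch counts overlapping n-grams in a protein sequence for n = 2 to 5 and reports the standard deviation of those counts. It is only a minimal sketch under stated assumptions: the abstract does not describe the exact feature definition or how the standard deviation is applied, so the helper names `ngram_counts` and `ngram_count_stdev`, the use of the population standard deviation, and the toy sequence are all hypothetical choices made here, not the authors' procedure.

```python
from collections import Counter
from statistics import pstdev

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_counts(sequence, n):
    """Count overlapping n-grams (substrings of length n) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_count_stdev(sequence, n):
    """Population standard deviation of the counts of the n-grams that occur.

    Assumption: this is one plausible way to summarise an n-gram profile;
    the paper may define its statistic differently.
    """
    counts = list(ngram_counts(sequence, n).values())
    return pstdev(counts) if counts else 0.0


# Hypothetical toy sequence, used only to exercise the functions.
toy_sequence = "MKVLAAGIVGLKVLAAG"
for n in range(2, 6):
    possible = len(AMINO_ACIDS) ** n  # size of the n-gram feature space
    observed = len(ngram_counts(toy_sequence, n))
    spread = ngram_count_stdev(toy_sequence, n)
    print(f"n={n}: {possible} possible n-grams, {observed} observed, stdev={spread:.3f}")
```

The number of possible n-grams over a 20-letter alphabet is 20^n, so the feature space grows from 400 at n = 2 to 3,200,000 at n = 5, which is one way to see why larger n raises computational cost.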

Author information

Corresponding author

Correspondence to Suprativ Saha.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Saha, S., Bhattacharya, T. (2019). A Novel Approach to Find the Saturation Point of n-Gram Encoding Method for Protein Sequence Classification Involving Data Mining. In: Bhattacharyya, S., Hassanien, A., Gupta, D., Khanna, A., Pan, I. (eds) International Conference on Innovative Computing and Communications. Lecture Notes in Networks and Systems, vol 56. Springer, Singapore. https://doi.org/10.1007/978-981-13-2354-6_12
