A Novel Technique of Feature Selection with ReliefF and CFS for Protein Sequence Classification

  • Conference paper
Recent Findings in Intelligent Computing Techniques

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 707)

Abstract

Bioinformatics has gained wide importance as a research area over the last few decades. Its main aim is to store biological data and analyze it for better understanding. Classifying existing protein sequences is of great use in predicting the functions of newly added protein sequences. Protein sequence data is accumulating at an exponentially increasing rate, so dealing with the large number of features produced by the various encoding techniques has become a very challenging task for researchers. Here, a two-stage feature selection algorithm is proposed that combines the ReliefF and CFS (correlation-based feature selection) techniques: it takes the extracted features as input and returns a discriminative subset of features. The n-gram sequence encoding technique is used to extract the feature vector from the protein sequences. In the first stage, ReliefF is used to rank the features and obtain a candidate feature set. In the second stage, CFS is applied to this candidate feature set to obtain features that are highly correlated with the class but weakly correlated with one another. Classification methods such as Naive Bayes, decision tree, and k-nearest neighbor can be used to analyze the performance of the proposed approach. It is observed that this approach increases the accuracy of these classification methods in comparison to existing methods.
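The two-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`ngram_features`, `relieff`, `cfs_forward`, `two_stage_select`) and the parameters `top_m` and `k` are hypothetical, the ReliefF pass visits every instance instead of sampling, labels are assumed to be numeric (0/1), and absolute Pearson correlation stands in for whatever correlation measure the paper uses inside CFS.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ngram_features(seq, n=2, alphabet=AMINO_ACIDS):
    """Relative frequency of every possible n-gram over the alphabet."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = max(sum(counts.values()), 1)
    return [counts["".join(g)] / total for g in product(alphabet, repeat=n)]

def relieff(X, y, k=1):
    """Stage 1 (assumed form): ReliefF weights, using all instances."""
    n, d = len(X), len(X[0])
    # Feature ranges, used to normalise differences to [0, 1].
    span = [(max(col) - min(col)) or 1.0 for col in zip(*X)]
    diff = lambda a, b, j: abs(a[j] - b[j]) / span[j]
    dist = lambda a, b: sum(diff(a, b, j) for j in range(d))
    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    w = [0.0] * d
    for i in range(n):
        for c in prior:
            # k nearest neighbours of instance i inside class c.
            near = sorted((m for m in range(n) if m != i and y[m] == c),
                          key=lambda m: dist(X[i], X[m]))[:k]
            if not near:
                continue
            # Hits lower the weight; misses raise it, scaled by class prior.
            scale = -1.0 if c == y[i] else prior[c] / (1 - prior[y[i]])
            for j in range(d):
                w[j] += scale * sum(diff(X[i], X[m], j) for m in near) / (len(near) * n)
    return w

def abs_pearson(a, b):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return abs(cov) / sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def cfs_forward(X, y, candidates):
    """Stage 2 (assumed form): greedy forward search on the CFS merit."""
    cols = {j: [row[j] for row in X] for j in candidates}
    r_cf = {j: abs_pearson(cols[j], y) for j in candidates}

    def merit(S):
        # merit = k * mean(r_cf) / sqrt(k + k * (k - 1) * mean(r_ff))
        k = len(S)
        mean_cf = sum(r_cf[j] for j in S) / k
        pairs = [(a, b) for a in S for b in S if a < b]
        mean_ff = (sum(abs_pearson(cols[a], cols[b]) for a, b in pairs) / len(pairs)
                   if pairs else 0.0)
        return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)

    selected, best = [], 0.0
    while True:
        gains = [(merit(selected + [j]), j) for j in candidates if j not in selected]
        if not gains:
            break
        m, j = max(gains, key=lambda g: g[0])
        if m <= best:  # stop when no candidate improves the merit
            break
        selected.append(j)
        best = m
    return selected

def two_stage_select(X, y, top_m, k=1):
    """Rank with ReliefF, keep the top_m features, then filter with CFS."""
    w = relieff(X, y, k)
    ranked = sorted(range(len(w)), key=lambda j: w[j], reverse=True)
    return cfs_forward(X, y, ranked[:top_m])
```

In this sketch, each protein sequence would first be turned into a row of `X` with `ngram_features`, and the indices returned by `two_stage_select` would then be used to subset the columns before training a classifier such as Naive Bayes, a decision tree, or k-nearest neighbor. The cut-off `top_m` controls how many ReliefF-ranked features reach the CFS stage; the abstract does not state what value the authors use.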

Author information

Corresponding author

Correspondence to Kiranpreet Kaur.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kaur, K., Patil, N. (2019). A Novel Technique of Feature Selection with ReliefF and CFS for Protein Sequence Classification. In: Sa, P., Bakshi, S., Hatzilygeroudis, I., Sahoo, M. (eds) Recent Findings in Intelligent Computing Techniques. Advances in Intelligent Systems and Computing, vol 707. Springer, Singapore. https://doi.org/10.1007/978-981-10-8639-7_41
