Abstract
Bioinformatics has grown in importance as a research area over the last few decades. Its main aim is to store biological data and analyze it for better understanding. Classifying existing protein sequences is of great use in predicting the functions of newly added sequences. Protein sequence data is accumulating at an exponentially increasing rate, so dealing with the large number of features produced by various encoding techniques has become a challenging task for researchers. Here, a two-stage feature selection algorithm is proposed that combines the ReliefF and CFS (correlation-based feature selection) techniques: it takes the extracted features as input and returns a discriminative subset of features. The n-gram sequence encoding technique is used to extract the feature vector from the protein sequences. In the first stage, ReliefF ranks the features to obtain a candidate feature set. In the second stage, CFS is applied to this candidate set to select features that are highly correlated with the class but weakly correlated with one another. Classification methods such as Naive Bayes, decision tree, and k-nearest neighbor can be used to analyze the performance of the proposed approach. The approach is observed to increase the accuracy of these classification methods in comparison with existing methods.
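The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several simplifications are assumptions made here for brevity: the ranking stage uses a basic Relief variant (one nearest hit and one nearest miss per sample, binary classes) rather than full ReliefF, the CFS merit is computed from absolute Pearson correlations rather than from an entropy-based measure on discretized features, and the function names, the `top_m` cutoff, and the toy sequences are all hypothetical.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ngram_vector(seq, n=2):
    """Relative frequency of every possible n-gram over the 20 amino acids."""
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = max(len(seq) - n + 1, 1)
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        if gram in counts:  # skip n-grams containing non-standard residues
            counts[gram] += 1
    return [counts[k] / total for k in keys]

def relief_scores(X, y):
    """Simplified Relief (assumption: not full ReliefF): 1 nearest hit, 1 nearest miss."""
    n_feat = len(X[0])
    span = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(n_feat)]
    def diff(a, b):
        return [abs(a[j] - b[j]) / span[j] for j in range(n_feat)]
    w = [0.0] * n_feat
    for i, xi in enumerate(X):
        hits = [diff(xi, X[k]) for k in range(len(X)) if k != i and y[k] == y[i]]
        misses = [diff(xi, X[k]) for k in range(len(X)) if y[k] != y[i]]
        if not hits or not misses:
            continue
        hit, miss = min(hits, key=sum), min(misses, key=sum)  # Manhattan distance
        for j in range(n_feat):
            w[j] += (miss[j] - hit[j]) / len(X)
    return w

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def cfs_merit(cols, y, subset):
    """CFS merit: k * r_cf / sqrt(k + k(k-1) * r_ff), using |Pearson| (assumption)."""
    k = len(subset)
    r_cf = sum(abs(pearson(cols[j], y)) for j in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(cols[a], cols[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

def two_stage_select(X, y, top_m=10):
    """Stage 1: keep the top_m Relief-ranked features. Stage 2: greedy forward CFS."""
    scores = relief_scores(X, y)
    candidates = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top_m]
    cols = {j: [row[j] for row in X] for j in candidates}
    selected, best = [], 0.0
    while True:
        trials = [(cfs_merit(cols, y, selected + [j]), j)
                  for j in candidates if j not in selected]
        if not trials:
            break
        merit, j = max(trials)
        if merit <= best:  # stop when adding a feature no longer improves the merit
            break
        selected.append(j)
        best = merit
    return selected, best

# Toy data (hypothetical): class 0 is rich in "AA", class 1 is rich in "GG".
seqs = ["AAAAGAAA", "AAAKAAAA", "GGGGAGGG", "GGGKGGGG"]
y = [0, 0, 1, 1]
X = [ngram_vector(s, n=2) for s in seqs]
selected, merit = two_stage_select(X, y, top_m=10)
print(selected, round(merit, 3))
```

The selected feature indices could then be used to reduce the n-gram vectors before training any of the classifiers mentioned above. Running the cheap ranking stage first keeps the pairwise correlation computations of the second stage limited to a small candidate set.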
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kaur, K., Patil, N. (2019). A Novel Technique of Feature Selection with ReliefF and CFS for Protein Sequence Classification. In: Sa, P., Bakshi, S., Hatzilygeroudis, I., Sahoo, M. (eds) Recent Findings in Intelligent Computing Techniques. Advances in Intelligent Systems and Computing, vol 707. Springer, Singapore. https://doi.org/10.1007/978-981-10-8639-7_41
DOI: https://doi.org/10.1007/978-981-10-8639-7_41
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8638-0
Online ISBN: 978-981-10-8639-7
eBook Packages: Intelligent Technologies and Robotics (R0)