Abstract
In this paper, the learning performance of different machine learning algorithms is investigated by applying fuzzy-rough feature selection (FRFS) technique on optimally balanced training and testing sets, consisting of the piezophilic and nonpiezophilic proteins. By experimenting using FRFS technique followed by Synthetic Minority Over-sampling Technique (SMOTE) at optimal balancing ratios, we obtain the best results by achieving sensitivity of 79.60%, specificity of 74.50%, average accuracy of 77.10%, AUC of 0.841, and MCC of 0.542 with random forest algorithm. The ranking of input features according to their differentiating ability of piezophilic and nonpiezophilic proteins is presented by using fuzzy-rough attribute evaluator. From the results, it is observed that the performance of classification algorithms can be improved by selecting the reduced optimally balanced training and testing sets. This can be obtained by selecting the relevant and non-redundant features from training sets using FRFS approach followed by suitably modifying the class distribution.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning approach. MIT press (2001)
Breiman, L.: Random Forests. Mach. Learn. 45(1), 5–32 (2001)
Chawla, N.V.: Data Mining for Imbalanced Datasets: An Overview. Data Mining and Knowledge Discovery Handbook, pp. 875–886. Springer (2009)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1(1–4), 131–156 (1997)
Dubois, D., Prade, H.: Putting Rough Sets and Fuzzy Sets Together Intelligent Decision Support, pp. 203–232. Springer (1992)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newslett. 11(1), 10–18 (2009)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Anal. 6(5), 429–449 (2002)
Jensen, R., Shen, Q.: Fuzzy rough attribute reduction with application to web categorization. Fuzzy Sets Syst. 141(3), 469–485 (2004a)
Jensen, R., Shen, Q.: Semantics-preserving dimensionality reduction: rough and fuzzy-rough-based approaches. IEEE Trans. Knowl. Data Eng. 16(12), 1457–1471 (2004b)
Jensen, R., Shen, Q.: Fuzzy-rough sets assisted attribute selection. IEEE Trans. Fuzzy Syst. 15(1), 73–89 (2007)
Jensen, R., Shen, Q.: Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches, Vol. 8. Wiley (2008)
Langley, P.: Selection of relevant features in machine learning. Paper presented at the Proceedings of the AAAI Fall Symposium on Relevance
Lee, P.H.: Resampling methods improve the predictive power of modeling in class-imbalanced datasets. Int. J. Environ. Res. Public Health 11(9), 9776–9789
Li, H., Pi, D., Wang, C.: The prediction of protein-protein interaction sites based on RBF classifier improved by SMOTE. Math. Prob, Eng (2014)
Ling, C., Huang, J., Zhang, H.: AUC: a better measure than accuracy in comparing learning algorithms. Adv. Artif. Intell. 991–991 (2003)
Liu, H., Motoda, H.: Feature Extraction, Construction and Selection: A Data Mining Perspective, vol. 453. Springer Science and Business Media (1998)
Lusa, L.: SMOTE for high-dimensional class-imbalanced data. BMC Bioinform. 14(1), 106 (2013)
Nath, A., Chaube, R., Karthikeyan, S.: Discrimination of psychrophilic and mesophilic proteins using random forest algorithm. Paper presented at the 2012 International Conference on Biomedical Engineering and Biotechnology (iCBEB) (2012)
Nath, A., Karthikeyan, S.: Enhanced prediction and characterization of CDK inhibitors using optimal class distribution. Interdisc. Sci. Comput. Life Sci. 9(2), 292–303 (2017)
Nath, A., Subbiah, K.: Inferring biological basis about psychrophilicity by interpreting the rules generated from the correctly classified input instances by a classifier. Comput. Biol. Chem. 53, 198–203 (2014)
Nath, A., Subbiah, K.: Maximizing lipocalin prediction through balanced and diversified training set and decision fusion. Comput. Biol. Chem. 59, 101–110 (2015)
Nath, A., Subbiah, K.: Insights into the molecular basis of piezophilic adaptation: extraction of piezophilic signatures. J. Theoret. Biol. 390, 117–126 (2016)
Okun, O.: Feature Selection and Ensemble Methods for Bioinformatics: Algorithmic Classification and Implementations. Information Science Reference-Imprint of IGI Publishing (2011)
Pawlak, Z.: Rough sets. Int. J. Parallel. Program. 11(5), 341–356 (1982)
Platt, J.: Sequential minimal optimization: a fast algorithm for training support vector machines (1998)
Prompramote, S., Chen, Y., Chen, Y.-P.P.: Machine learning in bioinformatics. In: Chen, Y.-P.P. (ed.) Bioinformatics Technologies, pp. 117–153. Springer, Berlin Heidelberg, Berlin, Heidelberg (2005)
Rodriguez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation forest: a new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1619–1630 (2006)
Ruck, D.W., Rogers, S.K., Kabrisky, M., Oxley, M.E., Suter, B.W.: The multilayer perceptron as an approximation to a bayes optimal discriminant function. IEEE Trans. Neural Netw. 1(4), 296–298 (1990)
Tiwari, A.K., Nath, A., Subbiah, K., Shukla, K.K.: Effect of varying degree of resampling on prediction accuracy for observed peptide count in protein mass spectrometry data. Paper presented at the 2015 11th International Conference on Natural Computation (ICNC) (2015)
Tiwari, A.K., Nath, A., Subbiah, K., Shukla, K.K.: Enhanced prediction for observed peptide count in protein mass spectrometry data by optimally balancing the training dataset. Int. J. Pattern Recogn. Artif. Intell. 1750040 (2017)
Vani, K.S., Bhavani, S.D.: SMOTE based protein fold prediction classification. In: Advances in Computing and Information Technology, pp. 541–550. Springer (2013)
Wang, L., Fu, X.: Data Mining with Computational Intelligence. Springer Science and Business Media (2006)
Weiss, G.M., Provost, F.: The effect of class distribution on classifier learning: an empirical study. Rutgers Univ (2001)
Weiss, G.M., Provost, F.: Learning when training data are costly: the effect of class distribution on tree induction. J. Artif. Intell. Res. 19, 315–354 (2003)
Zadeh, L.A.: Fuzzy sets. Inf. Control 8(3), 338–353 (1965)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tiwari, A.K., Shreevastava, S., Subbiah, K., Som, T. (2018). Enhanced Prediction for Piezophilic Protein by Incorporating Reduced Set of Amino Acids Using Fuzzy-Rough Feature Selection Technique Followed by SMOTE. In: Ghosh, D., Giri, D., Mohapatra, R., Sakurai, K., Savas, E., Som, T. (eds) Mathematics and Computing. ICMC 2018. Springer Proceedings in Mathematics & Statistics, vol 253. Springer, Singapore. https://doi.org/10.1007/978-981-13-2095-8_15
Download citation
DOI: https://doi.org/10.1007/978-981-13-2095-8_15
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2094-1
Online ISBN: 978-981-13-2095-8
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)