Enhanced Prediction for Piezophilic Protein by Incorporating Reduced Set of Amino Acids Using Fuzzy-Rough Feature Selection Technique Followed by SMOTE

Tiwari, Anoop Kumar; Shreevastava, Shivam; Subbiah, Karthikeyan; Som, Tanmoy

doi:10.1007/978-981-13-2095-8_15

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 253)

Included in the following conference series:

International Conference on Mathematics and Computing

Abstract

This paper investigates the learning performance of several machine learning algorithms when the fuzzy-rough feature selection (FRFS) technique is applied to optimally balanced training and testing sets of piezophilic and nonpiezophilic proteins. Applying FRFS followed by the Synthetic Minority Over-sampling Technique (SMOTE) at optimal balancing ratios gives the best results with the random forest algorithm: a sensitivity of 79.60%, a specificity of 74.50%, an average accuracy of 77.10%, an AUC of 0.841, and an MCC of 0.542. The input features are ranked by their ability to discriminate between piezophilic and nonpiezophilic proteins using a fuzzy-rough attribute evaluator. The results indicate that the performance of classification algorithms can be improved by using reduced, optimally balanced training and testing sets, obtained by selecting the relevant and non-redundant features from the training sets with FRFS and then suitably modifying the class distribution.
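The class-balancing step mentioned in the abstract can be illustrated with a minimal sketch of SMOTE-style oversampling. This is not the authors' implementation: the function name `smote`, the neighbour count `k`, the random choice of base sample, and the toy feature vectors are all assumptions made for illustration, and the abstract does not state the balancing ratios or parameters actually used. The sketch only shows the core idea, which is to create each synthetic minority sample by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    k: number of nearest minority neighbours considered per base sample.
    Simplification (assumption): the base sample is drawn at random for each
    synthetic point instead of looping over every minority sample in turn.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other samples

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two hypothetical features (not data from the paper).
minority_samples = [[0.10, 0.30], [0.20, 0.25], [0.15, 0.40], [0.30, 0.35]]
new_points = smote(minority_samples, n_synthetic=6, k=2)
print(len(new_points))
```

Because every synthetic point lies on a line segment between two existing minority samples, the oversampled class stays inside the region already occupied by the minority data. In a pipeline like the one the abstract describes, such a step would be applied after feature selection, so the interpolation happens in the reduced feature space.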

Author information

Authors and Affiliations

Department of Computer Science, Institute of Science (BHU), Varanasi, India
Anoop Kumar Tiwari & Karthikeyan Subbiah
Department of Mathematical Sciences, Indian Institute of Technology (BHU), Varanasi, India
Shivam Shreevastava & Tanmoy Som

Corresponding author

Correspondence to Shivam Shreevastava.

Editor information

Editors and Affiliations

Department of Mathematical Sciences, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
Debdas Ghosh
Department of Computer Science and Engineering, Haldia Institute of Technology, Haldia, West Bengal, India
Debasis Giri
Department of Mathematics, University of Central Florida, Orlando, FL, USA
Ram N. Mohapatra
Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
Kouichi Sakurai
Uşak University, Uşak, Turkey
Ekrem Savas
Department of Mathematical Sciences, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
Tanmoy Som

About this paper

Cite this paper

Tiwari, A.K., Shreevastava, S., Subbiah, K., Som, T. (2018). Enhanced Prediction for Piezophilic Protein by Incorporating Reduced Set of Amino Acids Using Fuzzy-Rough Feature Selection Technique Followed by SMOTE. In: Ghosh, D., Giri, D., Mohapatra, R., Sakurai, K., Savas, E., Som, T. (eds) Mathematics and Computing. ICMC 2018. Springer Proceedings in Mathematics & Statistics, vol 253. Springer, Singapore. https://doi.org/10.1007/978-981-13-2095-8_15

DOI: https://doi.org/10.1007/978-981-13-2095-8_15
Published: 29 September 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2094-1
Online ISBN: 978-981-13-2095-8
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
