Feature Combination Methods for Prediction of Subcellular Locations of Proteins with Both Single and Multiple Sites
Effective feature extraction plays a very important role in predicting the subcellular locations of multisite proteins. As proteome projects progress, more and more proteins are annotated with more than one subcellular location. Compared with single-site proteins, however, predicting the localization of multiplex proteins is far more difficult and complicated. To improve multisite prediction quality, it is necessary to incorporate different feature extraction methods. In this paper, we apply a feature combination method to two different datasets: the 20 dimensions of entropy density replace the first 20 dimensions of the amphiphilic pseudo amino acid composition (AmPseAAC). This differs from simple feature fusion, in which the dimensions of the individual feature vectors are merely added together. Based on this novel feature combination method, we adopt the multi-label k-nearest neighbors (ML-KNN) algorithm, together with a variant that assigns different weights to different attributes, called wML-KNN, to predict multiplex protein subcellular locations. The best overall accuracy is 61.11% on dataset S1, taken from the Virus-mPLoc predictor, and 82.03% on dataset S2, taken from Gpos-mPLoc.
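The feature combination described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: that entropy density is defined as s_i = -(p_i log p_i) / H, with p_i the frequency of amino acid i and H the Shannon entropy of the composition; that the AmPseAAC vector is precomputed with its 20 composition terms first; and that wML-KNN applies its attribute weights inside a Euclidean distance. The function names are hypothetical and chosen only for illustration.

```python
import math

# The 20 standard amino acids, in a fixed (assumed) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def entropy_density(sequence):
    """Return 20 values s_i = -(p_i * log p_i) / H (assumed definition)."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    freqs = [c / total for c in counts]
    # Each term is -p * log(p); 0 * log(0) is taken as 0 by convention.
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one residue type occurs, so the density is undefined;
        # this sketch returns zeros in that case.
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]


def combine_features(sequence, ampseaac_vector):
    """Replace the first 20 AmPseAAC dimensions with entropy density.

    The overall dimensionality stays the same, unlike fusion that
    concatenates the two feature vectors.
    """
    if len(ampseaac_vector) < 20:
        raise ValueError("AmPseAAC vector must have at least 20 dimensions")
    return entropy_density(sequence) + list(ampseaac_vector[20:])


def weighted_distance(x, y, weights):
    """Euclidean distance with one weight per attribute (assumed form)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))
```

A nearest-neighbor search built on `weighted_distance` could then feed the usual ML-KNN label statistics; how the weights are chosen is not stated in the abstract and is left open here.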
Keywords: Multisite protein subcellular localizations; The entropy density; AmPseAAC; Multi-label k-nearest neighbors algorithm; wML-KNN
This research was partially supported by the Science and Technology Foundation of University of Jinan (Grant No. XKY1402), Shandong Provincial Natural Science Foundation, China, under Grant ZR2015JL025, the Youth Project of National Natural Science Fund (Grant No. 61302128), the Youth Science and Technology Star Program of Jinan City (201406003), the Natural Science Foundation of Shandong Province (ZR2011FL022, ZR2013FL002), the Scientific Research Fund of Jinan University (XKY1410, XKY1411), the Program for Scientific research innovation team in Colleges and Universities of Shandong Province (2012–2015), and the Shandong Provincial Key Laboratory of Network Based Intelligent Computing.