Analysis of protein features and machine learning algorithms for prediction of druggable proteins
- 153 Downloads
Computational tools have been widely used in drug discovery process since they reduce the time and cost. Prediction of whether a protein is druggable is fundamental and crucial for drug research pipeline. Sequence based protein function prediction plays vital roles in many research areas. Training data, protein features selection and machine learning algorithms are three indispensable elements that drive the successfulness of the models.
In this study, we tested the performance of different combinations of protein features and machine learning algorithms, based on FDA-approved small molecules’ targets, in druggable proteins prediction.We also enlarged the dataset to include the targets of small molecules that were in experiment or clinical investigation.
We found that although the 146-d vector used by Li et al. with neuron network achieved the best training accuracy of 91.10%, overlapped 3-gram word2vec with logistic regression achieved best prediction accuracy on independent test set (89.55%) and on newly approved-targets. Enlarged dataset with targets of small molecules in experiment and clinical investigation were trained. Unfortunately, the best training accuracy was only 75.48%. In addition, we applied our models to predict potential targets for references in future study.
Our study indicates the potential ability of word2vec in the prediction of druggable protein. And the training dataset of druggable protein should not be extended to targets that are lack of verification. The target prediction package could be found on https://doi.org/github.com/pkumdl/target_prediction.
Keywordsdruggable protein drug target word2vec deep learning
This work was supported in part by the Ministry of Science and Technology of China (No. 2016YFA0502303) and the National Natural Science Foundation of China (Nos. 21673010 and 81273436). The authors would like to thank Youjun Xu, Shuaishi Gao, Qiwan Hu for discussion and advices.
- 1.The UniProt Consortium. (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169Google Scholar
- 9.Rose, P. W., Prlic, A., Altunkaya, A., Bi, C., Bradley, A. R., Christie, C. H., Costanzo, L. D., Duarte, J. M., Dutta, S. and Feng, Z. (2016) The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res., 45, D271–D281Google Scholar
- 20.Chung, Y.-A., Wu, C.-C., Shen, C.-H., Lee, H.-Y. and Lee, L.-S. (2016) Audio word2vec: unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv, 1603.00982Google Scholar
- 22.Kimothi, D., Soni, A., Biyani, P. and Hogan, J. M. (2016) Distributed Representations for Biological Sequence Analysis. arXiv:1608.05949Google Scholar