ENSEMBLE-CNN: Predicting DNA Binding Sites in Protein Sequences by an Ensemble Deep Learning Method
Detecting DNA-binding sites in proteins is essential to understanding gene regulation. A central difficulty in developing machine learning predictors for this task is class imbalance: DNA-binding sites are far fewer than non-binding sites. To address this issue, we propose a new predictor, ENSEMBLE-CNN, which integrates instance selection and bootstrapping techniques to predict DNA-binding sites from protein primary sequences under class imbalance. ENSEMBLE-CNN uses a protein's evolutionary information and sequence features as its two basic feature types and employs a sampling strategy to handle the class imbalance. Multiple base predictors, each with a CNN as its classifier, are trained by applying SMOTE and random under-sampling to the original negative dataset. The final ensemble predictor is obtained by majority voting over these base predictors. The results show that ENSEMBLE-CNN achieves high prediction accuracy and outperforms existing sequence-based predictors of protein-DNA binding sites.
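The under-sampling and majority-voting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid rule stands in for each CNN, the SMOTE step is omitted, and all function names and the toy data are hypothetical.

```python
import random


def undersample(positives, negatives, rng):
    """Keep all positives and draw an equal-sized random subset of negatives."""
    k = min(len(positives), len(negatives))
    return positives, rng.sample(negatives, k)


def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_base(positives, negatives):
    """Stand-in base predictor (assumption: nearest class centroid, not a CNN)."""
    pos_c = mean_vector(positives)
    neg_c = mean_vector(negatives)

    def predict(x):
        return 1 if sq_dist(x, pos_c) < sq_dist(x, neg_c) else 0

    return predict


def train_ensemble(positives, negatives, n_models=5, seed=0):
    """Train one base predictor per balanced, randomly under-sampled subset."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        pos, neg = undersample(positives, negatives, rng)
        models.append(train_base(pos, neg))
    return models


def majority_vote(models, x):
    """Label x as binding (1) if more than half of the base predictors say so."""
    votes = sum(m(x) for m in models)
    return 1 if 2 * votes > len(models) else 0


# Toy, made-up data: 3 "binding" vectors versus 12 "non-binding" vectors.
positives = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
negatives = [[0.05 * i, -0.05 * i] for i in range(12)]

models = train_ensemble(positives, negatives, n_models=5, seed=0)
print(majority_vote(models, [0.95, 1.0]), majority_vote(models, [0.0, 0.0]))
```

Each base predictor sees a different random subset of the majority class, so the ensemble as a whole uses more of the negative data than any single balanced training set would. An odd number of base predictors avoids tied votes.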
Keywords: Protein-DNA binding sites, Deep learning, Ensemble method, Imbalance learning
This work was supported in part by the National Natural Science Foundation of China under Grants (No. 61702058, 61772091), the China Postdoctoral Science Foundation funded project (No. 2017M612948), the Scientific Research Foundation for Advanced Talents of Chengdu University of Information Technology under Grant (No. KYTZ201717, KYTZ201715, KYTZ201750), the Scientific Research Foundation for Young Academic Leaders of Chengdu University of Information Technology under Grant (No. J201701, J201706), the Planning Foundation for Humanities and Social Sciences of Ministry of Education of China under Grant (No. 15YJAZH058), and the Innovative Research Team Construction Plan in Universities of Sichuan Province under Grant (No. 18TD0027).