ENSEMBLE-CNN: Predicting DNA Binding Sites in Protein Sequences by an Ensemble Deep Learning Method

Zhang, Yongqing; Qiao, Shaojie; Ji, Shengjie; Zhou, Jiliu

doi:10.1007/978-3-319-95933-7_37

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10955)

Included in the following conference series: International Conference on Intelligent Computing

Abstract

Detecting DNA binding sites in proteins plays an essential role in understanding gene regulation. However, a central difficulty in developing machine learning predictors of DNA binding sites in proteins is class imbalance: binding sites are far fewer than non-binding sites. To address this issue, we propose a new predictor, named ENSEMBLE-CNN, which integrates instance selection and bootstrapping techniques to predict DNA-binding sites from protein primary sequences under class imbalance. ENSEMBLE-CNN uses a protein’s evolutionary information and sequence features as its two basic feature types and employs a sampling strategy to deal with the class imbalance problem. Multiple initial predictors with CNNs as classifiers are trained by applying SMOTE and a random under-sampling technique to the original negative dataset. The final ensemble predictor is obtained by majority voting. The results demonstrate that the proposed ENSEMBLE-CNN achieves high prediction accuracy and outperforms existing sequence-based predictors of protein-DNA binding sites.
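
The sampling-plus-voting scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it rests on several assumptions. SMOTE is applied here in its standard form, to the minority (binding) class, and the majority (non-binding) class is randomly under-sampled to match. The base learner is a nearest-centroid classifier used purely as a stand-in for the CNNs. All function names, the `smote_ratio` parameter, and the number of models are hypothetical choices made for illustration.

```python
import numpy as np


def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise squared distances within the minority class.
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])


def balanced_sample(X, y, rng, smote_ratio=2.0):
    """Over-sample positives with SMOTE, then randomly under-sample
    negatives so that both classes have the same size (assumed scheme)."""
    X_pos, X_neg = X[y == 1], X[y == 0]
    n_syn = int((smote_ratio - 1.0) * len(X_pos))
    X_pos_aug = np.vstack([X_pos, smote(X_pos, n_syn, rng=rng)])
    n_keep = min(len(X_neg), len(X_pos_aug))
    keep = rng.choice(len(X_neg), size=n_keep, replace=False)
    Xb = np.vstack([X_pos_aug, X_neg[keep]])
    yb = np.concatenate([np.ones(len(X_pos_aug), dtype=int),
                         np.zeros(n_keep, dtype=int)])
    return Xb, yb


class CentroidClassifier:
    """Stand-in base learner (NOT a CNN): predicts the nearer class centroid."""

    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self

    def predict(self, X):
        d0 = ((X - self.c0) ** 2).sum(axis=1)
        d1 = ((X - self.c1) ** 2).sum(axis=1)
        return (d1 < d0).astype(int)


def train_ensemble(X, y, n_models=5, seed=0):
    """Train each base learner on its own re-sampled, balanced subset."""
    rng = np.random.default_rng(seed)
    return [CentroidClassifier().fit(*balanced_sample(X, y, rng))
            for _ in range(n_models)]


def majority_vote(models, X):
    """Label a sample positive when more than half of the models say so."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return (votes.sum(axis=0) * 2 > len(models)).astype(int)
```

Because each base learner sees a different random subset of the negatives, the ensemble as a whole uses more of the majority class than any single balanced subset would, which is the usual motivation for combining under-sampling with voting. An odd number of models avoids ties in the vote.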

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants (No. 61702058, 61772091), the China Postdoctoral Science Foundation funded project (No. 2017M612948), the Scientific Research Foundation for Advanced Talents of Chengdu University of Information Technology under Grant (No. KYTZ201717, KYTZ201715, KYTZ201750), the Scientific Research Foundation for Young Academic Leaders of Chengdu University of Information Technology under Grant (No. J201701, J201706), the Planning Foundation for Humanities and Social Sciences of Ministry of Education of China under Grant (No. 15YJAZH058), and the Innovative Research Team Construction Plan in Universities of Sichuan Province under Grant (No. 18TD0027).

Author information

Authors and Affiliations

School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
Yongqing Zhang, Shengjie Ji & Jiliu Zhou
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 610054, China
Yongqing Zhang
School of Cybersecurity, Chengdu University of Information Technology, Chengdu, 610225, China
Shaojie Qiao

Corresponding author

Correspondence to Shaojie Qiao.

Editor information

Editors and Affiliations

Tongji University, Shanghai, China
De-Shuang Huang
University of Ulsan, Ulsan, Korea (Republic of)
Kang-Hyun Jo
Wuhan University of Science and Technology, Wuhan City, China
Xiao-Long Zhang

Cite this paper

Zhang, Y., Qiao, S., Ji, S., Zhou, J. (2018). ENSEMBLE-CNN: Predicting DNA Binding Sites in Protein Sequences by an Ensemble Deep Learning Method. In: Huang, DS., Jo, KH., Zhang, XL. (eds) Intelligent Computing Theories and Application. ICIC 2018. Lecture Notes in Computer Science, vol 10955. Springer, Cham. https://doi.org/10.1007/978-3-319-95933-7_37

DOI: https://doi.org/10.1007/978-3-319-95933-7_37
Published: 06 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-95932-0
Online ISBN: 978-3-319-95933-7
eBook Packages: Computer Science, Computer Science (R0)