Machine Learning-Based Approaches Identify a Key Physicochemical Property for Accurately Predicting Polyadenylation Signals in Genomic Sequences

  • HaiBo Cui
  • Jia Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7996)


Accurately predicting poly(A) signals (PASs) is an important topic in bioinformatics, both for high-quality genome annotation and for investigating the mechanisms of transcription regulation. In this study, we identified a powerful physicochemical property of DNA sequences for computationally predicting PASs with machine learning. Based on this feature, we built a PAS prediction model that captures position-specific information from the region surrounding PASs. Cross-validation results showed that the prediction accuracies of our model on 12 categories of human PASs are comparable to those of the recently published PAS predictor Dragon PolyA Spotter. Further analysis revealed that the region 25 nucleotides downstream of PASs is the most important region for accurate PAS prediction.
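As a rough illustration of what a position-specific encoding of a dinucleotide property might look like, the following minimal sketch maps the window around a candidate PAS hexamer to one numeric value per overlapping dinucleotide. It is not the authors' implementation. The property scale is a made-up placeholder (the abstract does not name the property), and the function names, the 25-nucleotide flank on each side, and the handling of ambiguous bases are all assumptions made for the example.

```python
from itertools import product

# All 16 dinucleotides in lexicographic order: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

# HYPOTHETICAL placeholder scale: evenly spaced values in [0, 1].
# A real analysis would substitute measured values of a dinucleotide
# physicochemical property.
PLACEHOLDER_SCALE = {dn: i / 15 for i, dn in enumerate(DINUCLEOTIDES)}


def extract_window(sequence, pas_start, flank=25, motif_len=6):
    """Return the hexamer at pas_start plus `flank` nt on each side.

    The flank size is an assumption chosen for illustration.
    """
    start = pas_start - flank
    end = pas_start + motif_len + flank
    if start < 0 or end > len(sequence):
        raise ValueError("window extends past the sequence boundaries")
    return sequence[start:end]


def encode_window(window, scale=PLACEHOLDER_SCALE):
    """Encode a window as one property value per overlapping dinucleotide.

    Position i of the result describes the dinucleotide at window[i:i+2],
    so the vector keeps position-specific information. Dinucleotides with
    ambiguous bases (e.g. N) fall back to the mean of the scale.
    """
    window = window.upper()
    neutral = sum(scale.values()) / len(scale)
    return [scale.get(window[i:i + 2], neutral) for i in range(len(window) - 1)]


if __name__ == "__main__":
    toy_sequence = "C" * 25 + "AATAAA" + "G" * 25
    features = encode_window(extract_window(toy_sequence, pas_start=25))
    print(len(features))
```

Fixed-length vectors of this kind could then be passed to any standard classifier, such as a random forest, with one vector per candidate signal.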


polyadenylation site, poly(A), genomic sequence, physicochemical property, dinucleotide, random forest, machine learning, bioinformatics





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • HaiBo Cui (1)
  • Jia Wang (2)
  1. Faculty of Mathematics and Computer Science, Hubei University, Wuhan, China
  2. College of Science, Huazhong Agricultural University, Wuhan, China
