Machine Learning-Based Approaches Identify a Key Physicochemical Property for Accurately Predicting Polyadenylation Signals in Genomic Sequences
Accurately predicting poly(A) signals (PASs) is an important topic in bioinformatics, both for high-quality genome annotation and for investigating the mechanisms of transcription regulation. In this study, we identified a powerful physicochemical property of DNA sequences for computationally predicting PASs using machine learning techniques. On the basis of this feature, we built a PAS prediction model that captures position-specific information from the region surrounding PASs. Cross-validation results demonstrated that the prediction accuracies of our model on 12 categories of human PASs are comparable to those of the recently published PAS predictor Dragon PolyA Spotter. Further analysis revealed that the region 25 nucleotides downstream of PASs is the most important region for the accurate prediction of PASs.
Keywords: polyadenylation site, poly(A), genomic sequence, physicochemical property, dinucleotide, random forest, machine learning, bioinformatics
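The position-specific encoding described in the abstract can be illustrated with a minimal sketch. It assumes that the property is defined per dinucleotide (as the keywords suggest) and that each overlapping dinucleotide position in a window around a candidate PAS becomes one feature. The abstract does not name the property or give its values, so the table `PLACEHOLDER_PROPERTY` and the function `encode_window` are hypothetical and for illustration only; the numbers are arbitrary and are not the values used in the study.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical placeholder table: one arbitrary number per dinucleotide.
# These are NOT the physicochemical values from the paper; they only
# make the encoding step concrete (AA=0.0, AC=1.0, ..., TT=15.0).
PLACEHOLDER_PROPERTY = {
    a + b: float(i) for i, (a, b) in enumerate(product(BASES, repeat=2))
}


def encode_window(seq, prop=PLACEHOLDER_PROPERTY):
    """Return one property value per overlapping dinucleotide position."""
    seq = seq.upper()
    features = []
    for i in range(len(seq) - 1):
        dinuc = seq[i:i + 2]
        # Dinucleotides outside the table (e.g. containing N) fall back
        # to 0.0, an assumption made only for this sketch.
        features.append(prop.get(dinuc, 0.0))
    return features


# A short window containing the canonical AATAAA hexamer.
window = "AATAAAGT"
vector = encode_window(window)
print(len(vector), vector)
```

A window of length n yields n - 1 features, one per position, so the feature at a given index always refers to the same offset relative to the candidate signal. Vectors of this kind could then be passed to a classifier such as a random forest, which the keywords mention.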