Abstract
Controlled biomolecular annotations are key concepts in computational genomics and proteomics, since they describe the functional features of genes and their products in a way that is both simple and computable. Despite their importance, many annotations are missing, and the available ones contain errors and inconsistencies; furthermore, discovering and validating new annotations is very time-consuming. For these reasons, computer scientists have recently developed several machine-learning algorithms that computationally predict new gene-function relationships. While several of these methods have been easily adapted to bioinformatics from other domains, validating their predictions remains a challenging part of a computational pipeline. Here, we propose a validation procedure made of three sub-phases, which can assess the precision of any algorithm's predictions with a reliable degree of accuracy. We show validation results obtained for Gene Ontology annotations of Homo sapiens genes that demonstrate the effectiveness of our validation approach.
Acknowledgments
This work was partially supported by the “Data-Driven Genomic Computing (GenData 2020)” PRIN project (2013–2015), funded by Italy’s Ministry of Education, Universities and Research (MIUR). The authors thank Coby Viner (University of Toronto) for his help with the English proofreading of this article.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Chicco, D., Masseroli, M. (2016). Validation Pipeline for Computational Prediction of Genomics Annotations. In: Angelini, C., Rancoita, P., Rovetta, S. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2015. Lecture Notes in Computer Science, vol 9874. Springer, Cham. https://doi.org/10.1007/978-3-319-44332-4_18
DOI: https://doi.org/10.1007/978-3-319-44332-4_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44331-7
Online ISBN: 978-3-319-44332-4
eBook Packages: Computer Science (R0)