Abstract
Bioinformatics is now one of the most important fields of modern science, bringing together research areas such as biology, genomics, genetics and molecular evolution. These fields generate large amounts of information through next-generation sequencing (NGS) techniques. This volume of data requires a new generation of tools able to store and analyze the information efficiently and rapidly. Coffea canephora, also called Robusta coffee, is one of the most important trees for tropical countries. Its genome has recently been sequenced. One characteristic of this genome is the presence of numerous repeated elements, which represent more than 50% of the genome sequence. Analyzing and classifying such a large number of repeated sequences requires innovative approaches. Here, we present how data mining and machine learning can help process sequencing data for the fast classification of one class of repeated sequences, called transposable elements.
Notes
1. Bash is a user interface to the Unix operating system that accepts commands and generally produces text-based output [46].
2. FASTA is a standard format for sequence files in which each sequence has an identifier line beginning with the > character, followed by its nucleotides [http://www.ncbi.nlm.nih.gov/blast/fasta.shtml].
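The FASTA layout described in note 2 can be illustrated with a minimal reader. This is only a sketch, not the tooling used in the chapter: the function name `parse_fasta`, the sample identifiers `seq1` and `seq2`, and the sample nucleotides are all hypothetical, and the sketch assumes the identifier is the first whitespace-delimited token after the > character.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its concatenated nucleotides."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # Identifier line: keep the first token after '>' (assumption).
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"unnamed_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: a record may span several lines.
            records[current_id] += line.upper()
    return records


# Hypothetical two-record input, used only to exercise the parser.
example = """>seq1 illustrative fragment
ACGTACGT
acgt
>seq2
TTGACA
"""

records = parse_fasta(example.splitlines())
for identifier, sequence in records.items():
    print(identifier, len(sequence))
```

Holding every sequence in a dictionary is convenient for small files; for genome-scale inputs a generator that yields one record at a time would likely be a better fit.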
References
López-Gartner, G., Agudelo-Valencia, D., Castaño, S., Isaza, G.A., Castillo, L.F., Sánchez, M., Arango, J.: Identification of a putative ganoderic acid pathway enzyme in a Ganoderma Australe transcriptome by means of a Hidden Markov Model. In: Overbeek, R., Rocha, M.P., Fdez-Riverola, F., Paz, J.F. (eds.) 9th International Conference on Practical Applications of Computational Biology and Bioinformatics. AISC, vol. 375, pp. 107–115. Springer, Cham (2015). doi:10.1007/978-3-319-19776-0_12
Orozco, S., Jeferson, A.: Aplication of artificial intelligence in bioinformatics, advances, definitions and tools. UGCiencia 22, 159–171 (2016)
Castillo, L.F., López-gartner, G., Isaza, G.A., Sánchez, M.: GITIRBio: a semantic and distributed service oriented-architecture for bioinformatics pipeline. J. Integr. Bioinform. 12, 1–15 (2015)
Blankenberg, D., Von Kuster, G., Coraor, N., Ananda, G., Lazarus, R., Mangan, M., Nekrutenko, A., Taylor, J.: Galaxy: a web-based genome analysis tool for experimentalists. Curr. Protoc. Mol. Biol. 1–21 (2010)
Sumathi, S., Sivanandam, S.N.: Introduction to Data Mining Principles. Springer, Heidelberg (2006). doi:10.1007/978-3-540-34351-6
Markov, Z., Russell, I.: An introduction to the WEKA data mining system. ACM SIGCSE Bull. 38, 367–368 (2006)
Jaffar, J., Michaylov, S., Stuckey, P.J., Yap, R.H.C.: The CLP(R) language and system. ACM Trans. Program. Lang. Syst. 14, 339 (1992)
Guyot, R., Darré, T., Dupeyron, M., de Kochko, A., Hamon, S., Couturon, E., Crouzillat, D., Rigoreau, M., Rakotomalala, J.J., Raharimalala, N.E., Akaffou, S.D., Hamon, P.: Partial sequencing reveals the transposable element composition of Coffea genomes and provides evidence for distinct evolutionary stories. Mol. Genet. Genomics 291, 1979–1990 (2016)
Muszewska, A., Hoffman-Sommer, M., Grynberg, M.: LTR retrotransposons in fungi. PLoS One 6 (2011)
Beulé, T., Agbessi, M.D., Dussert, S., Jaligot, E., Guyot, R.: Genome-wide analysis of LTR-retrotransposons in oil palm. BMC Genom. 16, 1–14 (2015)
Denoeud, F., Carretero-Paulet, L., Dereeper, A., Droc, G., Guyot, R., Pietrella, M., Zheng, C., Alberti, A., Anthony, F., Aprea, G., Aury, J.-M., Bento, P., Bernard, M., Bocs, S., Campa, C., Cenci, A., Combes, M.-C., Crouzillat, D., Da Silva, C., Daddiego, L., De Bellis, F., Dussert, S., Garsmeur, O., Gayraud, T., Guignon, V., Jahn, K., Jamilloux, V., Joët, T., Labadie, K., Lan, T., Leclercq, J., Lepelley, M., Leroy, T., Li, L.-T., Librado, P., Lopez, L., Muñoz, A., Noel, B., Pallavicini, A., Perrotta, G., Poncet, V., Pot, D., Priyono, Rigoreau, M., Rouard, M., Rozas, J., Tranchant-Dubreuil, C., VanBuren, R., Zhang, Q., Andrade, A.C., Argout, X., Bertrand, B., de Kochko, A., Graziosi, G., Henry, R.J., Jayarama, Ming, R., Nagai, C., Rounsley, S., Sankoff, D., Giuliano, G., Albert, V.A., Wincker, P., Lashermes, P.: The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science 345, 1181–1184 (2014)
Chaparro, C., Gayraud, T., De Souza, R.F., Domingues, D.S., Akaffou, S., Vanzela, A.L.L., De Kochko, A., Rigoreau, M., Crouzillat, D., Hamon, S., Hamon, P., Guyot, R.: Terminal-repeat retrotransposons with GAG domain in plant genomes: a new testimony on the complex world of transposable elements. Genome Biol. Evol. 7, 493–504 (2015)
Guyot, R., de la Mare, M., Viader, V., Hamon, P., Coriton, O., Bustamante-porras, J., Poncet, V., Campa, C., Hamon, S., de Kochko, A.: Microcollinearity in an ethylene receptor coding gene region of the Coffea canephora genome is extensively conserved with Vitis vinifera and other distant dicotyledonous sequenced genomes. BMC Plant Biol. 9, 1–15 (2009)
Esteves Vieira, L.G., Andrade, A.C., Colombo, C.A., De Araújo Moraes, A.H., Metha, Â., De Oliveira, A.C., Labate, C.A., Marino, C.L., Monteiro-Vitorello, C.D.B., Monte, D.D.C., Giglioti, É., Kimura, E.T., Romano, E., Kuramae, E.E., Macedo Lemos, E.G., Pereira De Almeida, E.R., Jorge, É.C., Albuquerque, É.V.S., Da Silva, F.R., Da Vinecky, F., Sawazaki, H.E., Dorry, H.F.A., Carrer, H., Abreu, I.N., Batista, J.A.N., Teixeira, J.B., Kitajima, J.P., Xavier, K.G., De Lima, L.M., Aranha De Camargo, L.E., Protasio Pereira, L.F., Coutinho, L.L., Franco Lemos, M.V., Romano, M.R., Machado, M.A., Do Carmo Costa, M.M., Grossi De Sá, M.F., Goldman, M.H.S., Ferro, M.I.T., Penha Tinoco, M.L., Oliveira, M.C., Van Sluys, M.A., Shimizu, M.M., Maluf, M.P., Souza Da Eira, M.T., Guerreiro Filho, O., Arruda, P., Mazzafera, P., Correa Mariani, P.D.S., De Oliveira, R.L.B.C., Harakava, R., Balbao, S.F., Siu, M.T., Zingaretti Di Mauro, S.M., Santos, S.N., Siqueira, W.J., Lacerda Costa, G.G., Formighieri, E.F., Carazzolle, M.F., Guimarães Pereira, G.A.: Brazilian coffee genome project: An EST-based genomic resource. Brazilian J. Plant Physiol. 18, 95–108 (2006)
Dereeper, A., Guyot, R., Tranchant-Dubreuil, C., Anthony, F., Argout, X., de Bellis, F., Combes, M.C., Gavory, F., de Kochko, A., Kudrna, D., Leroy, T., Poulain, J., Rondeau, M., Song, X., Wing, R., Lashermes, P.: BAC-end sequences analysis provides first insights into coffee (Coffea canephora P.) genome composition and evolution. Plant Mol. Biol. 83, 177–189 (2013)
Leroy, T., Marraccini, P., Dufour, M., Montagnon, C., Lashermes, P., Sabau, X., Ferreira, L.P., Jourdan, I., Pot, D., Andrade, A.C., Glaszmann, J.C., Vieira, L.G.E., Piffanelli, P.: Construction and characterization of a Coffea canephora BAC library to study the organization of sucrose biosynthesis genes. Theor. Appl. Genet. 111, 1032–1041 (2005)
Yu, Q., Guyot, R., De Kochko, A., Byers, A., Navajas-Pérez, R., Langston, B.J., Dubreuil-Tranchant, C., Paterson, A.H., Poncet, V., Nagai, C., Ming, R.: Micro-collinearity and genome evolution in the vicinity of an ethylene receptor gene of cultivated diploid and allotetraploid coffee species (Coffea). Plant J. 67, 305–317 (2011)
Llorens, C., Futami, R., Covelli, L., Domínguez-Escribá, L., Viu, J.M., Tamarit, D., Aguilar-Rodríguez, J., Vicente-Ripolles, M., Fuster, G., Bernet, G.P., et al.: The Gypsy Database (GyDB) of mobile genetic elements: release 2.0. Nucleic Acids Res. (2010). doi:10.1093/nar/gkq1061
Wicker, T., Sabot, F., Hua-Van, A., Bennetzen, J.L., Capy, P., Chalhoub, B., Flavell, A., Leroy, P., Morgante, M., Panaud, O., Paux, E., SanMiguel, P., Schulman, A.H.: A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007)
Witte, C.-P., Le, Q.H., Bureau, T., Kumar, A.: Terminal-repeat retrotransposons in miniature (TRIM) are involved in restructuring plant genomes. Proc. Natl. Acad. Sci. 98, 13778–13783 (2001)
Kalendar, R., Vicient, C.M., Peleg, O., Anamthawat-Jonsson, K., Bolshoy, A., Schulman, A.H.: Large retrotransposon derivatives: abundant, conserved but nonautonomous retroelements of barley and related genomes. Genetics 166, 1437–1450 (2004)
Tanskanen, J.A., Sabot, F., Vicient, C., Schulman, A.H.: Life without GAG: the BARE-2 retrotransposon as a parasite’s parasite. Gene 390, 166–174 (2007)
Quesneville, H., Bergman, C.M., Andrieu, O., Autard, D., Nouaud, D., Ashburner, M., Anxolabehere, D.: Combined evidence annotation of transposable elements in genome sequences. PLoS Comput. Biol. 1, 166–175 (2005)
Price, A.L., Jones, N.C., Pevzner, P.A.: De novo identification of repeat families in large genomes. Bioinformatics 21, 351–358 (2005)
Ellinghaus, D., Kurtz, S., Willhoeft, U.: LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinform. 9, 18 (2008)
McCarthy, E.M., McDonald, J.F.: LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19, 362–367 (2003)
Xu, Z., Wang, H.: LTR-FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, 265–268 (2007)
Disdero, E., Filée, J.: LoRTE: detecting transposon-induced genomic variants using low coverage PacBio long read sequences. Mob. DNA 8, 5 (2017)
Zeng, F.-C., Zhao, Y.-J., Zhang, Q.-J., Gao, L.-Z.: LTRtype, an efficient tool to characterize structurally complex LTR retrotransposons and nested insertions on genomes. Front. Plant Sci. 8, 1–9 (2017)
Hoede, C., Arnoux, S., Moisset, M., Chaumier, T., Inizan, O., Jamilloux, V., Quesneville, H.: PASTEC: an automatic transposable element classification tool. PLoS One 9, 1–6 (2014)
Steinbiss, S., Kastens, S., Kurtz, S.: LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons. Mob. DNA. 3, 18 (2012)
Du, J., Tian, Z., Hans, C.S., Laten, H.M., Cannon, S.B., Jackson, S.A., Shoemaker, R.C., Ma, J.: Evolutionary conservation, diversity and specificity of LTR-retrotransposons in flowering plants: insights from genome-wide analysis and multi-specific comparison. Plant J. 63, 584–598 (2010)
Vitte, C., Bennetzen, J.L.: Analysis of retrotransposon structural diversity uncovers properties and propensities in angiosperm genome evolution. Proc. Natl. Acad. Sci. 103, 17638–17643 (2006)
Dupeyron, M., de Souza, R.F., Hamon, P., de Kochko, A., Crouzillat, D., Couturon, E., Domingues, D.S., Guyot, R.: Distribution of Divo in Coffea genomes, a poorly described family of angiosperm LTR-Retrotransposons. Mol. Genet. Genomics 292, 741–754 (2017)
Zhang, Q.-J., Gao, L.-Z.: Rapid and recent evolution of LTR retrotransposons drives rice genome evolution during the speciation of AA-genome Oryza species. G3 Genes Genomes Genet. 7, 1875–1885 (2017)
Llorens, C., Muñoz-Pomer, A., Bernad, L., Botella, H., Moya, A.: Network dynamics of eukaryotic LTR retroelements beyond phylogenetic trees. Biol. Direct. 4, 41 (2009)
Garavito, A., Montagnon, C., Guyot, R., Bertrand, B.: Identification by the DArTseq method of the genetic origin of the Coffea canephora cultivated in Vietnam and Mexico. BMC Plant Biol. 16, 242 (2016)
Carneiro, F.A., Rego, E., Aquino, S.O., Costa, T.S., Lima, E.A., Rocha, O.C., Rodrigues, G.C., Carvalho, M.A.F., Veiga, A.D., Guerra, A.F., et al.: Genome wide association study for drought tolerance and other agronomic traits of a Coffea canephora population (2015)
Babova, O., Occhipinti, A., Maffei, M.E.: Chemical partitioning and antioxidant capacity of green coffee (Coffea arabica and Coffea canephora) of different geographical origin. Phytochemistry 123, 33–39 (2016)
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. AI Mag. 17, 37–54 (1996)
Denoeud, F., Carretero-Paulet, L., Dereeper, A., Droc, G., Guyot, R., Pietrella, M., Zheng, C., Alberti, A., Anthony, F., Aprea, G., et al.: The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science 345(6201), 1181–1184 (2014)
Rice, P., Longden, I., Bleasby, A.: EMBOSS: the European molecular biology open software suite. Trends Genet. 16, 276–277 (2000)
Jurka, J., Klonowski, P., Dagman, V., Pelton, P.: CENSOR—a program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem. 20, 119–121 (1996)
Moine, J.M.: Metodologías para el descubrimiento de conocimiento en bases de datos: un estudio comparativo (2013)
Carreño, J.A.: Descubrimiento de conocimiento en los negocios (2008)
Newham, C., Rosenblatt, B.: Learning the Bash Shell: Unix Shell Programming. O’Reilly Media Inc., Sebastopol (2005)
Acknowledgements
We thank the Centro de Bioinformática y Biología Computacional BIOS for the use of its supercomputer to process the dataset.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Arango-López, J., Orozco-Arias, S., Salazar, J.A., Guyot, R. (2017). Application of Data Mining Algorithms to Classify Biological Data: The Coffea canephora Genome Case. In: Solano, A., Ordoñez, H. (eds) Advances in Computing. CCC 2017. Communications in Computer and Information Science, vol 735. Springer, Cham. https://doi.org/10.1007/978-3-319-66562-7_12
DOI: https://doi.org/10.1007/978-3-319-66562-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66561-0
Online ISBN: 978-3-319-66562-7
eBook Packages: Computer Science, Computer Science (R0)