Abstract
The analysis of genomic data has yielded information about genetic variants and expression patterns correlated with specific physical traits. In recent decades, these analyses have evolved toward examining thousands of entities simultaneously, and they have produced an enormous body of biomedical literature reporting associations between genes and diseases. Pattern recognition techniques have proved useful in this setting, so a review of how they have been applied is relevant. In this chapter, we first present a brief introduction to high-throughput sequencing methodologies. We then describe the identification of genomic variants and gene expression profiles that have been used for the diagnosis of diseases, followed by a general overview of gene-disease association extraction from the biomedical literature.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Jiménez-Jacinto, V., Gómez-Romero, L., Méndez-Cruz, CF. (2020). Pattern Recognition Applied to the Analysis of Genomic Data and Its Association to Diseases. In: Ortiz-Posadas, M. (eds) Pattern Recognition Techniques Applied to Biomedical Problems. STEAM-H: Science, Technology, Engineering, Agriculture, Mathematics & Health. Springer, Cham. https://doi.org/10.1007/978-3-030-38021-2_2
DOI: https://doi.org/10.1007/978-3-030-38021-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38020-5
Online ISBN: 978-3-030-38021-2
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)