Abstract
Data integration is one of the most challenging research topics in many knowledge domains, and biology is surely one of them. However, the underlying theory and state-of-the-art methods make this task complex for most small research centers. Fortunately, several organizations are focusing on collecting heterogeneous data, which makes it easier to design analysis tools and to test biological and medical hypotheses on integrated data. One of the most evident examples of such efforts is The Cancer Genome Atlas (TCGA), a database that contains a large variety of information related to different types of cancer. This database offers a great opportunity to those interested in analyzing integrated data; however, exploiting it is not easy, since nontrivial effort is required to extract and combine the data before they can be analyzed from an integrated perspective. In this paper we present IRIS-TCGA, an online web service developed to perform multiple queries for data integration on TCGA. Unlike other tools that have been proposed to interact with TCGA, IRIS-TCGA allows direct access to the data and makes it possible to extract detailed combinations of subsets of the repository, according to filters and high-order queries. The structure of the system is simple, as it is built on two main operators, union and intersection, which are then used to construct queries of higher complexity. The first version of the system supports the extraction and integration of gene expression (RNA-sequencing, microarrays), DNA-methylation, and DNA-sequencing (mutations) data from experiments on tissues of patients, together with their related metadata, in a gene-oriented organization. The extracted data matrices are particularly suited for data mining applications (e.g., classification).
Finally, we show two application examples, where IRIS-TCGA is used for integrating genomic data from RNA-sequencing and DNA-methylation experiments, and where state-of-the-art bioinformatics analysis tools are applied to the integrated data in order to extract new knowledge from them. IRIS-TCGA is freely available at http://bioinf.iasi.cnr.it/iristcga/.
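To make the role of the two operators concrete, the following minimal Python sketch shows how union and intersection over patient sets could be combined to build an integrated, gene-oriented matrix. It is an illustration only and does not reproduce the IRIS-TCGA implementation or its interface: the function names (`union`, `intersection`, `integrate`), the patient identifiers, the gene values, and the `experiment:gene` feature naming are all assumptions introduced here.

```python
from typing import Dict, Set

# Hypothetical in-memory layout: patient id -> gene symbol -> measured value.
Matrix = Dict[str, Dict[str, float]]


def union(*sample_sets: Set[str]) -> Set[str]:
    """Patients that appear in at least one of the given experiments."""
    result: Set[str] = set()
    for samples in sample_sets:
        result |= samples
    return result


def intersection(first: Set[str], *rest: Set[str]) -> Set[str]:
    """Patients that appear in every one of the given experiments."""
    result = set(first)
    for samples in rest:
        result &= samples
    return result


def integrate(experiments: Dict[str, Matrix], samples: Set[str]) -> Matrix:
    """Build one row per selected patient, with features named 'experiment:gene'."""
    rows: Matrix = {}
    for sample in sorted(samples):
        row: Dict[str, float] = {}
        for exp_name, matrix in experiments.items():
            # A patient missing from an experiment simply contributes no features.
            for gene, value in matrix.get(sample, {}).items():
                row[f"{exp_name}:{gene}"] = value
        rows[sample] = row
    return rows


# Toy data (invented values, not taken from TCGA).
rnaseq: Matrix = {
    "P1": {"TP53": 12.5, "BRCA1": 3.1},
    "P2": {"TP53": 8.0, "BRCA1": 4.2},
}
meth: Matrix = {
    "P2": {"TP53": 0.35, "BRCA1": 0.8},
    "P3": {"TP53": 0.6, "BRCA1": 0.1},
}

both = intersection(set(rnaseq), set(meth))   # patients with both data types
either = union(set(rnaseq), set(meth))        # patients with any data type
integrated = integrate({"rnaseq": rnaseq, "meth": meth}, both)
```

In this sketch, a higher-order query is just a nested call, for example `intersection(union(a, b), c)`; the resulting patient set then selects the rows of the integrated matrix, which could be passed on to a classifier.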
Acknowledgments
The results reported here are based upon the data generated by the TCGA Research Network: http://cancergenome.nih.gov/.
Funding
The work was financially supported by the SysBioNet, Italian Roadmap Research Infrastructure, and the Epigenomics Flagship Project EPIGEN [PB.P01].
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Cumbo, F., Weitschek, E., Bertolazzi, P., Felici, G. (2017). IRIS-TCGA: An Information Retrieval and Integration System for Genomic Data of Cancer. In: Bracciali, A., Caravagna, G., Gilbert, D., Tagliaferri, R. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2016. Lecture Notes in Computer Science, vol 10477. Springer, Cham. https://doi.org/10.1007/978-3-319-67834-4_13
DOI: https://doi.org/10.1007/978-3-319-67834-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67833-7
Online ISBN: 978-3-319-67834-4
eBook Packages: Computer Science, Computer Science (R0)