Challenges and Cases of Genomic Data Integration Across Technologies and Biological Scales

Samarajiwa, Shamith A.; Olan, Ioana; Bihary, Dóra

doi:10.1007/978-3-319-77911-9_12

Chapter
First Online: 21 April 2018

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 93)

Abstract

Current technological advancements have facilitated novel experimental methods that measure a diverse assortment of biological processes, creating a data deluge in biology and medicine. This proliferation of data sources, from large repositories and data warehouses to specialist databases that store a variety of data types and contribute to a multitude of file formats, has necessitated minimal data standards that describe both data and annotation. In addition to integration at the data resource level, the development of integrative computational or statistical methods that explore two or more data types or biological layers to understand their joint influence can lead to a better understanding of both normal and pathological processes. Combining these different data layers, in turn, enables us to glean a more integrative understanding of complex biological systems. The development of integrative methods that bridge both biology and technology can provide insight into different scales of gene and genome regulation. Some of these integrative approaches and their applications are explored in this chapter in the context of modern genomics.

Notes

1. Batch effects are sources of technical variation introduced into samples during handling and processing, such as when samples belonging to the same experiment are processed at different times, with different reagent batches, on different machines or by different people.
2. The epigenome consists of a collection of chemical compounds that tell the DNA what to do. These can attach to DNA or to proteins associated with DNA and regulate gene activity without changing the DNA sequence.
3. Chromatin consists of DNA, the disk-like nucleosomes that DNA spools around for efficient packaging, non-coding RNA and other DNA-associated accessory proteins. When epigenomic compounds attach to chromatin, they are said to have “marked” the genome. These modifications do not change the sequence of the DNA; instead, they change how cells use the information encoded by DNA.

Author information

Authors and Affiliations

MRC Cancer Unit, University of Cambridge, Cambridge, CB2 0XZ, UK
Shamith A. Samarajiwa & Dóra Bihary
Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, CB2 0RE, UK
Ioana Olan

Corresponding author

Correspondence to Shamith A. Samarajiwa.

Editor information

Editors and Affiliations

Computer Science Department, Furman University, Greenville, South Carolina, USA
Philippe J. Giabbanelli
Department of Computer Science, Lakehead University, Thunder Bay, Ontario, Canada
Vijay K. Mago
Department of Computer Engineering, Technological Educational Institute, Lamia, Greece
Elpiniki I. Papageorgiou

About this chapter

Cite this chapter

Samarajiwa, S.A., Olan, I., Bihary, D. (2018). Challenges and Cases of Genomic Data Integration Across Technologies and Biological Scales. In: Giabbanelli, P., Mago, V., Papageorgiou, E. (eds) Advanced Data Analytics in Health. Smart Innovation, Systems and Technologies, vol 93. Springer, Cham. https://doi.org/10.1007/978-3-319-77911-9_12

DOI: https://doi.org/10.1007/978-3-319-77911-9_12
Published: 21 April 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77910-2
Online ISBN: 978-3-319-77911-9
eBook Packages: Engineering, Engineering (R0)
