Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data

  • Conference paper
Conceptual Modeling (ER 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10650)

Abstract

Many repositories of open data for genomics, collected by worldwide consortia, are important enablers of biological research; moreover, all experimental datasets leading to publications in genomics must be deposited in public repositories and made available to the research community. These datasets are typically used by biologists for validating or enriching their experiments; their content is documented by metadata. However, the emphasis on data sharing is not matched by accuracy in data documentation; metadata are not standardized across the sources and are often unstructured and incomplete.

In this paper, we propose a conceptual model of genomic metadata, whose purpose is to query the underlying data sources for locating relevant experimental datasets. First, we analyze the most typical metadata attributes of genomic sources and define their semantic properties. Then, we use a top-down method for building a global-as-view integrated schema, by abstracting the most important conceptual properties of genomic sources. Finally, we describe the validation of the conceptual model by mapping it to three well-known data sources: TCGA, ENCODE, and Gene Expression Omnibus.
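
The global-as-view idea can be illustrated with a minimal sketch, which is not the schema proposed in the paper. It assumes a tiny integrated schema with two descriptive attributes (`technique`, `tissue`) and defines each source's contribution as a view over that source's own records. All record field names (`accession`, `assay`, `biosample`, `barcode`, `data_type`, `site`, `gsm`, `library`), the toy records, and the helper names `GlobalItem`, `global_view`, and `locate` are hypothetical placeholders, not the actual metadata attributes of TCGA, ENCODE, or GEO.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class GlobalItem:
    """One row of the (hypothetical) integrated schema."""
    source: str
    item_id: str
    technique: Optional[str]
    tissue: Optional[str]


# Toy source records; every field name here is an invented placeholder.
SOURCE_RECORDS: Dict[str, List[dict]] = {
    "ENCODE": [{"accession": "E1", "assay": "ChIP-seq", "biosample": "liver"}],
    "TCGA": [{"barcode": "T1", "data_type": "RNA-seq", "site": "breast"}],
    "GEO": [{"gsm": "G1", "library": "ChIP-seq"}],  # tissue not documented
}

# Global-as-view: each global attribute is defined as an expression
# over the attributes of one source.
VIEWS: Dict[str, Callable[[dict], GlobalItem]] = {
    "ENCODE": lambda r: GlobalItem(
        "ENCODE", r["accession"], r.get("assay"), r.get("biosample")
    ),
    "TCGA": lambda r: GlobalItem(
        "TCGA", r["barcode"], r.get("data_type"), r.get("site")
    ),
    "GEO": lambda r: GlobalItem(
        "GEO", r["gsm"], r.get("library"), r.get("tissue")
    ),
}


def global_view() -> List[GlobalItem]:
    """Materialize the integrated view by applying each source's mapping."""
    return [
        VIEWS[name](record)
        for name, records in SOURCE_RECORDS.items()
        for record in records
    ]


def locate(
    technique: Optional[str] = None, tissue: Optional[str] = None
) -> List[GlobalItem]:
    """Find items matching the given global attributes (None = any value)."""
    return [
        item
        for item in global_view()
        if (technique is None or item.technique == technique)
        and (tissue is None or item.tissue == tissue)
    ]
```

Under these assumptions, a query such as `locate(technique="ChIP-seq")` is phrased once against the integrated schema and answered from every source whose view yields a matching item. Incomplete metadata surfaces as `None` in the global view rather than as a source-specific error.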

Notes

  1. https://software.broadinstitute.org/firecloud/.

  2. Data-Driven Genomic Computing, http://www.bioinformatics.deib.polimi.it/geco/, ERC Advanced Grant, 2016–2021.

  3. At https://www.encodeproject.org/profiles/graph.svg see the conceptual model of ENCODE, an ER schema with tens of entities and hundreds of relationships, which is neither readable nor supported by metadata for most concepts.

  4. http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/.

  5. We will use the BRENDA Tissue and Enzyme Source Ontology [32] for tissues, the Cell Line Ontology [31] for cell lines, and the Human Disease Ontology [33] for human diseases.

  6. http://www.uniprot.org/uniprot/.

  7. Textual analysis to extract semantic information from the GEO repository is reported in [12]; we plan to reuse their library.

  8. The metadata is provided in the NCI Genomic Data Commons portal, https://docs.gdc.cancer.gov/Data_Dictionary/viewer/.

  9. GEO information can be retrieved through the R package GEOmetadb [37].

References

  1. Adams, D., et al.: BLUEPRINT to decode the epigenetic signature written in blood. Nat. Biotechnol. 30(3), 224–226 (2012)

  2. Albrecht, F., et al.: DeepBlue epigenomic data server: programmatic data retrieval and analysis of epigenome. Nucleic Acids Res. 44(W1), W581–W586 (2016)

  3. Barrett, T., et al.: BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 40(D1), 57–63 (2012)

  4. Barrett, T., et al.: NCBI GEO: archive for functional genomics data sets – update. Nucleic Acids Res. 41(Database issue), D991–D995 (2013)

  5. Bornberg-Bauer, E., Paton, N.W.: Conceptual data modelling for bioinformatics. Brief. Bioinform. 3(2), 166–180 (2002)

  6. Buneman, P., et al.: A data transformation system for biological data sources. In: International Conference on Very Large Data Bases, pp. 158–169 (1995)

  7. Cumbo, F., et al.: TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas. BMC Bioinform. 18(6), 1–9 (2017)

  8. Davidson, S.B., et al.: Biokleisli: a digital library for biomedical researchers. Int. J. Digit. Libr. 1(1), 36–53 (1997)

  9. Davidson, S.B., et al.: K2/Kleisli and GUS: experiments in integrated access to genomic data sources. IBM Syst. J. 40(2), 512–531 (2001)

  10. El-Ghalayini, H., et al.: Deriving conceptual data models from domain ontologies for bioinformatics. In: 2006 2nd Information and Communication Technologies, ICTTA 2006, vol. 2, pp. 3562–3567 (2006)

  11. Fernández, J.D., et al.: Ontology-based search of genomic metadata. IEEE/ACM Trans. Comput. Biol. Bioinform. 13(2), 233–247 (2016)

  12. Galeota, E., Pelizzola, M.: Ontology-based annotations and semantic relations in large-scale (epi)genomics data. Brief. Bioinform. 18(3), 403–412 (2017)

  13. Haider, S., et al.: BioMart Central Portal - unified access to biological data. Nucleic Acids Res. 37(Web Server issue), 23–27 (2009)

  14. Hernandez, T., Kambhampati, S.: Integration of biological sources: current systems and challenges ahead. SIGMOD Rec. 33(3), 51–60 (2004)

  15. Idrees, M., et al.: A review: conceptual data models for biological domain. JAPS, J. Anim. Plant Sci. 25(2), 337–345 (2015)

  16. Ji, F., Elmasri, R., et al.: Incorporating concepts for bioinformatics data modeling into EER models. In: ACS/IEEE International Conference on Computer Systems and Applications, pp. 189–192. IEEE Computer Society, Washington, DC, USA (2005)

  17. Kaitoua, A., Pinoli, P., Bertoni, M., Ceri, S.: Framework for supporting genomic operations. IEEE Trans. Comput. 66(3), 443–457 (2017)

  18. Keet, M.C.: Biological data and conceptual modelling method. J. Concept. Model. 29(1), 1–14 (2003)

  19. Kundaje, A., et al.: Integrative analysis of 111 reference human epigenomes. Nature 518(7539), 317–330 (2015)

  20. Lenzerini, M.: Data integration: a theoretical perspective. In: Symposium on Principles of Database Systems, PODS, pp. 233–246. ACM, New York, NY, USA (2002)

  21. Louie, B., et al.: Data integration and genomic medicine. J. Biomed. Inform. 40(1), 5–16 (2007)

  22. Masseroli, M., Canakoglu, A., Ceri, S.: Integration and querying of genomic and proteomic semantic annotations for biomedical knowledge extraction. IEEE/ACM Trans. Comput. Biol. Bioinform. 13(2), 209–219 (2016)

  23. Masseroli, M., et al.: GenoMetric Query Language: a novel approach to large-scale genomic data management. Bioinformatics 31(12), 1881–1888 (2015)

  24. Masseroli, M., et al.: Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying. Methods 111, 3–11 (2016)

  25. Rechenmann, F.: Data modeling: the key to biological data integration. EMBnet. J. 18(B), 59–60 (2012)

  26. Anonymous paper. Accelerating bioinformatics research with new software for big data to knowledge (BD2K), Paradigm4, April 2015. www.paradigm4.com

  27. Consortium 1000Genomes: A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061–1073 (2010)

  28. Consortium ENCODE: An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57–74 (2012)

  29. Reyes Román, J.F., Pastor, Ó., Casamayor, J.C., Valverde, F.: Applying conceptual modeling to better understand the human genome. In: Comyn-Wattiau, I., Tanaka, K., Song, I.-Y., Yamamoto, S., Saeki, M. (eds.) ER 2016. LNCS, vol. 9974, pp. 404–412. Springer, Cham (2016). doi:10.1007/978-3-319-46397-1_31

  30. Roy, A., et al.: Massively parallel processing of whole genome sequence data: an in-depth performance study. In: Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD 2017, Chicago, Illinois, USA, 14–19 May 2017, pp. 187–202. ACM, New York (2017)

  31. Sarntivijai, S., et al.: CLO: the cell line ontology. J. Biomed. Semant. 5(1), 37 (2014)

  32. Schomburg, I., et al.: BRENDA in 2013: new options and contents in BRENDA. Nucleic Acids Res. 41(Database issue), D764–D772 (2013)

  33. Schriml, L.M., et al.: Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res. 40(Database issue), 940–946 (2012)

  34. Smedley, D., et al.: The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43(W1), 589–598 (2015)

  35. Wang, L., et al.: BioStar models of clinical and genomic data for biomedical data warehouse design. Int. J. Bioinform. Res. Appl. 1(1), 63–80 (2005)

  36. Weinstein, J.N., et al.: The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45(10), 1113–1120 (2013)

  37. Zhu, Y., et al.: GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics 24(23), 2798–2800 (2008)

Acknowledgement

This research is funded by the ERC Advanced Grant project GeCo (Data-Driven Genomic Computing), 2016–2021.

Author information

Corresponding author

Correspondence to Anna Bernasconi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Bernasconi, A., Ceri, S., Campi, A., Masseroli, M. (2017). Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data. In: Mayr, H., Guizzardi, G., Ma, H., Pastor, O. (eds) Conceptual Modeling. ER 2017. Lecture Notes in Computer Science, vol 10650. Springer, Cham. https://doi.org/10.1007/978-3-319-69904-2_26

  • DOI: https://doi.org/10.1007/978-3-319-69904-2_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69903-5

  • Online ISBN: 978-3-319-69904-2

  • eBook Packages: Computer Science, Computer Science (R0)
