Aligning Biomedical Metadata with Ontologies Using Clustering and Embeddings

  • Conference paper

Published in: The Semantic Web (ESWC 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11503)

Abstract

Metadata about scientific experiments published in online repositories have been shown to suffer from a high degree of representational heterogeneity: there are often many ways to represent the same type of information, such as a geographical location via its latitude and longitude. To harness the potential that metadata have for discovering scientific data, it is crucial that they be represented in a uniform way that can be queried effectively. One step toward uniformly represented metadata is to normalize the multiple, distinct field names (e.g., lat lon, lat and long) that are used in metadata to describe the same type of value. To that end, we present a new method based on clustering and embeddings (i.e., vector representations of words) to align metadata field names with ontology terms. We apply our method to biomedical metadata by generating embeddings for terms in biomedical ontologies from the BioPortal repository. We carried out a comparative study between our method and the NCBO Annotator, which revealed that our method yields more alignments between metadata and ontology terms, and substantially better ones.
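
To make the idea of aligning field names through clustering and vector similarity concrete, the following is a minimal, self-contained sketch. It is not the method evaluated in the paper: character trigram counts stand in for learned word embeddings, a greedy threshold rule stands in for a real clustering algorithm, and the function names (`ngram_vector`, `cluster_field_names`, `align_cluster`), the similarity threshold, and the ontology labels are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Toy stand-in for an embedding: character n-gram counts of the name.

    Assumption: a real pipeline would look up learned word embeddings instead.
    """
    normalized = "".join(ch for ch in text.lower() if ch.isalnum())
    padded = f"#{normalized}#"
    return Counter(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_field_names(names, threshold=0.3):
    """Greedy clustering: a name joins the first cluster whose seed
    (first member) is at least `threshold` similar, else starts a new one.

    The threshold is a hypothetical value chosen for this toy example.
    """
    clusters = []
    for name in names:
        vec = ngram_vector(name)
        for members in clusters:
            if cosine(vec, ngram_vector(members[0])) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def align_cluster(members, ontology_labels):
    """Pick the ontology label with the highest mean similarity to the cluster."""
    def mean_similarity(label):
        label_vec = ngram_vector(label)
        total = sum(cosine(ngram_vector(m), label_vec) for m in members)
        return total / len(members)
    return max(ontology_labels, key=mean_similarity)


# Hypothetical field names and ontology labels, for illustration only.
field_names = ["lat lon", "lat and long", "tissue", "tissue type"]
labels = ["latitude", "tissue", "age"]

for members in cluster_field_names(field_names):
    print(members, "->", align_cluster(members, labels))
```

Grouping the field names before matching lets every variant in a cluster contribute evidence toward the same ontology term, so a terse name such as lat lon can be aligned with the help of its more descriptive neighbours.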

Acknowledgments

This work is supported by grant U54 AI117925 awarded by the U.S. National Institute of Allergy and Infectious Diseases (NIAID) through funds provided by the Big Data to Knowledge (BD2K) initiative. BioPortal has been supported by the NIH Common Fund under grant U54 HG004028.

We thank the experts on our evaluation panel for their participation: John Graybeal, Josef Hardi, Marcos Martínez-Romero, and Csongor Nyulas, all from the Center for Biomedical Informatics Research at Stanford University.

Author information

Corresponding author

Correspondence to Rafael S. Gonçalves.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Gonçalves, R.S., Kamdar, M.R., Musen, M.A. (2019). Aligning Biomedical Metadata with Ontologies Using Clustering and Embeddings. In: Hitzler, P., et al. The Semantic Web. ESWC 2019. Lecture Notes in Computer Science, vol 11503. Springer, Cham. https://doi.org/10.1007/978-3-030-21348-0_10

  • DOI: https://doi.org/10.1007/978-3-030-21348-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-21347-3

  • Online ISBN: 978-3-030-21348-0

  • eBook Packages: Computer Science, Computer Science (R0)
