Abstract
This is a report on ongoing work in a research project for Small and Medium-sized Enterprises (SMEs), funded by the German Federal Ministry of Education and Research (Funding ID: 01IS15056D; project duration: Jan 2016 – Dec 2017). The project, named OntoPMS, targets post market surveillance (PMS) of medical devices as required by the Medical Device Regulation (MDR; Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, OJ. L, pp 1–175, 2017), which entered into force in May 2017 following its formal publication. Being a regulation, it is immediately legally binding in all member states of the European Union. The project aims to provide both technical support and supporting procedures that serve the aims stated in recital 4 of the MDR: “Key elements of the existing regulatory approach, such as the supervision of notified bodies, conformity assessment procedures, clinical investigations and clinical evaluation, vigilance and market surveillance should be significantly reinforced, whilst provisions ensuring transparency and traceability regarding medical devices should be introduced, to improve health and safety.” This chapter focuses on one component of the software system under development, the corpus builder. This component retrieves scientific publications of interest from the web and other sources, checks them for relevance, and transfers them both to a linguistic corpus and to a search engine based on the open source package Elasticsearch. The challenge in this case was not to collect everything within reach (whole web crawling) but to find and take only those publications that belong to the domain of interest and are relevant to surveillance aspects. The guiding principle was therefore to build comprehensive yet minimal corpora for the purposes at hand.
Although the software has been developed in the context of medical device PMS, its use is not bound in any way to this specific application area.
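The "check for relevance, then keep" step described above can be illustrated with a minimal sketch. It is not the OntoPMS implementation: the `Document` type, the term-overlap score, the threshold of 0.5, and the example terms are all assumptions made for illustration, and the transfer to the linguistic corpus and to Elasticsearch is left out.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Document:
    """A retrieved publication (hypothetical structure)."""
    url: str
    text: str


def relevance_score(text: str, domain_terms: Set[str]) -> float:
    """Return the fraction of domain terms that occur in the text (0.0 to 1.0)."""
    if not domain_terms:
        return 0.0
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(domain_terms & tokens) / len(domain_terms)


def build_corpus(candidates: Iterable[Document],
                 domain_terms: Set[str],
                 threshold: float = 0.5) -> List[Document]:
    """Keep only candidates whose score reaches the (assumed) threshold."""
    return [d for d in candidates
            if relevance_score(d.text, domain_terms) >= threshold]


# Illustrative domain vocabulary and candidate documents (made up).
terms = {"stent", "fracture", "recall", "implant"}
candidates = [
    Document("doc-1", "Stent fracture after implant led to a recall."),
    Document("doc-2", "A recipe for apple pie."),
    Document("doc-3", "The implant was well tolerated."),
]
corpus = build_corpus(candidates, terms)
```

With these inputs only `doc-1` is kept: it contains all four terms, whereas `doc-3` matches one of four and falls below the threshold. A real relevance check would likely draw on richer signals, such as an ontology of the domain, than plain term overlap.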
Notes
1. Web Crawler: https://en.wikipedia.org/wiki/Web_crawler.
2. Uniform Resource Locator: https://en.wikipedia.org/wiki/URL.
3. Web Scraping: https://en.wikipedia.org/wiki/Web_scraping.
4. Inverted Index: https://en.wikipedia.org/wiki/Inverted_index.
5. Vertical Search: https://en.wikipedia.org/wiki/Vertical_search.
6. Markup Language: https://en.wikipedia.org/wiki/Markup_language.
7. Hypertext Markup Language: https://en.wikipedia.org/wiki/HTML.
8. Portable Document Format: https://en.wikipedia.org/wiki/Portable_Document_Format.
9. Standard Operating Procedure: https://en.wikipedia.org/wiki/Standard_operating_procedure.
10. Ontology: https://en.wikipedia.org/wiki/Ontology.
11. Ontology in Computer Science: https://en.wikipedia.org/wiki/Ontology_(information_science).
12. Part of Speech Tagging: https://en.wikipedia.org/wiki/Part-of-speech_tagging.
13. Query Language: https://en.wikipedia.org/wiki/Query_language.
14. Ontology: https://en.wikipedia.org/wiki/Ontology.
15. Ontology in Computer Science: https://en.wikipedia.org/wiki/Ontology_(information_science).
16. Apache Lucene Core: https://lucene.apache.org/core/.
17. Apache Solr: http://lucene.apache.org/solr/features.html.
18. gensim, Topic Modeling for Humans: https://radimrehurek.com/gensim/.
19. Domain Names: https://en.wikipedia.org/wiki/Domain_name.
20. Apache Nutch Web Crawler: http://nutch.apache.org/.
21. NLTK Natural Language Toolkit: http://www.nltk.org/.
22. Industrial-Strength Natural Language Processing: https://spacy.io/.
23. Angular cross platform web framework: https://angular.io/.
24. MAUDE – Manufacturer and User Facility Device Experience: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm.
References
Herre H (2010) General formal ontology (GFO): a foundational ontology for conceptual modelling. In: Poli R, Healy M, Kameas A (eds) Theory and applications of ontology: computer applications. Springer, Dordrecht, pp 297–345
Uciteli A, Goller C, Burek P, Siemoleit S, Faria B, Galanzina H, Weiland T, Drechsler-Hake D, Bartussek W, Herre H (2014) Search ontology, a new approach towards semantic search. In: Plödereder E, Grunske L, Schneider E, Ull D (eds) FoRESEE: Future Search Engines 2014. 44th annual meeting of the GI, Stuttgart. GI edition proceedings LNI. Köllen, Bonn, pp 667–672
Medical Device Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, OJ. L (2017) pp 1–175
Acknowledgements
Acknowledgements go to all participants of the OntoPMS consortium. With respect to ontologies, accompanying workflows, and available technologies, I would like to thank Prof. Heinrich Herre, Alexandr Uciteli, and Stephan Kropf from the IMISE at the University of Leipzig for many inspiring conversations. I wouldn’t have had much chance to understand medical regulations in Europe without the help of the novineon personnel Timo Weiland (consortium project lead), Prof. Marc O. Schurr, Stefanie Meese, Klaus Gräf, and the quality manager from Ovesco, Matthias Leenen. The participants from the BfArM, the German Federal Institute for Drugs and Medical Devices, Prof. Wolfgang Lauer and Robin Seidel, helped me understand the MAUDE database (see note 24) and how to connect it to the CorpusBuilder. IntraFind (Christoph Goller and Philipp Blohm) developed an ingenious enhancement to the search engine exploiting the corpus, and MT2IT (Prof. Jörg-Uwe Meyer, Michael Witte) will provide the structures of the overall system in which the CorpusBuilder will be embedded. I would also like to thank my colleagues at OntoPort, Anatol Reibold and Günter Lutz-Misof, for their astute remarks on earlier versions of this chapter.
Copyright information
© 2018 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Bartussek, W. (2018). Building Concise Text Corpora from Web Contents. In: Hoppe, T., Humm, B., Reibold, A. (eds) Semantic Applications. Springer Vieweg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-55433-3_8
DOI: https://doi.org/10.1007/978-3-662-55433-3_8
Publisher Name: Springer Vieweg, Berlin, Heidelberg
Print ISBN: 978-3-662-55432-6
Online ISBN: 978-3-662-55433-3
eBook Packages: Computer Science, Computer Science (R0)