Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications

Klampfl, Stefan; Kern, Roman

doi:10.1007/978-3-319-25518-7_9

Part of the book series: Communications in Computer and Information Science (CCIS, volume 548)

Included in the following conference series:

Semantic Web Evaluation Challenges

Abstract

Scholarly publishing increasingly requires automated systems that semantically enrich documents in order to support management and quality assessment of scientific output. However, contextual information, such as the authors’ affiliations, references, and funding agencies, is typically hidden within PDF files. To access this information, we have developed a processing pipeline that analyses the structure of a PDF document using a diverse set of machine learning techniques. First, unsupervised learning is used to extract contiguous text blocks from the raw character stream as the basic logical units of the article. Next, supervised learning is employed to classify blocks into different meta-data categories, including authors and affiliations. Then, a set of heuristics is applied to detect the reference section at the end of the paper and segment it into individual reference strings. Sequence classification is then used to categorise the tokens of individual references to obtain information such as the journal and the year of the reference. Finally, we use named entity recognition techniques to extract references to research grants, funding agencies, and EU projects. Our system is modular in nature. Some parts rely on models learnt on training data, and the overall performance scales with the quality of these data sets.
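
To make the heuristic step of the pipeline more concrete, the following is a minimal sketch of how a reference section might be located and split into individual reference strings. It is not the authors' implementation: the abstract does not specify the heuristics, and this sketch works on plain text lines only. The names `HEADING`, `MARKER`, `find_reference_section`, and `segment_references` are hypothetical, and the sketch assumes that the section starts at a heading such as "References" and that each entry begins with a numeric marker like "[1]" or "1.".

```python
import re

# Assumed heading pattern; real articles vary more widely than this.
HEADING = re.compile(r"^\s*(references|bibliography)\s*$", re.IGNORECASE)
# Assumed marker pattern for numbered entries such as "[12]" or "12."
MARKER = re.compile(r"^\s*(\[\d+\]|\d+\.)\s+")


def find_reference_section(lines):
    """Return the lines after the last reference heading, or [] if none."""
    # Search backwards because the section is expected at the end of the paper.
    for i in range(len(lines) - 1, -1, -1):
        if HEADING.match(lines[i]):
            return lines[i + 1:]
    return []


def segment_references(section_lines):
    """Group lines into reference strings, starting a new one at each marker."""
    refs = []
    for line in section_lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines between entries
        if MARKER.match(line) or not refs:
            # A marker (or the very first line) opens a new reference.
            refs.append(MARKER.sub("", line).strip())
        else:
            # Continuation line: append to the current reference.
            refs[-1] += " " + text
    return refs


page = [
    "Intro text",
    "References",
    "[1] Smith, J.: A paper.",
    "    J. Test. 1(2), 3-4 (2000)",
    "[2] Doe, A.: Another paper. (2001)",
]
references = segment_references(find_reference_section(page))
```

A text-only rule like this breaks easily, for example when a continuation line happens to start with a number followed by a full stop. A more robust approach would also draw on layout and formatting cues.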

Acknowledgements

The presented work was in part developed within the CODE project (grant no. 296150) and within the EEXCESS project (grant no. 600601) funded by the EU FP7, as well as the TEAM IAPP project (grant no. 251514) within the FP7 People Programme. The Know-Center is funded within the Austrian COMET Program Competence Centers for Excellent Technologies under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

Author information

Authors and Affiliations

Know-Center GmbH, Inffeldgasse 13, 8010, Graz, Austria
Stefan Klampfl & Roman Kern

Corresponding author

Correspondence to Stefan Klampfl.

Editor information

Editors and Affiliations

Inria, Sophia Antipolis, France
Fabien Gandon
INRIA Sophia-Antipolis Méditerranée, Sophia Antipolis, France
Elena Cabrio
Université Paris-Sorbonne, Paris, France
Milan Stankovic
École des Mines de Saint-Étienne, Saint-Étienne, France
Antoine Zimmermann

About this paper

Cite this paper

Klampfl, S., Kern, R. (2015). Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications. In: Gandon, F., Cabrio, E., Stankovic, M., Zimmermann, A. (eds) Semantic Web Evaluation Challenges. SemWebEval 2015. Communications in Computer and Information Science, vol 548. Springer, Cham. https://doi.org/10.1007/978-3-319-25518-7_9

DOI: https://doi.org/10.1007/978-3-319-25518-7_9
Published: 07 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25517-0
Online ISBN: 978-3-319-25518-7
eBook Packages: Computer Science, Computer Science (R0)