Knowledge Extraction and Modeling from Scientific Publications

Ronzano, Francesco; Saggion, Horacio

doi:10.1007/978-3-319-53637-8_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9792)

Included in the following conference series:

International Workshop on Semantics, Analytics, Visualization

Abstract

During the last decade, the number of scientific articles available online has grown substantially, in parallel with the adoption of the Open Access publishing model. Researchers, like any other interested reader, are often overwhelmed by the large and continuously growing number of publications they must consider in order to perform a complete and careful assessment of the scientific literature. As a consequence, new methodologies and automated tools are needed to ease the extraction, semantic representation and browsing of information from papers. We propose a platform that automatically extracts, enriches and characterizes several structural and semantic aspects of scientific publications, representing them as RDF datasets. We analyze papers by relying on the scientific Text Mining Framework developed in the context of the European Project Dr. Inventor. We evaluate how the Framework supports two core scientific text analysis tasks: rhetorical sentence classification and extractive text summarization. To ease the exploration of the distinct facets of scientific knowledge extracted by our platform, we present a set of tailored Web visualizations. We provide online access to both the RDF datasets and the Web visualizations generated by mining the papers of the 2015 ACL-IJCNLP Conference.

This work is (partly) supported by the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502) and by the European Project Dr. Inventor (FP7-ICT-2013.8.1 - Grant no: 611383).

Notes

1. http://www.ncbi.nlm.nih.gov/pubmed.
2. http://www.scopus.com/.
3. http://www.webofknowledge.com/.
4. https://doaj.org/.
5. http://jats.nlm.nih.gov/.
6. http://www.elsevier.com/author-schemas/elsevier-xml-dtds-and-transport-schemas.
7. https://rawgit.com/essepuntato/rash/master/documentation/index.html.
8. http://backingdata.org/dri/library/.
9. http://pdfx.cs.man.ac.uk/.
10. http://www.bibsonomy.org/help/doc/api.html.
11. http://search.crossref.org/help/api.
12. http://freecite.library.brown.edu/welcome.
13. https://code.google.com/p/mate-tools/.
14. http://sempub.taln.upf.edu/dricorpus.
15. http://babelfy.org/.
16. http://babelnet.org/.
17. http://www.taln.upf.edu/pages/summa.upf/.
18. ROUGE-2 is a measure that compares bigrams (n-grams with n = 2) in automatic summaries to those in gold standard summaries.
19. Download link: http://backingdata.org/dri/viz/.
20. http://www.sparontologies.net/.
21. http://backingdata.org/dri/viz/.
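
The bigram comparison described in note 18 can be illustrated with a minimal sketch. This is not the evaluation code used in the chapter and not the official ROUGE package: it assumes plain lowercase whitespace tokenization and computes only the recall variant against a single gold standard summary. The function names `bigrams` and `rouge_2_recall` are hypothetical, chosen for this illustration.

```python
from collections import Counter


def bigrams(text):
    """Return a multiset of adjacent word pairs.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; real ROUGE implementations offer further options
    such as stemming or stopword removal.
    """
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2_recall(candidate, reference):
    """Fraction of the reference bigrams that also occur in the candidate.

    Counts are clipped, so a bigram repeated in the candidate is
    credited at most as often as it appears in the reference.
    """
    cand, ref = bigrams(candidate), bigrams(reference)
    total = sum(ref.values())
    if total == 0:
        # A reference with fewer than two tokens has no bigrams.
        return 0.0
    overlap = sum(min(count, cand[bg]) for bg, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    gold = "the platform extracts knowledge from papers"
    auto = "the platform extracts facts from papers"
    # Three of the five gold bigrams also appear in the automatic summary.
    print(rouge_2_recall(auto, gold))
```

In this toy pair, the shared bigrams are "the platform", "platform extracts" and "from papers", so the score is 3 out of 5. Recall is shown because it rewards covering the gold standard content; precision and F-measure variants follow the same counting with a different denominator.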

Author information

Authors and Affiliations

Natural Language Processing Group (TALN), Universitat Pompeu Fabra, Barcelona, Spain
Francesco Ronzano & Horacio Saggion

Corresponding author

Correspondence to Francesco Ronzano.

Editor information

Editors and Affiliations

Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom
Alejandra González-Beltrán
Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom
Francesco Osborne
Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
Silvio Peroni

About this paper

Cite this paper

Ronzano, F., Saggion, H. (2016). Knowledge Extraction and Modeling from Scientific Publications. In: González-Beltrán, A., Osborne, F., Peroni, S. (eds) Semantics, Analytics, Visualization. Enhancing Scholarly Data. SAVE-SD 2016. Lecture Notes in Computer Science, vol 9792. Springer, Cham. https://doi.org/10.1007/978-3-319-53637-8_2

DOI: https://doi.org/10.1007/978-3-319-53637-8_2
Published: 10 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53636-1
Online ISBN: 978-3-319-53637-8
eBook Packages: Computer Science, Computer Science (R0)
