Abstract
Over the last decade, the number of scientific articles available online has grown substantially, in parallel with the adoption of the Open Access publishing model. Researchers, like any other interested party, are often overwhelmed by the large and continuously growing number of publications they must consider to carry out a complete and careful assessment of the scientific literature. Consequently, new methodologies and automated tools are needed to ease the extraction, semantic representation and browsing of information from papers. We propose a platform that automatically extracts, enriches and characterizes several structural and semantic aspects of scientific publications and represents them as RDF datasets. We analyze papers by relying on the scientific Text Mining Framework developed in the context of the European Project Dr. Inventor. We evaluate how the Framework supports two core scientific text analysis tasks: rhetorical sentence classification and extractive text summarization. To ease the exploration of the distinct facets of scientific knowledge extracted by our platform, we present a set of tailored Web visualizations. We provide online access to both the RDF datasets and the Web visualizations generated by mining the papers of the 2015 ACL-IJCNLP Conference.
This work is (partly) supported by the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502) and by the European Project Dr. Inventor (FP7-ICT-2013.8.1 - Grant no: 611383).
Notes
- 18. Rouge-2 is a measure that compares the bigrams (2-grams) in automatic summaries with those in gold standard summaries.
- 19. Download link: http://backingdata.org/dri/viz/.
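The bigram comparison behind Rouge-2 can be illustrated with a minimal sketch. It assumes lowercased whitespace tokenization and a single gold standard summary, and it reports recall only; the official ROUGE package applies its own preprocessing and options, and the configuration used in the evaluation is not stated here. The function names `bigram_counts` and `rouge_2_recall` are hypothetical.

```python
from collections import Counter


def bigram_counts(text):
    """Count the bigrams of a text (assumption: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2_recall(candidate, reference):
    """Fraction of reference bigrams that also occur in the candidate summary.

    Matches are clipped: a bigram counts at most as many times as it
    appears in the candidate.
    """
    cand = bigram_counts(candidate)
    ref = bigram_counts(reference)
    total = sum(ref.values())
    if total == 0:
        # A reference with fewer than two tokens has no bigrams to match.
        return 0.0
    overlap = sum(min(count, cand[bigram]) for bigram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    gold = "the cat sat on the mat"
    auto = "the cat sat on a mat"
    print(rouge_2_recall(auto, gold))
```

In this toy pair the gold standard summary has five bigrams, three of which also occur in the automatic summary, so the recall is 3/5.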
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Ronzano, F., Saggion, H. (2016). Knowledge Extraction and Modeling from Scientific Publications. In: González-Beltrán, A., Osborne, F., Peroni, S. (eds) Semantics, Analytics, Visualization. Enhancing Scholarly Data. SAVE-SD 2016. Lecture Notes in Computer Science, vol 9792. Springer, Cham. https://doi.org/10.1007/978-3-319-53637-8_2
DOI: https://doi.org/10.1007/978-3-319-53637-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53636-1
Online ISBN: 978-3-319-53637-8
eBook Packages: Computer Science (R0)