Abstract
In this paper, we describe a pipeline that automatically converts a journal article in PDF format to an XML document that conforms to the NLM JATS DTD. First, the text and typographical features are extracted from the document using character-level information. Then, we use a trickle-down, multi-level classifier based on conditional random fields (CRFs): at each level, a pre-trained CRF model classifies a given line of text into one of the DTD tags at a particular depth and feeds the resulting tag into the next-level model as a feature. After identifying tags up to level three, we use separate supervised models for parsing authors, affiliations, references and citations. We employ heuristic-based methods for matching affiliations to authors and citations to references. The JATS XML thus generated is converted into an RDF document, and SPARQL queries are run on the RDF to address the queries of Task 2 of the Semantic Publishing Challenge.
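The trickle-down idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `cascade_tags`, the feature name `font_size`, the key pattern `level{n}_tag`, and the two rule-based stand-in models are all hypothetical, and in the paper's setting each stand-in would be replaced by a pre-trained CRF model.

```python
from typing import Callable, Dict, List, Sequence

Features = Dict[str, object]
# A level model maps a sequence of per-line feature dicts to one tag per line.
# Here it is a plain callable; a trained CRF would sit behind the same interface.
LevelModel = Callable[[Sequence[Features]], List[str]]


def cascade_tags(lines: Sequence[Features],
                 models: Sequence[LevelModel]) -> List[List[str]]:
    """Run the level models in order, feeding each level's tag to the next."""
    enriched = [dict(f) for f in lines]  # copy so the caller's dicts stay untouched
    tags_per_level: List[List[str]] = []
    for depth, model in enumerate(models, start=1):
        tags = model(enriched)
        if len(tags) != len(enriched):
            raise ValueError(f"level {depth} returned {len(tags)} tags "
                             f"for {len(enriched)} lines")
        tags_per_level.append(tags)
        # The predicted tag becomes an input feature for the next level.
        for feats, tag in zip(enriched, tags):
            feats[f"level{depth}_tag"] = tag
    return tags_per_level


# Hypothetical rule-based stand-ins for the trained level-1 and level-2 models.
def toy_level1(seq: Sequence[Features]) -> List[str]:
    return ["front" if f["font_size"] >= 14 else "body" for f in seq]


def toy_level2(seq: Sequence[Features]) -> List[str]:
    # Uses the level-1 tag that the cascade added as a feature.
    return ["article-meta" if f["level1_tag"] == "front" else "sec" for f in seq]


if __name__ == "__main__":
    page = [{"font_size": 18}, {"font_size": 10}]
    for depth, tags in enumerate(cascade_tags(page, [toy_level1, toy_level2]), 1):
        print(depth, tags)
```

Keeping each level behind the same callable interface means a third level, or a different sequence labeller, can be added to the list without changing the cascade loop.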
Notes
- 1.
Semantic Publishing Challenge 2016 - https://github.com/ceurws/lod/wiki/SemPub2016.
- 2.
Typographical features include information about typefaces, point size and line length.
- 3.
NLM JATS DTD. http://dtd.nlm.nih.gov/archiving/tag-library/3.0/index.html.
- 4.
Resource Description Framework (RDF), http://www.w3.org/RDF/.
- 5.
Apache PDFBox. https://pdfbox.apache.org/.
- 6.
For a binary feature, "set" means assigning the value 1. For a multi-categorical feature, the values are discrete integers ranging from 0 to (number of buckets - 1).
- 7.
Stanford NER Tagger: http://nlp.stanford.edu/software/CRF-NER.shtml.
- 8.
CRF++: https://taku910.github.io/crfpp/.
- 9.
- 10.
A subset of scientific journals published at CEUR-WS.org - https://github.com/ceurws/lod/wiki/SemPub16_Task2#training-dataset-td2.
- 11.
Stanford Log-linear Part-Of-Speech Tagger - http://nlp.stanford.edu/software/tagger.shtml.
- 12.
Maxmind Free World Cities Database - https://www.maxmind.com/en/free-world-cities-database.
- 13.
Symbols like *, \(\dagger \), \(\ddagger \) and \(\S \), or numbers 0–9.
- 14.
Vancouver System of Referencing - https://en.wikipedia.org/wiki/Vancouver_system.
- 15.
Harvard Referencing - https://en.wikipedia.org/wiki/Parenthetical_referencing.
- 16.
SPAR, the Semantic Publishing and Referencing Ontologies, is an integrated ecosystem of ontologies such as DoCO and CiTO.
- 17.
Document Components Ontology (DoCO), http://purl.org/spar/doco.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ramesh, S.H. et al. (2016). Automatically Identify and Label Sections in Scientific Journals Using Conditional Random Fields. In: Sack, H., Dietze, S., Tordai, A., Lange, C. (eds) Semantic Web Challenges. SemWebEval 2016. Communications in Computer and Information Science, vol 641. Springer, Cham. https://doi.org/10.1007/978-3-319-46565-4_21
DOI: https://doi.org/10.1007/978-3-319-46565-4_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46564-7
Online ISBN: 978-3-319-46565-4
eBook Packages: Computer Science, Computer Science (R0)