Automatically Identify and Label Sections in Scientific Journals Using Conditional Random Fields

Ramesh, Sree Harsha; Dhar, Arnab; Kumar, Raveena R.; V., Anjaly; K.S., Sarath; Pearce, Jason; Sundaresan, Krishna R.

doi:10.1007/978-3-319-46565-4_21

Conference paper
First Online: 09 October 2016

Part of the book series: Communications in Computer and Information Science (CCIS, volume 641)

Abstract

In this paper, we describe a pipeline that automatically converts a journal article in PDF format into XML that conforms to the NLM JATS DTD. First, the text and typographical features are extracted from the document using character-level information. Then, we use a trickle-down, multi-level classifier based on conditional random fields (CRFs): at each level, a pre-trained CRF model classifies a given line of text into one of the DTD tags at a particular depth and feeds the resulting tag into the next-level model as a feature. After identifying tags up to level three, we use separate supervised models for parsing authors, affiliations, references and citations. We employ heuristic-based methods for matching affiliations to authors and citations to references. The JATS XML thus generated is converted into an RDF document. SPARQL queries are then run on the RDF to answer the queries of Task 2 of the Semantic Publishing Challenge.
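The trickle-down step, in which the tag predicted at one level becomes an input feature at the next, can be illustrated with a minimal sketch. This is not the authors' implementation: the three `level*_model` functions are hypothetical rule-based stand-ins for the pre-trained CRF models, they label each line independently rather than as a sequence, and the feature names (`position`, `font_size`, `bold`) and the assignment of JATS tags to levels are assumptions made only for illustration.

```python
from typing import Any, Callable, Dict, List

Features = Dict[str, Any]
# A "model" is any callable that maps one line's features to a tag.
# In the paper this role is played by a pre-trained CRF per level.
Model = Callable[[Features], str]


def level1_model(feats: Features) -> str:
    # Hypothetical stand-in: coarse split by relative position in the document.
    if feats["position"] < 0.2:
        return "front"
    if feats["position"] > 0.8:
        return "back"
    return "body"


def level2_model(feats: Features) -> str:
    # Uses the level-1 tag that the cascade added as a feature.
    parent = feats["level1_tag"]
    if parent == "front":
        return "article-meta"
    if parent == "back":
        return "ref-list"
    return "sec"


def level3_model(feats: Features) -> str:
    # Uses the level-2 tag plus typographical features.
    parent = feats["level2_tag"]
    if parent == "article-meta":
        return "title-group" if feats["font_size"] >= 14 else "contrib-group"
    if parent == "ref-list":
        return "ref"
    return "title" if feats["bold"] else "p"


def cascade(lines: List[Features], models: List[Model]) -> List[Features]:
    """Run the models in order, feeding each level's tag to the next level."""
    tagged = [dict(line) for line in lines]  # copy so the input is untouched
    for depth, model in enumerate(models, start=1):
        for feats in tagged:
            feats[f"level{depth}_tag"] = model(feats)
    return tagged


example_lines: List[Features] = [
    {"text": "Automatically Identify and Label Sections", "position": 0.0, "font_size": 14.0, "bold": True},
    {"text": "Sree Harsha Ramesh, Arnab Dhar", "position": 0.05, "font_size": 10.0, "bold": False},
    {"text": "1 Introduction", "position": 0.3, "font_size": 12.0, "bold": True},
    {"text": "In this paper, we describe a pipeline", "position": 0.35, "font_size": 10.0, "bold": False},
    {"text": "Lafferty, J., McCallum, A., Pereira, F.", "position": 0.95, "font_size": 9.0, "bold": False},
]

result = cascade(example_lines, [level1_model, level2_model, level3_model])
for row in result:
    print(row["level1_tag"], row["level2_tag"], row["level3_tag"], "|", row["text"])
```

Replacing each stand-in with a trained sequence model would keep the `cascade` loop unchanged, since the only contract between levels is the added `level{depth}_tag` feature.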

Notes

1. Semantic Publishing Challenge 2016 - https://github.com/ceurws/lod/wiki/SemPub2016.
2. Typographical features include information about typefaces, point size and line length.
3. NLM JATS DTD. http://dtd.nlm.nih.gov/archiving/tag-library/3.0/index.html.
4. Resource Description Framework (RDF), http://www.w3.org/RDF/.
5. Apache PDFBox. https://pdfbox.apache.org/.
6. For a binary feature, "set" means assigning the value 1. For a multi-categorical feature, the values are discrete integers ranging from 0 to (number of buckets - 1).
7. Stanford NER Tagger: http://nlp.stanford.edu/software/CRF-NER.shtml.
8. CRF++: https://taku910.github.io/crfpp/.
9. CoNLL: http://www.cnts.ua.ac.be/conll2000/chunking/.
10. A subset of scientific journals published by CEUR-WS.org - https://github.com/ceurws/lod/wiki/SemPub16_Task2#training-dataset-td2.
11. Stanford Log-linear Part-Of-Speech Tagger - http://nlp.stanford.edu/software/tagger.shtml.
12. Maxmind Free World Cities Database - https://www.maxmind.com/en/free-world-cities-database.
13. Symbols like *, †, ‡ and §, or numbers 0–9.
14. Vancouver System of Referencing - https://en.wikipedia.org/wiki/Vancouver_system.
15. Harvard Referencing - https://en.wikipedia.org/wiki/Parenthetical_referencing.
16. SPAR, the Semantic Publishing and Referencing Ontologies, is an integrated ecosystem of ontologies such as DoCO and CiTO.
17. Document Components Ontology (DoCO), http://purl.org/spar/doco.
18. https://github.com/ceurws/lod/wiki/SemPub15_Task2#training-dataset-td2.
19. https://github.com/ceurws/lod/wiki/SemPub16_Task2#training-dataset-td2.
20. http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt.
21. https://github.com/angelobo/SemPubEvaluator.
22. https://github.com/ceurws/lod/wiki/SemPub2016#winners.
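The feature encoding described in notes 2 and 6 (binary features set to 1, multi-categorical features mapped to integers from 0 to number of buckets - 1) can be sketched as follows. The notes do not say how bucket boundaries are chosen, so the equal-width bucketing, the value ranges, and the function names `encode_binary` and `encode_bucketed` are assumptions for illustration only.

```python
def encode_binary(is_set: bool) -> int:
    """Binary feature: 1 when the property holds ("set"), otherwise 0."""
    return 1 if is_set else 0


def encode_bucketed(value: float, low: float, high: float, n_buckets: int) -> int:
    """Multi-categorical feature: an integer in 0 .. n_buckets - 1.

    Assumes equal-width buckets over [low, high]; values outside the
    range are clamped to the nearest end.
    """
    if n_buckets < 1:
        raise ValueError("n_buckets must be at least 1")
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(value, low), high)
    index = int((clamped - low) / (high - low) * n_buckets)
    # value == high would land on n_buckets, so cap it at the last bucket.
    return min(index, n_buckets - 1)


# Hypothetical typographical features for one line of text.
line_features = {
    "is_bold": encode_binary(True),
    "point_size_bucket": encode_bucketed(12.0, low=6.0, high=30.0, n_buckets=4),
    "line_length_bucket": encode_bucketed(45.0, low=0.0, high=100.0, n_buckets=5),
}
print(line_features)
```

Discrete integer values of this kind suit template-based CRF toolkits, which treat each feature column as a categorical symbol rather than a real number.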

Author information

Authors and Affiliations

Surukam Analytics, Chennai, Tamil Nadu, India
Sree Harsha Ramesh, Arnab Dhar, Raveena R. Kumar, Anjaly V., Sarath K.S. & Krishna R. Sundaresan
Newgen KnowledgeWorks, Chennai, Tamil Nadu, India
Jason Pearce

Corresponding author

Correspondence to Sree Harsha Ramesh.

Editor information

Editors and Affiliations

IT Systems Engineering, Hasso-Plattner Institute, Potsdam, Germany
Harald Sack
Leibniz Universität Hannover, Hannover, Germany
Stefan Dietze
Elsevier B.V., Amsterdam, The Netherlands
Anna Tordai
Universität Bonn, Bonn, Germany
Christoph Lange

About this paper

Cite this paper

Ramesh, S.H. et al. (2016). Automatically Identify and Label Sections in Scientific Journals Using Conditional Random Fields. In: Sack, H., Dietze, S., Tordai, A., Lange, C. (eds) Semantic Web Challenges. SemWebEval 2016. Communications in Computer and Information Science, vol 641. Springer, Cham. https://doi.org/10.1007/978-3-319-46565-4_21

DOI: https://doi.org/10.1007/978-3-319-46565-4_21
Published: 09 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46564-7
Online ISBN: 978-3-319-46565-4
eBook Packages: Computer Science, Computer Science (R0)