Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications

  • Martin Körner
  • Behnam Ghavimi
  • Philipp Mayr
  • Heinrich Hartmann
  • Steffen Staab
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 767)


The extraction of individual reference strings from the reference section of scientific publications is an important step in the citation extraction pipeline. Current approaches divide this task into two steps: they first detect the reference section areas and then group the text lines in those areas into reference strings. We propose a classification model that considers every line in a publication as a potential part of a reference string. Applying conditional random fields to lines rather than constructing the graphical model from individual words lets the dependencies and patterns that are typical of reference sections serve as strong features while reducing the overall complexity of the model. We evaluated our novel approach, RefExt, against several state-of-the-art tools (CERMINE, GROBID, and ParsCit) on a gold standard consisting of 100 German-language full-text publications from the social sciences. The evaluation demonstrates that RefExt outperforms state-of-the-art tools that rely on the identification of reference section areas.
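To make the line-based formulation concrete, the following is a minimal Python sketch of the two pieces surrounding such a classifier: turning each line into a feature dictionary, and grouping per-line labels back into reference strings. It is an illustration under stated assumptions, not the RefExt implementation. The feature names, the `B-REF` / `I-REF` / `O` label scheme, and both function names are hypothetical, and the labels are supplied by hand here instead of being predicted by a trained conditional random field.

```python
import re
from typing import Dict, List


def line_features(line: str) -> Dict[str, object]:
    """Hypothetical per-line features; the actual feature set is not given in this abstract."""
    stripped = line.strip()
    return {
        "starts_with_capital": stripped[:1].isupper(),
        "contains_year": bool(re.search(r"\b(19|20)\d{2}\b", stripped)),
        "ends_with_period": stripped.endswith("."),
        "num_commas": stripped.count(","),
    }


def group_reference_strings(lines: List[str], labels: List[str]) -> List[str]:
    """Join lines into reference strings using per-line labels.

    Assumed label scheme: B-REF starts a reference string, I-REF continues
    the current one, and O marks a line outside any reference string.
    """
    references: List[str] = []
    current: List[str] = []
    for line, label in zip(lines, labels):
        if label == "B-REF":
            # A new reference begins; close the previous one if it is open.
            if current:
                references.append(" ".join(current))
            current = [line.strip()]
        elif label == "I-REF" and current:
            current.append(line.strip())
        else:
            # O (or a stray I-REF with no open reference) closes the current one.
            if current:
                references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references


# Hand-labeled toy input; a trained sequence model would predict these labels.
lines = [
    "Literatur",
    "Moed, H.F.: Citation Analysis in Research",
    "Evaluation. Springer, Dordrecht (2005)",
    "Koller, D., Friedman, N.: Probabilistic Graphical Models. MIT Press (2009)",
    "12",
]
labels = ["O", "B-REF", "I-REF", "B-REF", "O"]
refs = group_reference_strings(lines, labels)
```

Because every line in the publication receives a label, no separate step for locating the reference section is needed in this sketch; lines outside reference strings are simply labeled `O`.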


Keywords: Reference extraction; Citations; Conditional random fields; German language papers



This work has been funded by Deutsche Forschungsgemeinschaft (DFG) as part of the project “Extraction of Citations from PDF Documents (EXCITE)” under grant numbers MA 3964/8-1 and STA 572/14-1. We would like to thank Dominika Tkaczyk for her support regarding the CERMINE tool as well as Alexandra Bormann, Jan Hübner, and Daniel Kostić for contributing to the gold standard that was used in this research.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Institute for Web Science and Technologies, University of Koblenz-Landau, Koblenz, Germany
  2. GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
  3. Independent, Munich, Germany
