Metadata extraction for scientific literature aims to automatically annotate each paper with metadata that represents its most valuable information, including the problem, method and dataset. Most existing work extracts keywords or key phrases as concepts for further analysis without assigning them fine-grained types. In this paper, we present a three-stage supervised method to address the problem. The first step extracts key phrases as metadata candidates, and the second step introduces various features, i.e., statistical features, linguistic features, position features and a novel fine-grained distribution feature that is highly relevant to metadata categories, to classify the candidates into the three aforementioned categories. In the evaluation, we conduct extensive experiments on a manually labeled dataset from the ACL Anthology, and the results show that our proposed method achieves a +3.2% improvement in accuracy compared with strong baseline methods.
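The first two steps can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the stopword list, the n-gram candidate rule, the function names and the segment-based reading of the "distribution" feature are all assumptions made for illustration, since the abstract does not define them.

```python
from collections import Counter

# Hypothetical stopword list; the abstract does not specify a filtering rule.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "and", "to", "we", "is", "with", "our"}


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    stripped = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in stripped if w]


def candidate_phrases(tokens, max_len=3):
    """Step 1 (simplified assumption): every stopword-free n-gram is a candidate."""
    candidates = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(w in STOPWORDS for w in gram):
                candidates[" ".join(gram)] += 1
    return candidates


def features(phrase, tokens, n_segments=4):
    """Step 2 (simplified assumption): frequency, position and distribution features."""
    words = phrase.split()
    n = len(words)
    starts = [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]
    # Assumed "distribution" feature: share of occurrences that fall in each
    # of n_segments equal-width slices of the document.
    segment_counts = [0] * n_segments
    for s in starts:
        segment_counts[min(s * n_segments // len(tokens), n_segments - 1)] += 1
    total = len(starts)
    return {
        "frequency": total,
        "first_position": starts[0] / len(tokens) if starts else 1.0,
        "distribution": [c / total if total else 0.0 for c in segment_counts],
    }
```

Calling `candidate_phrases(tokenize(text))` yields the candidates, and `features(phrase, tokens)` returns a feature dictionary per candidate. In a supervised setting, such dictionaries would be converted to vectors and passed, together with the manual labels (problem, method, dataset), to a classifier of choice.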


Metadata extraction, Scientific literature, Fine-grained distribution, Classification



This research project is supported by the Major Project of the National Language Committee of the 13th Five-Year Research Plan in 2016 (ZDI135-3), the Fundamental Research Funds for the Central Universities, and the Research Funds of Beijing Language and Culture University (17YCX148).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Binjie Meng (1, 2)
  • Lei Hou (3)
  • Erhong Yang (1, 2), corresponding author
  • Juanzi Li (3)
  1. Beijing Advanced Innovation Center for Language Resources, Beijing Language and Culture University, Beijing, China
  2. School of Information Science, Beijing Language and Culture University, Beijing, China
  3. Department of Computer Science and Technology, Tsinghua University, Beijing, China
