PARDA: A Dataset for Scholarly PDF Document Metadata Extraction Evaluation
Abstract
Metadata extraction from scholarly PDF documents is fundamental to publishing, archiving, digital library construction, bibliometrics, and analyses of scientific competitiveness. However, scholarly PDF documents differ in layout and document elements, which makes it difficult to compare extraction approaches: testers draw on different sources of test documents, even when the documents come from the same journal or conference. Performance evaluation based on a standard dataset can therefore provide a fair and reproducible comparison of extraction approaches. In this paper we present such a dataset, PARDA (Pdf Analysis and Recognition DAtaset), for the performance evaluation and analysis of scholarly document processing, with an emphasis on metadata extraction: title, authors, affiliations, author-affiliation-email matching, year, date, etc. The dataset covers computer science, physics, life science, management, mathematics, and the humanities, drawing on publishers including ACM, IEEE, Springer, Elsevier, and arXiv, and each document has a distinct layout and appearance in terms of the formatting of its metadata. We also provide ground-truth metadata for the dataset in Dublin Core XML and BibTeX formats.
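For illustration, the sketch below shows one way Dublin Core ground truth of this kind could be consumed in an evaluation harness. The file path, element layout, field names, and the exact-match metric are assumptions for the sake of the example; the abstract does not specify the actual PARDA schema or evaluation protocol.

```python
# Minimal sketch (assumptions labeled): load a hypothetical PARDA
# ground-truth record in Dublin Core XML and score an extractor's
# output against it with a simple exact-match field accuracy.
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # standard Dublin Core namespace


def load_ground_truth(path):
    """Read title, creators, and date from a Dublin Core XML record."""
    root = ET.parse(path).getroot()
    return {
        "title": root.findtext(f"{DC}title"),
        "authors": [e.text for e in root.findall(f"{DC}creator")],
        "date": root.findtext(f"{DC}date"),
    }


def field_accuracy(extracted, truth):
    """Fraction of ground-truth fields the extractor matched exactly."""
    keys = truth.keys()
    return sum(extracted.get(k) == truth[k] for k in keys) / len(keys)


if __name__ == "__main__":
    # "parda/0001.xml" is a hypothetical path, not part of the dataset spec.
    truth = load_ground_truth("parda/0001.xml")
    extracted = {"title": "...", "authors": ["..."], "date": "2018"}
    print(f"field accuracy: {field_accuracy(extracted, truth):.2f}")
```

Exact string match is deliberately the simplest possible metric here; a real comparison of extraction tools would likely also use normalized or fuzzy matching for fields such as author names.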
Keywords
Metadata extraction · Dataset · Performance evaluation · Document analysis
Acknowledgment
The funding support of this work by the Natural Science Foundation of China (No. 61472109, No. 61572163, No. 61672200, and No. 61772165) is gratefully acknowledged.