Abstract
Information extraction from the PDF sources is a tedious task. Most of the existing approaches use either tag-based format such as HTML and XML, or Plain-text format for the extraction of information. In this paper, we present an information extraction technique for research papers which exploits both XML and text formats intelligently. The various patterns and rules are prepared from integrated formats. Furthermore, the intelligent processing of XML and Plain-text for various situations compliments the approach to achieve high accuracy. The proposed approach is a heuristic based approach that extracts the information about logical structure and supportive materials of research papers.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Do, H.H.N., Chandrasekaran, M.K., Cho, P.S., Kan, M.Y.: Extracting and matching authors and affiliations in scholarly documents. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 219–228. ACM (2013)
Di Iorio, A., Peroni, S., Poggi, F., Vitali, F., Shotton, D.: Recognising document components in XML-based academic articles. In: Proceedings of the 2013 ACM Symposium on Document Engineering, pp. 181–184. ACM (2013)
Kim, S., Cho, Y., Ahn, K.: Semi-automatic metadata extraction from scientific journal article for full-text XML conversion. In: Proceedings of the International Conference on Data Mining (DMIN), p. 1 (2014). The Steering Committee of the World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp)
Luong, M.T., Nguyen, T.D., Kan, M.Y.: Logical structure recovery in scholarly articles with rich document features. In: Multimedia Storage and Retrieval Innovations for Digital Library Systems, vol. 270 (2012)
Milicka, M., Burget, R.: Information extraction from web sources based on multi-aspect content analysis. In: Gandon, F., et al. (eds.) SemWebEval 2015. CCIS, vol. 548, pp. 81–92. Springer, Heidelberg (2015). doi:10.1007/978-3-319-25518-7_7
Mohemad, R., Hamdan, A.R., Othman, Z.A., Noor, N.M.: Automatic document structure analysis of structured PDF files. Int. J. New Comput. Architect. Appl. (IJNCAA) 1(2), 404–411 (2011)
Manabe, T., Tajima, K.: Extracting logical hierarchical structure of HTML documents based on headings. Proc. VLDB Endow. 8(12), 1606–1617 (2015)
Nuno, M., Fátima, R.: Extracting structure, text and entities from PDF documents of the portuguese legislation. Institute of Engineering, Polytechnic of Porto, Portugal (2012)
Ramakrishnan, C., Patnia, A., Hovy, E., Burns, G.A.: Layout-aware text extraction from full-text PDF of scientific articles. Source Code Biol. Med. 7(1), 1 (2012)
Saleem, O., Latif, S.: Information extraction from research papers by data integration and data validation from multiple header extraction sources. In: Proceedings of the World Congress on Engineering and Computer Science, vol. 1 (2012)
Constantin, A., Pettifer, S., Voronkov, A.: PDFX: fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the 2013 ACM Symposium on Document Engineering, pp. 177–180. ACM (2013)
Acknowledgments
We would like to thank the computer science department of CUST (Capital University of Science & Technology) to provide us research Lab and knowledgeable support to conduct this research activity. We are also thankful to the members of CDSC (Center for Distributed & Semantic Computing) research group for supporting us in different occasion to complete this research work.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ahmad, R., Afzal, M.T., Qadir, M.A. (2016). Information Extraction from PDF Sources Based on Rule-Based System Using Integrated Formats. In: Sack, H., Dietze, S., Tordai, A., Lange, C. (eds) Semantic Web Challenges. SemWebEval 2016. Communications in Computer and Information Science, vol 641. Springer, Cham. https://doi.org/10.1007/978-3-319-46565-4_23
Download citation
DOI: https://doi.org/10.1007/978-3-319-46565-4_23
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46564-7
Online ISBN: 978-3-319-46565-4
eBook Packages: Computer ScienceComputer Science (R0)