Automatic Recognition of Document Structure from PDF Files

Mohemad, Rosmayati; Hamdan, Abdul Razak; Ali Othman, Zulaiha; Mohamad Noor, Noor Maizura

doi:10.1007/978-3-642-22203-0_24

Part of the book series: Communications in Computer and Information Science (CCIS, volume 181)

Included in the following conference series:

International Conference on Software Engineering and Computer Systems

Abstract

Web-based technology disseminates and stores abundant information electronically. Today, people are more comfortable using documents in Portable Document Format (PDF) because the format is independent of the operating system. However, information in PDF documents, which are read-only, is accessible only to human readers. In addition, PDF has a non-tagged internal structure, which makes the extraction task difficult. Detailed automatic analysis and recognition of PDF document structures, especially paragraph and tabular areas, is vital for extracting relevant information precisely for use in other domain applications. A combination of heuristic and rule-based approaches is proposed to automatically identify and recognize the structure of PDF documents. An experimental study has been conducted using a collection of construction tender documents in PDF to test the performance of the proposed approach. The precision, recall and F-measure results were significant for the detection of tabular and paragraph structure.
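
The abstract reports precision, recall and F-measure without giving their definitions. As a minimal sketch, assuming the standard definitions of these metrics and the balanced F-measure (F1), they can be computed from detection counts as shown here. The function name and the counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) for a structure-detection run.

    The counts are assumed to come from comparing detected regions
    (for example tables or paragraphs) against a manual ground truth.
    """
    detected = true_positives + false_positives   # regions the system reported
    actual = true_positives + false_negatives     # regions present in the ground truth
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only (not results from the paper).
p, r, f = precision_recall_f1(true_positives=8, false_positives=2, false_negatives=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The zero checks make the function return 0.0 rather than raise an error when nothing was detected or nothing was present in the ground truth.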

Author information

Authors and Affiliations

Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor Darul Ehsan, Malaysia
Rosmayati Mohemad, Abdul Razak Hamdan & Zulaiha Ali Othman
Department of Computer Science, Faculty of Science and Technology, Universiti Malaysia Terengganu, 21030, Kuala Terengganu, Terengganu Darul Iman, Malaysia
Rosmayati Mohemad & Noor Maizura Mohamad Noor

Editor information

Editors and Affiliations

Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300, Gambang, Kuantan, Pahang, Malaysia
Jasni Mohamad Zain & Wan Maseri bt Wan Mohd
Information Systems Department, King Saud University, 11543, Riyadh, Saudi Arabia
Eyas El-Qawasmeh

Cite this paper

Mohemad, R., Hamdan, A.R., Ali Othman, Z., Mohamad Noor, N.M. (2011). Automatic Recognition of Document Structure from PDF Files. In: Zain, J.M., Wan Mohd, W.M.b., El-Qawasmeh, E. (eds) Software Engineering and Computer Systems. ICSECS 2011. Communications in Computer and Information Science, vol 181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22203-0_24

DOI: https://doi.org/10.1007/978-3-642-22203-0_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22202-3
Online ISBN: 978-3-642-22203-0
eBook Packages: Computer Science (R0)
