Automatic Recognition of Document Structure from PDF Files

  • Conference paper
Software Engineering and Computer Systems (ICSECS 2011)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 181)

Abstract

Web-based technology disseminates and stores abundant information electronically. Today, people are more comfortable using documents in Portable Document Format (PDF) because of its operating-system independence. However, information in PDF documents, which are in read-only mode, is usable only by human readers. In addition, PDF has a non-tagged internal structure, which makes the extraction task difficult. Detailed automatic analysis and recognition of PDF document structures, especially paragraph and tabular areas, is vital for extracting relevant information precisely for use in other domain applications. A combination of heuristic and rule-based approaches is proposed to automatically identify and recognize the structure of PDF documents. An experimental study has been conducted using a collection of construction tender documents in PDF to test the performance of the proposed approach. The precision, recall and F-measure scores show significant results when detecting tabular and paragraph structures.
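The abstract reports precision, recall and F-measure but does not show how they were computed. The following is a minimal sketch of the standard definitions, assuming that detection results and ground truth can each be represented as a set of item identifiers (here, line indices labelled as belonging to a table). The function name and the sample data are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(predicted, actual):
    """Return (precision, recall, F1) for two collections of item identifiers."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)

    # Precision: fraction of detected items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of ground-truth items that were detected.
    recall = true_positives / len(actual) if actual else 0.0

    # F1: harmonic mean of precision and recall (0 when both are 0).
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: line indices a detector labelled as "table"
# versus the manually annotated ground truth.
detected_table_lines = {3, 4, 5, 6, 9}
true_table_lines = {4, 5, 6, 7}

p, r, f = precision_recall_f1(detected_table_lines, true_table_lines)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this illustrative case three of the five detected lines are correct and three of the four true lines are found, so precision is 0.60 and recall is 0.75. The same function could be applied separately to paragraph and tabular regions.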

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mohemad, R., Hamdan, A.R., Ali Othman, Z., Mohamad Noor, N.M. (2011). Automatic Recognition of Document Structure from PDF Files. In: Zain, J.M., Wan Mohd, W.M.b., El-Qawasmeh, E. (eds) Software Engineering and Computer Systems. ICSECS 2011. Communications in Computer and Information Science, vol 181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22203-0_24

  • DOI: https://doi.org/10.1007/978-3-642-22203-0_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22202-3

  • Online ISBN: 978-3-642-22203-0

  • eBook Packages: Computer Science, Computer Science (R0)
