Abstract
Web-based technology disseminates and stores abundant information electronically. Today, people are more comfortable using documents in Portable Document Format (PDF) because the format is independent of the operating system. However, the information in a PDF document is read-only and is suited only to human readers. In addition, a PDF has a non-tagged internal structure, which makes the extraction task difficult. Automatically analyzing and recognizing the detailed structure of a PDF document, especially paragraph and tabular areas, is vital for extracting relevant information precisely for use in other domain applications. A combination of heuristic and rule-based approaches is proposed to automatically identify and recognize the structure of a PDF document. An experimental study was conducted using a collection of construction tender documents in PDF to test the performance of the proposed approach. The precision, recall, and F-measure values show significant results when detecting tabular and paragraph structures.
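The abstract reports precision, recall, and F-measure for detecting tabular and paragraph structures but gives no formulas. The sketch below shows how such scores are commonly computed, using the standard definitions. It is an illustration only, not the authors' evaluation code: the `evaluate` function, the `Scores` type, and the sample region identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f_measure: float


def evaluate(detected: set, expected: set) -> Scores:
    """Compare detected regions against hand-labelled ones (assumed setup)."""
    # A detection counts as correct when it also appears in the labelled set.
    true_positives = len(detected & expected)
    # Precision: share of detected regions that are correct.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of labelled regions that were detected.
    recall = true_positives / len(expected) if expected else 0.0
    # F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f_measure)


# Hypothetical data: each table region is named by a (page, index) pair.
expected_tables = {(1, 0), (2, 0), (2, 1), (3, 0)}
detected_tables = {(1, 0), (2, 0), (3, 0), (3, 1), (4, 0)}

scores = evaluate(detected_tables, expected_tables)
print(scores)
```

Here three of the five detected regions match the four labelled ones, so precision is 3/5 and recall is 3/4. Matching regions by exact identity is a simplification; an evaluation on real documents would more likely match regions by overlap.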
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mohemad, R., Hamdan, A.R., Ali Othman, Z., Mohamad Noor, N.M. (2011). Automatic Recognition of Document Structure from PDF Files. In: Zain, J.M., Wan Mohd, W.M.b., El-Qawasmeh, E. (eds) Software Engineering and Computer Systems. ICSECS 2011. Communications in Computer and Information Science, vol 181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22203-0_24
DOI: https://doi.org/10.1007/978-3-642-22203-0_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22202-3
Online ISBN: 978-3-642-22203-0
eBook Packages: Computer Science (R0)