An Efficient Pre-processing Method to Identify Logical Components from PDF Documents

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6634)

Abstract

With the rapid growth of scientific documents in digital libraries, the demand for searching both whole documents and specific components within them has increased dramatically. Accurately detecting component boundaries is vital to all subsequent information extraction and applications. However, detecting document component boundaries (especially those of tables, figures, and equations) is a challenging problem because there are no standardized formats or layouts across diverse documents.

This paper presents an efficient document preprocessing technique that improves document component boundary detection by exploiting the properties of document lines. Our method reduces the component boundary detection problem to a sparse line analysis problem with much less noise. We define eight document line label types and apply machine learning techniques as well as heuristic rule-based methods to identify multiple document components. By combining different heuristic rules, we extract multiple components in a batch, filtering out most of the noise as early as possible. Our method focuses on an important untagged document format: PDF. The experimental results demonstrate the effectiveness of sparse line analysis.
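The abstract does not say how a line is judged to be sparse, so the following is only a minimal sketch of the general idea of filtering lines early. The `Line` type, the `is_sparse` test, and the `max_gap` and `min_fill` thresholds are all hypothetical: the sketch assumes a line counts as sparse when it is short relative to the column width or contains a wide horizontal gap between adjacent words, which may differ from the criteria used in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    """One text line; each word is (x_start, x_end, text) in page units."""
    words: List[Tuple[float, float, str]]


def is_sparse(line: Line, column_width: float,
              max_gap: float = 20.0, min_fill: float = 0.5) -> bool:
    """Hypothetical sparse-line test (assumed criteria, not from the paper).

    A line is treated as sparse if it covers less than `min_fill` of the
    column width, or if any gap between adjacent words exceeds `max_gap`.
    """
    if not line.words:
        return False
    words = sorted(line.words)  # order words left to right by x_start
    width = words[-1][1] - words[0][0]
    if width < min_fill * column_width:
        return True
    gaps = (nxt[0] - cur[1] for cur, nxt in zip(words, words[1:]))
    return any(gap > max_gap for gap in gaps)


def sparse_lines(lines: List[Line], column_width: float) -> List[Tuple[int, Line]]:
    """Keep only sparse lines, paired with their original line index."""
    return [(i, ln) for i, ln in enumerate(lines) if is_sparse(ln, column_width)]
```

Under these assumptions, dense body-text lines are discarded before any classifier or rule runs, so later stages (for example, assigning one of the line label types) only see the much smaller set of candidate lines, with their original indices preserved for boundary reconstruction.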

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, Y., Bai, K., Gao, L. (2011). An Efficient Pre-processing Method to Identify Logical Components from PDF Documents. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_41

  • DOI: https://doi.org/10.1007/978-3-642-20841-6_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20840-9

  • Online ISBN: 978-3-642-20841-6

  • eBook Packages: Computer Science, Computer Science (R0)