Abstract
With the rapid growth of scientific documents in digital libraries, the demand for searching both whole documents and their specific components has increased dramatically. Accurately detecting component boundaries is vital to all subsequent information extraction and applications. However, detecting document component boundaries (especially for tables, figures, and equations) is challenging because formats and layouts are not standardized across diverse documents.
This paper presents an efficient document preprocessing technique that improves component boundary detection by taking advantage of the nature of document lines. Our method reduces the component boundary detection problem to a sparse line analysis problem with much less noise. We define eight document line label types and apply machine learning techniques as well as a heuristic rule-based method to identify multiple document components. By combining different heuristic rules, we extract multiple components in a batch, filtering out massive noise as early as possible. Our method focuses on an important untagged document format: PDF. The experimental results demonstrate the effectiveness of the sparse line analysis.
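The idea of filtering document lines down to sparse ones can be illustrated with a minimal sketch. The abstract does not give the definition of a sparse line, the eight label types, or any thresholds, so everything below is an assumption made for illustration: the `Line` type, the `is_sparse` function, and the `min_width_ratio` and `max_gap` parameters are hypothetical names, and the two tests (line much narrower than its column, or a large horizontal gap between adjacent words) are one plausible reading rather than the authors' method.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    """One text line: (x_start, x_end, text) words ordered left to right."""
    words: List[Tuple[float, float, str]]


def is_sparse(line: Line, column_width: float,
              min_width_ratio: float = 0.5, max_gap: float = 20.0) -> bool:
    """Heuristic sparse-line test (assumed definition, hypothetical thresholds).

    A line counts as sparse if it covers much less than the column width,
    or if any gap between adjacent words exceeds max_gap.
    """
    if not line.words:
        return False
    width = line.words[-1][1] - line.words[0][0]
    if width < min_width_ratio * column_width:
        return True
    # Horizontal gap between the end of one word and the start of the next.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(line.words, line.words[1:])]
    return any(gap > max_gap for gap in gaps)


def sparse_lines(lines: List[Line], column_width: float) -> List[Line]:
    """Keep only the sparse lines, discarding ordinary body text early."""
    return [line for line in lines if is_sparse(line, column_width)]


body = Line([(0, 20, "The"), (24, 80, "method"), (84, 160, "processes"), (164, 240, "lines")])
table_row = Line([(0, 30, "Year"), (100, 130, "Docs"), (200, 230, "Tables")])
caption = Line([(0, 40, "Table 1")])

kept = sparse_lines([body, table_row, caption], column_width=250.0)
```

In this sketch the dense body line is dropped while the table row and the short caption survive, so later labeling steps would see far fewer candidate lines.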
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, Y., Bai, K., Gao, L. (2011). An Efficient Pre-processing Method to Identify Logical Components from PDF Documents. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_41
DOI: https://doi.org/10.1007/978-3-642-20841-6_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20840-9
Online ISBN: 978-3-642-20841-6
eBook Packages: Computer Science, Computer Science (R0)