Abstract
Tables are among the most informative components of documents, because they represent data compactly and intuitively, typically to make it easier to understand. Two needs arise: identifying and extracting tables from documents, and extracting the data they contain. The latter task requires understanding the table structure. Due to the variability in style, size, and purpose of tables, algorithmic approaches to this task can be insufficient, and machine learning systems may offer an effective solution. This paper proposes a first-order logic representation, able to capture the complex spatial relationships involved in a table structure, together with a learning system that combines the power of this representation with the flexibility of statistical approaches. The encouraging results obtained suggest further investigation and refinement of the proposal.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Di Mauro, N., Ferilli, S., Esposito, F. (2013). Learning to Recognize Critical Cells in Document Tables. In: Agosti, M., Esposito, F., Ferilli, S., Ferro, N. (eds) Digital Libraries and Archives. IRCDL 2012. Communications in Computer and Information Science, vol 354. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35834-0_12
DOI: https://doi.org/10.1007/978-3-642-35834-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35833-3
Online ISBN: 978-3-642-35834-0
eBook Packages: Computer Science, Computer Science (R0)