Learning to Recognize Critical Cells in Document Tables

Di Mauro, Nicola; Ferilli, Stefano; Esposito, Floriana

doi:10.1007/978-3-642-35834-0_12

Conference paper


Part of the book series: Communications in Computer and Information Science (CCIS, volume 354)

Abstract

Tables are among the most informative components of documents, because they represent data compactly and intuitively, typically to make it easier to understand. Two needs arise: identifying and extracting tables from documents, and extracting the data they contain. The latter task requires understanding the table structure. Because tables vary widely in style, size, and purpose, purely algorithmic approaches to this task can be insufficient, and machine learning systems may offer an effective solution. This paper proposes a first-order logic representation, able to capture the complex spatial relationships involved in a table structure, together with a learning system that combines the power of this representation with the flexibility of statistical approaches. The encouraging results obtained suggest further investigation and refinement of the proposal.
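
As a rough illustration of what a first-order description of table layout might look like, the sketch below turns a small grid of cells into ground atoms for two spatial relations. It is not the representation used in the paper: the `Cell` type, the `spatial_facts` helper, the predicate names `left_of` and `above`, and the sample table are all hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """A table cell identified by a name and its grid position (hypothetical type)."""
    name: str
    row: int
    col: int
    text: str


def spatial_facts(cells):
    """Return ground atoms (predicate, arg1, arg2) for immediate cell adjacency.

    The predicates left_of/2 and above/2 are illustrative assumptions,
    not the vocabulary used by the authors.
    """
    by_pos = {(c.row, c.col): c for c in cells}
    facts = set()
    for c in cells:
        right = by_pos.get((c.row, c.col + 1))
        below = by_pos.get((c.row + 1, c.col))
        if right is not None:
            facts.add(("left_of", c.name, right.name))
        if below is not None:
            facts.add(("above", c.name, below.name))
    return facts


# A made-up 2x2 table used only to exercise the helper.
cells = [
    Cell("c1", 0, 0, "Year"),
    Cell("c2", 0, 1, "Sales"),
    Cell("c3", 1, 0, "2011"),
    Cell("c4", 1, 1, "42"),
]
facts = spatial_facts(cells)

# Print the atoms in a Prolog-like syntax.
for pred, a, b in sorted(facts):
    print(f"{pred}({a}, {b}).")
```

Relational atoms of this kind can describe tables of any size with the same vocabulary, which is one reason a logic-based encoding may suit layouts that vary as much as tables do.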

Author information

Authors and Affiliations

Dipartimento di Informatica, LACAM Laboratory, Università degli Studi di Bari “Aldo Moro”, Italy
Nicola Di Mauro, Stefano Ferilli & Floriana Esposito
Centro Interdipartimentale per la Logica e sue Applicazioni, Università degli Studi di Bari “Aldo Moro”, Italy
Nicola Di Mauro, Stefano Ferilli & Floriana Esposito

Editor information

Editors and Affiliations

Department of Information Engineering, University of Padua, Via Gradenigo, 6/a, 35131, Padua, Italy
Maristella Agosti & Nicola Ferro
Department of Computer Science, University of Bari, Via E. Orabona, 4, 70126, Bari, Italy
Floriana Esposito & Stefano Ferilli

About this paper

Cite this paper

Di Mauro, N., Ferilli, S., Esposito, F. (2013). Learning to Recognize Critical Cells in Document Tables. In: Agosti, M., Esposito, F., Ferilli, S., Ferro, N. (eds) Digital Libraries and Archives. IRCDL 2012. Communications in Computer and Information Science, vol 354. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35834-0_12

DOI: https://doi.org/10.1007/978-3-642-35834-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35833-3
Online ISBN: 978-3-642-35834-0
eBook Packages: Computer Science, Computer Science (R0)
