Learning to Recognize Critical Cells in Document Tables

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 354)

Abstract

Tables are among the most informative components of documents, because they represent data compactly and intuitively, typically to make it easier to understand. Two needs arise: on the one hand, identifying and extracting tables from documents; on the other hand, extracting the data they contain. The latter task requires understanding the table structure. Due to the variability in the style, size, and aims of tables, algorithmic approaches to this task can be insufficient, and machine learning systems may represent an effective solution. This paper proposes a first-order logic representation that is able to capture the complex spatial relationships involved in a table structure, together with a learning system that can combine the power of this representation with the flexibility of statistical approaches. The encouraging results obtained suggest further investigation and refinement of the proposal.
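
As a rough illustration of what a relational (first-order) description of a table's spatial structure might look like, the sketch below turns cells on a grid into ground atoms such as left_of(c1, c2). The Cell type, the predicate names left_of and above, and the sample table are all assumptions made for illustration; they are not taken from the paper, and the learning system itself is not shown.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """A table cell placed on a grid (hypothetical structure)."""
    name: str
    row: int
    col: int
    text: str


def spatial_facts(cells):
    """Return ground atoms (predicate, arg1, arg2) for adjacent cells.

    Only immediate horizontal and vertical neighbours are related;
    the predicate names are illustrative assumptions.
    """
    by_pos = {(c.row, c.col): c for c in cells}
    facts = set()
    for c in cells:
        right = by_pos.get((c.row, c.col + 1))
        below = by_pos.get((c.row + 1, c.col))
        if right is not None:
            facts.add(("left_of", c.name, right.name))
        if below is not None:
            facts.add(("above", c.name, below.name))
    return facts


# A made-up 2x2 table: one header row and one data row.
cells = [
    Cell("c1", 0, 0, "Year"),
    Cell("c2", 0, 1, "Sales"),
    Cell("c3", 1, 0, "2011"),
    Cell("c4", 1, 1, "42"),
]
facts = spatial_facts(cells)
for predicate, a, b in sorted(facts):
    print(f"{predicate}({a}, {b})")
```

Atoms of this kind describe relationships between cells rather than fixed-length feature vectors, which is the sort of structure a relational learner can work with.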

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Di Mauro, N., Ferilli, S., Esposito, F. (2013). Learning to Recognize Critical Cells in Document Tables. In: Agosti, M., Esposito, F., Ferilli, S., Ferro, N. (eds) Digital Libraries and Archives. IRCDL 2012. Communications in Computer and Information Science, vol 354. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35834-0_12

  • DOI: https://doi.org/10.1007/978-3-642-35834-0_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35833-3

  • Online ISBN: 978-3-642-35834-0

  • eBook Packages: Computer Science (R0)
