Cell Classification for Layout Recognition in Spreadsheets

Koci, Elvis; Thiele, Maik; Romero, Oscar; Lehner, Wolfgang

doi:10.1007/978-3-319-99701-8_4

Conference paper
First Online: 14 November 2018

Part of the book series: Communications in Computer and Information Science (CCIS, volume 914)

Abstract

Spreadsheets constitute a notably large and valuable body of documents in enterprise settings and on the Web. Although spreadsheets are intuitive to use and equipped with powerful functionality, extracting and reusing data from them remains a cumbersome and mostly manual task. Their greatest strength, the large degree of freedom they give the user, is also their greatest weakness, since data can be arbitrarily structured. Therefore, in this paper we propose a supervised learning approach for layout recognition in spreadsheets. We work at the cell level, aiming to predict the correct layout role of each cell out of five predefined alternatives. For this task we consider a large number of features not covered by related work. Moreover, we gather a considerably large dataset of annotated cells from spreadsheets exhibiting variability in format and content. Our experiments, with five different classification algorithms, show that we can predict cell layout roles with high accuracy. We then focus on revising the classification results, with the aim of repairing misclassifications. We propose a sophisticated approach, composed of three steps, which effectively corrects a reasonable number of inaccurate predictions.

Acknowledgments

This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence - Doctoral College” (IT4BI-DC).

Author information

Authors and Affiliations

Database Technology Group, Department of Computer Science, Technische Universität Dresden, Dresden, Germany
Elvis Koci, Maik Thiele & Wolfgang Lehner
Departament d’Enginyeria de Serveis i Sistemes d’Informació, Universitat Politècnica de Catalunya, Barcelona, Spain
Elvis Koci & Oscar Romero

Corresponding author

Correspondence to Elvis Koci.

Editor information

Editors and Affiliations

Instituto de Telecomunicações, Lisbon, Portugal
Ana Fred
Department of Software Technology, Delft University of Technology, Voorburg, Zuid-Holland, The Netherlands
Jan Dietz
Faculty of Exact Sciences and Engineering, University of Madeira, Funchal, Portugal
David Aveiro
Henley Business School, University of Reading, Reading, UK
Kecheng Liu
University of Coimbra, Coimbra, Portugal
Jorge Bernardino
Instituto Politecnico de Setúbal (IPS), Setúbal, Portugal
Joaquim Filipe

Cite this paper

Koci, E., Thiele, M., Romero, O., Lehner, W. (2019). Cell Classification for Layout Recognition in Spreadsheets. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Bernardino, J., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2016. Communications in Computer and Information Science, vol 914. Springer, Cham. https://doi.org/10.1007/978-3-319-99701-8_4

DOI: https://doi.org/10.1007/978-3-319-99701-8_4
Published: 14 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99700-1
Online ISBN: 978-3-319-99701-8
eBook Packages: Computer Science, Computer Science (R0)