Rule-Based Canonicalization of Arbitrary Tables in Spreadsheets

  • Conference paper
Information and Software Technologies (ICIST 2016)

Abstract

Arbitrary tables in spreadsheets can be an important data source for business intelligence. However, many of them have complex layouts that hinder extracting, transforming, and loading their data into a database. This paper addresses rule-based transformation of data from arbitrary spreadsheet tables into a structured canonical form that can be loaded into a database by standard ETL tools. We propose a system for canonicalizing such tables as an implementation of our methodology for rule-based table analysis and interpretation. The system executes rules written in our specialized rule language, CRL, to recover implicit relationships in a table. Our experimental results show that dedicated CRL programs can be developed for different sets of tables with similar features to automate table canonicalization with high accuracy.
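To make the notion of a canonical form more concrete, the following minimal sketch flattens a simple cross-tab into one record per data cell, pairing each value with its row and column labels. It is not CRL and not the system described in the paper: the function name `canonicalize`, the parameters `header_rows` and `stub_cols`, and the record keys `DATA`, `ROW`, and `COLUMN` are assumptions made for illustration, and a fixed positional heuristic stands in for the rules that a CRL program would supply. The grid is assumed to be rectangular.

```python
from typing import Dict, List, Optional

Grid = List[List[Optional[str]]]


def canonicalize(grid: Grid, header_rows: int = 1, stub_cols: int = 1) -> List[Dict[str, str]]:
    """Flatten a rectangular cross-tab into one record per non-empty data cell.

    Hypothetical helper: the first `header_rows` rows are treated as column
    headers and the first `stub_cols` columns as row headers.
    """
    records: List[Dict[str, str]] = []
    for r in range(header_rows, len(grid)):
        for c in range(stub_cols, len(grid[r])):
            value = grid[r][c]
            if value is None or value == "":
                continue  # skip empty data cells
            # Collect the non-empty labels above and to the left of the cell.
            column_labels = [grid[h][c] for h in range(header_rows) if grid[h][c]]
            row_labels = [grid[r][s] for s in range(stub_cols) if grid[r][s]]
            records.append({
                "DATA": value,
                "ROW": " | ".join(row_labels),
                "COLUMN": " | ".join(column_labels),
            })
    return records


if __name__ == "__main__":
    sample = [
        [None, "2015", "2016"],
        ["Sales", "10", "12"],
        ["Costs", "7", None],
    ]
    for record in canonicalize(sample):
        print(record)
```

Each resulting record is a flat row that a standard ETL tool could load into a database table. Real spreadsheet tables with merged cells, multi-level headers, or footnotes would need layout-specific rules rather than this fixed heuristic.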

Notes

  1. http://yaml.org
  2. http://poi.apache.org
  3. http://snakeyaml.org
  4. http://www.oracle.com/technetwork/java/javase/tech/spec-136004.html
  5. http://jcp.org/en/jsr/detail?id=94
  6. http://drools.org
  7. Downloadable from https://github.com/shigarov/cells-ssdc
  8. Accessible at http://cells.icc.ru:8080/ssdc
  9. http://tango.byu.edu/data

Acknowledgements

We thank Prof. George Nagy and all members of the TANGO research group (http://tango.byu.edu) for providing and discussing the TANGO dataset for our experiments.

This work was financially supported by the Russian Foundation for Basic Research (Grants No. 15-37-20042 and 14-07-00166) and the Council for Grants of the President of the Russian Federation (Grant No. NSh-8081.2016.9). The presented web service for table canonicalization runs on resources of the Shared Equipment Center of the Integrated Information and Computing Network of the Irkutsk Research and Educational Complex (http://net.icc.ru).

Author information

Corresponding author

Correspondence to Alexey O. Shigarov.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Shigarov, A.O., Paramonov, V.V., Belykh, P.V., Bondarev, A.I. (2016). Rule-Based Canonicalization of Arbitrary Tables in Spreadsheets. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2016. Communications in Computer and Information Science, vol 639. Springer, Cham. https://doi.org/10.1007/978-3-319-46254-7_7

  • DOI: https://doi.org/10.1007/978-3-319-46254-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46253-0

  • Online ISBN: 978-3-319-46254-7

  • eBook Packages: Computer Science, Computer Science (R0)
