Abstract
Arbitrary tables presented in spreadsheets can be an important data source in business intelligence. However, many of them have complex layouts that hinder extracting, transforming, and loading their data into a database. This paper addresses rule-based transformation of data from arbitrary spreadsheet tables into a structured canonical form that can be loaded into a database by regular ETL tools. We propose a system for canonicalization of arbitrary spreadsheet tables as an implementation of our methodology for rule-based table analysis and interpretation. The system executes rules expressed in our specialized rule language, CRL, to recover implicit relationships in a table. Our experimental results show that dedicated CRL programs can be developed for different sets of tables with similar features to automate table canonicalization with high accuracy.
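To make the idea of a canonical form concrete, the following minimal Python sketch flattens a small cross-tabulated table into one record per data cell. It is an illustration only: it does not use CRL or the authors' system, and the function name `canonicalize`, the parameters `header_rows` and `stub_cols`, the field names `DATA`, `ROW`, and `COLUMN`, and the sample values are all assumptions made for this example.

```python
from typing import Dict, List, Optional

Grid = List[List[Optional[str]]]


def canonicalize(grid: Grid, header_rows: int, stub_cols: int) -> List[Dict[str, str]]:
    """Flatten a cross-tab grid into one record per non-empty data cell.

    Hypothetical helper: assumes the first `header_rows` rows are column
    headers and the first `stub_cols` columns are row headers.
    """
    # Spread merged header cells (stored as None) to the right, so every
    # data column knows the full path of column headers above it.
    headers: List[List[Optional[str]]] = []
    for row in grid[:header_rows]:
        filled: List[Optional[str]] = [None] * stub_cols
        last: Optional[str] = None
        for cell in row[stub_cols:]:
            if cell is not None:
                last = cell
            filled.append(last)
        headers.append(filled)

    records: List[Dict[str, str]] = []
    for row in grid[header_rows:]:
        row_label = " | ".join(c for c in row[:stub_cols] if c is not None)
        for j in range(stub_cols, len(row)):
            if row[j] is None:
                continue  # skip empty data cells
            col_label = " | ".join(h[j] for h in headers if h[j] is not None)
            records.append({"DATA": row[j], "ROW": row_label, "COLUMN": col_label})
    return records


# Made-up sample table: two header rows (year, quarter) and one stub column.
example: Grid = [
    [None, "2014", None, "2015", None],
    [None, "Q1", "Q2", "Q1", "Q2"],
    ["Sales", "10", "12", "11", "15"],
    ["Costs", "7", "8", "9", "9"],
]

records = canonicalize(example, header_rows=2, stub_cols=1)
for record in records:
    print(record)
```

Each resulting record pairs a data value with the row and column labels it depends on, which is the kind of flat structure a regular ETL tool can load directly. Real spreadsheet tables rarely announce their header and stub sizes in advance, which is where layout-specific rules become necessary.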
Notes
- 7. Downloadable from https://github.com/shigarov/cells-ssdc.
- 8. Accessible at http://cells.icc.ru:8080/ssdc.
Acknowledgements
We thank Prof. George Nagy and all members of the TANGO research group (http://tango.byu.edu) for providing and discussing the TANGO dataset for our experiments.
This work was financially supported by the Russian Foundation for Basic Research (Grant Nos. 15-37-20042 and 14-07-00166) and the Council for Grants of the President of the Russian Federation (Grant No. NSh-8081.2016.9). The presented web service for table canonicalization runs on resources of the Shared Equipment Center of the Integrated Information and Computing Network of Irkutsk Research and Educational Complex (http://net.icc.ru).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Shigarov, A.O., Paramonov, V.V., Belykh, P.V., Bondarev, A.I. (2016). Rule-Based Canonicalization of Arbitrary Tables in Spreadsheets. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2016. Communications in Computer and Information Science, vol 639. Springer, Cham. https://doi.org/10.1007/978-3-319-46254-7_7
DOI: https://doi.org/10.1007/978-3-319-46254-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46253-0
Online ISBN: 978-3-319-46254-7
eBook Packages: Computer Science, Computer Science (R0)