Extracting Knowledge from Web Tables Based on DOM Tree Similarity

Wu, Xiaolong; Cao, Cungen; Wang, Ya; Fu, Jianhui; Wang, Shi

doi:10.1007/978-3-319-47650-6_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9983)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Extracting structured and semi-structured knowledge from Web tables is an important way to obtain high-quality knowledge. Unlike most extraction methods, which rely on external knowledge bases to understand tables, our method uses the inherent similarities among tables to determine their semantic structure. Based on a comprehensive analysis of the various forms of table structure, we propose a novel way to calculate the DOM tree similarity between Web tables using dynamic time warping (DTW), and we cluster tables by this similarity. In experiments on a corpus of 5000 randomly extracted Wikipedia tables, the clustering result is close to the classification obtained with empirical approaches, and the knowledge extracted from the tables is of satisfactory quality even though no external knowledge base is used.
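The abstract does not state how a DOM tree is turned into a sequence or which local cost the DTW alignment uses, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes (hypothetically) that each table is flattened into one "row signature" per `<tr>`, namely the tuple of its `th`/`td` cell tags, and that two rows are compared by the fraction of mismatched cell positions. The names `RowSignatureParser`, `row_signatures`, `row_cost`, and `dtw_distance` are illustrative and do not come from the paper.

```python
from html.parser import HTMLParser


class RowSignatureParser(HTMLParser):
    """Collect, for each <tr>, the tuple of cell tags (th/td) it contains."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one tuple of cell tags per table row
        self._current = None  # cells of the row being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush()     # tolerate a missing </tr>
            self._current = []
        elif tag in ("th", "td") and self._current is not None:
            self._current.append(tag)

    def handle_endtag(self, tag):
        if tag in ("tr", "table"):
            self._flush()

    def _flush(self):
        if self._current is not None:
            self.rows.append(tuple(self._current))
            self._current = None


def row_signatures(html):
    """Flatten a table's markup into a sequence of row signatures."""
    parser = RowSignatureParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def row_cost(a, b):
    """Assumed local cost: fraction of cell positions that do not match."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 1.0 - same / longest


def dtw_distance(seq_a, seq_b, cost=row_cost):
    """Classic dynamic-programming DTW; returns the accumulated cost."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Under these assumptions, two tables with the same header row and different numbers of identically shaped data rows get distance 0, because DTW can stretch the shorter sequence; a table whose headers run down the first column gets a positive distance. A pairwise matrix of such distances could then be passed to any distance-based clustering procedure.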

Notes

1. http://cn.mathworks.com/help/stats/examples/non-classical-multidimensional-scaling.html

Acknowledgment

This work is supported by the National Science Foundation of China (under grant Nos. 91224006 and 61173063) and the Ministry of Science and Technology (under grant No. 201303107).

Author information

Authors and Affiliations

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xiaolong Wu, Cungen Cao, Ya Wang, Jianhui Fu & Shi Wang
University of Chinese Academy of Sciences, Beijing, China
Xiaolong Wu

Corresponding author

Correspondence to Xiaolong Wu.

Editor information

Editors and Affiliations

University of Passau, Passau, Germany
Franz Lehner
University of Passau, Passau, Germany
Nora Fteimi

About this paper

Cite this paper

Wu, X., Cao, C., Wang, Y., Fu, J., Wang, S. (2016). Extracting Knowledge from Web Tables Based on DOM Tree Similarity. In: Lehner, F., Fteimi, N. (eds) Knowledge Science, Engineering and Management. KSEM 2016. Lecture Notes in Computer Science, vol 9983. Springer, Cham. https://doi.org/10.1007/978-3-319-47650-6_24

DOI: https://doi.org/10.1007/978-3-319-47650-6_24
Published: 05 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47649-0
Online ISBN: 978-3-319-47650-6
eBook Packages: Computer Science, Computer Science (R0)
