Extracting Knowledge from Web Tables Based on DOM Tree Similarity

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9983)

Abstract

Extracting structured or semi-structured knowledge from Web tables is an important way to obtain high-quality knowledge. Unlike most extraction methods, which rely on external knowledge bases to understand tables, our method uses the inherent similarities among tables to determine their semantic structure. Based on a comprehensive analysis of the various forms that table structures take, we provide a novel way to calculate the DOM tree similarity between Web tables using dynamic time warping (DTW) and to cluster the tables accordingly. Experiments on a corpus of 5000 randomly extracted Wikipedia tables show that the clustering result is close to that of classification based on empirical approaches and that, without the use of external knowledge bases, the quality of the knowledge extracted from the tables is satisfactory.
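
The abstract does not say how a table's DOM tree is serialized or what local cost is used, so the following is only a minimal Python sketch of the general idea of comparing tables with DTW, not the authors' implementation. It assumes each table is reduced to a sequence of rows, each row to the tuple of its cell tags (th or td), and that two rows are compared by the fraction of positions at which they differ. The names RowTagCollector, row_signatures, row_cost and dtw_distance are hypothetical, and the clustering step is omitted.

```python
from html.parser import HTMLParser


class RowTagCollector(HTMLParser):
    """Collect, for each <tr>, the tuple of cell tags (th/td) it contains.

    Assumes well-formed markup with explicit </tr> tags and no nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []
        elif tag in ("th", "td") and self._current is not None:
            self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append(tuple(self._current))
            self._current = None


def row_signatures(html):
    """Return the list of per-row tag tuples for one table's HTML."""
    parser = RowTagCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


def row_cost(a, b):
    """Assumed local cost: fraction of cell positions where two rows differ."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


def dtw_distance(seq_a, seq_b, cost=row_cost):
    """Standard dynamic time warping distance between two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


# Two tables with a header row on top, differing only in body length.
body_row = "<tr><td>x</td><td>y</td></tr>"
header_row = "<tr><th>A</th><th>B</th></tr>"
table_a = "<table>" + header_row + body_row * 2 + "</table>"
table_b = "<table>" + header_row + body_row * 4 + "</table>"
# An attribute-value table whose header cells run down the first column.
table_c = "<table>" + "<tr><th>k</th><td>v</td></tr>" * 3 + "</table>"

dist_ab = dtw_distance(row_signatures(table_a), row_signatures(table_b))
dist_ac = dtw_distance(row_signatures(table_a), row_signatures(table_c))
print(dist_ab, dist_ac)
```

Under these assumptions the two column-header tables are at distance 0.0 even though they have different numbers of body rows, because DTW lets one row align with several, while the row-header table is at distance 1.5. A matrix of such pairwise distances is the kind of input a clustering algorithm would need.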

Notes

  1. http://cn.mathworks.com/help/stats/examples/non-classical-multidimensional-scaling.html.

Acknowledgment

This work is supported by the National Science Foundation of China (under grant Nos. 91224006 and 61173063) and the Ministry of Science and Technology (under grant No. 201303107).

Author information

Corresponding author

Correspondence to Xiaolong Wu.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Wu, X., Cao, C., Wang, Y., Fu, J., Wang, S. (2016). Extracting Knowledge from Web Tables Based on DOM Tree Similarity. In: Lehner, F., Fteimi, N. (eds) Knowledge Science, Engineering and Management. KSEM 2016. Lecture Notes in Computer Science, vol 9983. Springer, Cham. https://doi.org/10.1007/978-3-319-47650-6_24

  • DOI: https://doi.org/10.1007/978-3-319-47650-6_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47649-0

  • Online ISBN: 978-3-319-47650-6

  • eBook Packages: Computer Science, Computer Science (R0)
