A Machine Learning Based Approach for Separating Head from Body in Web-Tables

Jung, Sung-Won; Kwon, Hyuk-Chul

doi:10.1007/11671299_54

Sung-Won Jung¹⁷ &
Hyuk-Chul Kwon¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 3878))

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

1407 Accesses
3 Citations

Abstract

This study aims to separate the head from the data in web-tables to extract useful information. To achieve this aim, web-tables must be converted into a machine readable form, an attribute-value pair, the relation of which is similar to that of head-body. We have separated meaningful tables and decorative tables in our previous work, because web-tables are used for the purpose of knowledge structuring as well as document design, and only meaningful tables can be used to extract information. In order to extract the semantic relations existing between language contents in a meaningful table, this study separated the head from the body in meaningful tables using machine learning. We (a) established features observing the editing habit of authors and tables themselves, and (b) established a model using machine learning algorithm, C4.5 in order to separate the head from the body. We obtained 86.2% accuracy in extracting the head from the meaningful tables.

This work was supported by the Regional Research Centers Program(Research Center for Logistics Information Technology), granted by the Korean Ministry of Education & Human Resources Development.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Chen, H.H., Tsai, S.C., Tsai, J.H.: Mining Tables from Large Scale HTML Texts. In: Proceedings of 18th International Conference on Computational Linguistics, Saabrucken, Germany (July 2000)
Google Scholar
Hurst, M.: Layout and Language: Beyond Simple Text for Information Interaction - Modeling the Table. In: Proceedings of the 2nd International Conference on Multimodal Interfaces, Hong Kong (1999)
Google Scholar
Jung, S.W., Park, D.W., Kwon, H.C.: Extracting Web-Table Information Using Decision Tree and Rule Based Approach. In: Asia Information Retrieval Symposium, China, October 2004, pp. 281–284 (2004)
Google Scholar
Ning, G., Guowen, W., Xiaoyuan, W., Baile, S.: Extracting web table information in cooperative learning activities based on abstract semantic model. In: The Sixth International Conference Computer Supported Cooperative Work in Design, pp. 492–497 (2001)
Google Scholar
Wang, Y., Hu, J.: A Machine Learning Based Approach for Table Detection on The Web. In: Proceedings of The Eleventh International World Wide Web Conference WWW2002, Sheraton Wailili Honolulu, Hawaii, USA, pp. 7–11 (2002)
Google Scholar
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Pub., San Francisco (2000)
Google Scholar
Yang, Y.: Web Table Mining and Database Discovery. M.Sc. thesis, Simon Fraser University (August 2002)
Google Scholar
Kushmerick, N., Weld, D.S., Doorenbos, R.: Wrapper Induction for Information Extraction. In: 15th International Joint Conference on Artificial Intelligence (IJCAI 1997), Nagoya (August 1997)
Google Scholar
Jung, S.W., Kwon, H.C.: A Scalable Hybrid Approach for Extracting Head Components from Web Tables. IEEE transaction on knowledge and data engineering 18(2) (accepted and to be appeared)
Google Scholar

Download references

Author information

Authors and Affiliations

Korean Language Processing Lab., Department of Computer Science and Engineering, Pusan National University, Busan, Korea
Sung-Won Jung & Hyuk-Chul Kwon

Authors

Sung-Won Jung
View author publications
You can also search for this author in PubMed Google Scholar
Hyuk-Chul Kwon
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Jung, SW., Kwon, HC. (2006). A Machine Learning Based Approach for Separating Head from Body in Web-Tables. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_54

Download citation

DOI: https://doi.org/10.1007/11671299_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics