Abstract
This study aims to separate the head from the data in web-tables to extract useful information. To achieve this aim, web-tables must be converted into a machine readable form, an attribute-value pair, the relation of which is similar to that of head-body. We have separated meaningful tables and decorative tables in our previous work, because web-tables are used for the purpose of knowledge structuring as well as document design, and only meaningful tables can be used to extract information. In order to extract the semantic relations existing between language contents in a meaningful table, this study separated the head from the body in meaningful tables using machine learning. We (a) established features observing the editing habit of authors and tables themselves, and (b) established a model using machine learning algorithm, C4.5 in order to separate the head from the body. We obtained 86.2% accuracy in extracting the head from the meaningful tables.
This work was supported by the Regional Research Centers Program(Research Center for Logistics Information Technology), granted by the Korean Ministry of Education & Human Resources Development.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Chen, H.H., Tsai, S.C., Tsai, J.H.: Mining Tables from Large Scale HTML Texts. In: Proceedings of 18th International Conference on Computational Linguistics, Saabrucken, Germany (July 2000)
Hurst, M.: Layout and Language: Beyond Simple Text for Information Interaction - Modeling the Table. In: Proceedings of the 2nd International Conference on Multimodal Interfaces, Hong Kong (1999)
Jung, S.W., Park, D.W., Kwon, H.C.: Extracting Web-Table Information Using Decision Tree and Rule Based Approach. In: Asia Information Retrieval Symposium, China, October 2004, pp. 281–284 (2004)
Ning, G., Guowen, W., Xiaoyuan, W., Baile, S.: Extracting web table information in cooperative learning activities based on abstract semantic model. In: The Sixth International Conference Computer Supported Cooperative Work in Design, pp. 492–497 (2001)
Wang, Y., Hu, J.: A Machine Learning Based Approach for Table Detection on The Web. In: Proceedings of The Eleventh International World Wide Web Conference WWW2002, Sheraton Wailili Honolulu, Hawaii, USA, pp. 7–11 (2002)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Pub., San Francisco (2000)
Yang, Y.: Web Table Mining and Database Discovery. M.Sc. thesis, Simon Fraser University (August 2002)
Kushmerick, N., Weld, D.S., Doorenbos, R.: Wrapper Induction for Information Extraction. In: 15th International Joint Conference on Artificial Intelligence (IJCAI 1997), Nagoya (August 1997)
Jung, S.W., Kwon, H.C.: A Scalable Hybrid Approach for Extracting Head Components from Web Tables. IEEE transaction on knowledge and data engineering 18(2) (accepted and to be appeared)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jung, SW., Kwon, HC. (2006). A Machine Learning Based Approach for Separating Head from Body in Web-Tables. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_54
Download citation
DOI: https://doi.org/10.1007/11671299_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer ScienceComputer Science (R0)