Abstract
This study concerns the extraction of information from tables in HTML documents. In our previous work, as a prerequisite for this task, we constructed algorithms that separate meaningful tables from decorative tables, because only meaningful tables can be used to extract information and a preponderance of decorative tables in the training data harms the learning result. In this study, we separate the head from the body of meaningful tables by extending the head-extraction algorithm constructed in our previous work, using the C4.5 machine learning algorithm, and we set up heuristics for extracting the table schema from meaningful tables by analyzing their head(s). In addition, we extract table information as triples by determining the relation between the data and the extracted table schema. We obtained 71.2% accuracy in extracting table schemata and information from the meaningful tables.
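To make the notion of "table information in triples" concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the head/body boundary is already known (here passed in as the hypothetical parameters `head_rows` and `head_cols`, which in the paper would come from the head-extraction step) and that the table has been flattened into a rectangular grid of strings. The function name `extract_triples` and the toy data are illustrative assumptions.

```python
from typing import List, Tuple

# A triple pairs a body cell with the row-head and column-head labels that describe it.
Triple = Tuple[str, str, str]


def extract_triples(grid: List[List[str]],
                    head_rows: int = 1,
                    head_cols: int = 1) -> List[Triple]:
    """Return (row label, column label, value) triples for every body cell.

    Assumption: the first `head_rows` rows and first `head_cols` columns
    form the table head; everything else is the body.
    """
    triples: List[Triple] = []
    for r in range(head_rows, len(grid)):
        # Row label: the head cells at the start of this row, joined if nested.
        row_label = " / ".join(grid[r][:head_cols])
        for c in range(head_cols, len(grid[r])):
            # Column label: the head cells above this column, joined if nested.
            col_label = " / ".join(grid[h][c] for h in range(head_rows))
            triples.append((row_label, col_label, grid[r][c]))
    return triples


# Toy table (made-up values) with one head row and one head column.
table = [
    ["Item",   "Price", "Stock"],
    ["Pencil", "1.20",  "300"],
    ["Eraser", "0.80",  "150"],
]

triples = extract_triples(table)
for t in triples:
    print(t)
```

With a fixed head boundary the mapping is mechanical; the difficult part addressed by the paper is deciding where the head ends and how its cells form a schema, which this sketch deliberately takes as given.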
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jung, Sw., Kang, My., Kwon, Hc. (2006). Hybrid Approach to Extracting Information from Web-Tables. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_11
DOI: https://doi.org/10.1007/11940098_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer Science (R0)