Hybrid Approach to Extracting Information from Web-Tables

Jung, Sung-won; Kang, Mi-young; Kwon, Hyuk-chul

doi:10.1007/11940098_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4285)

Included in the following conference series: International Conference on Computer Processing of Oriental Languages

Abstract

This study concerns the extraction of information from tables in HTML documents. In our previous work, as a prerequisite for such extraction, we constructed algorithms for separating meaningful tables from decorative ones, because only meaningful tables can be used to extract information and a preponderance of decorative tables in the training data harms the learning result. To extract information, this study separates the head from the body in meaningful tables by extending the head-extraction algorithm constructed in our previous work, using the C4.5 machine learning algorithm, and sets up heuristics for table-schema extraction from meaningful tables by analyzing their heads. In addition, table information is extracted as triples by determining the relation between the data and the extracted table schema. We obtained 71.2% accuracy in extracting table schemata and information from the meaningful tables.
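
To make the notion of "table information in triples" concrete, the following is a minimal Python sketch, not the authors' method. It assumes the head has already been separated from the body (here simply passed in as a hypothetical `head_rows` count rather than learned with C4.5), that the first column identifies each body row, and that a multi-row head can be flattened by joining the labels stacked in each column. The function name `extract_triples` and the `(entity, attribute, value)` layout are illustrative assumptions.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def extract_triples(table: List[List[str]], head_rows: int = 1) -> List[Triple]:
    """Turn a table with a row head into (entity, attribute, value) triples.

    Assumptions (not taken from the paper): the head occupies the first
    `head_rows` rows, the first column names the entity of each body row,
    and stacked head cells in one column are joined with '/'.
    """
    if head_rows < 1 or len(table) <= head_rows:
        return []

    head, body = table[:head_rows], table[head_rows:]
    width = max(len(row) for row in head)

    # Schema: one attribute label per column, built from the non-empty
    # head cells stacked in that column.
    schema = [
        "/".join(row[c] for row in head if c < len(row) and row[c])
        for c in range(width)
    ]

    triples: List[Triple] = []
    for row in body:
        if not row:
            continue
        entity = row[0]
        # Pair each remaining cell with the attribute of its column.
        for attribute, value in zip(schema[1:], row[1:]):
            if value:  # skip empty cells
                triples.append((entity, attribute, value))
    return triples


if __name__ == "__main__":
    demo = [
        ["City", "Population", "Area"],
        ["Busan", "3.4M", "770"],
        ["Seoul", "9.7M", ""],
    ]
    for triple in extract_triples(demo):
        print(triple)
```

The sketch covers only the last step described in the abstract, relating body cells to an already extracted schema. Deciding which tables are meaningful and where the head ends is the part the study addresses with learned classifiers and heuristics.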

References

Chen, H.H., Tsai, S.C., Tsai, J.H.: Mining Tables from Large Scale HTML Texts. In: Proceedings of the 18th International Conference on Computational Linguistics, Saarbrücken, Germany (July 2000)
Hurst, M.: Layout and Language: Beyond Simple Text for Information Interaction - Modeling the Table. In: Proceedings of the 2nd International Conference on Multimodal Interfaces, Hong Kong (1999)
Jung, S.W., Kwon, H.C.: A Scalable Hybrid Approach for Extracting Head Components from Web Tables. IEEE Transactions on Knowledge and Data Engineering 18(2) (accepted, to appear)
Kushmerick, N., Weld, D.S., Doorenbos, R.: Wrapper Induction for Information Extraction. In: 15th International Joint Conference on Artificial Intelligence (IJCAI 1997), Nagoya (August 1997)
Ning, G., Guowen, W., Xiaoyuan, W., Baile, S.: Extracting web table information in cooperative learning activities based on abstract semantic model. In: Computer Supported Cooperative Work in Design, The Sixth International Conference, pp. 492–497 (2001)
Wang, Y., Hu, J.: A Machine Learning Based Approach for Table Detection on the Web. In: Proceedings of the Eleventh International World Wide Web Conference (WWW 2002), Sheraton Waikiki, Honolulu, Hawaii, USA, pp. 7–11 (2002)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Pub., San Francisco (2000)
Yang, Y.: Web Table Mining and Database Discovery. M.Sc. thesis, Simon Fraser University (August 2002)

Author information

Authors and Affiliations

Korean Language Processing Laboratory, Department of Computer Science Engineering, Pusan National University,
Sung-won Jung, Mi-young Kang & Hyuk-chul Kwon
Center for U-Port IT Research and Education, Pusan National University, Jangjeon-dong, Geumjeong-gu, 609-735, Busan, Korea
Sung-won Jung, Mi-young Kang & Hyuk-chul Kwon

Editor information

Editors and Affiliations

Graduate School of Information Science, Nara Institute of Science and Technology, 630-0192, Takayama, Ikoma, Nara, Japan
Yuji Matsumoto
Dept of ECE, University of Illinois at Urbana Champaign, IL 61801, Urbana, USA
Richard W. Sproat
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
State Key Lab of Intelligent Tech. & Sys., Tsinghua University,
Min Zhang

About this paper

Cite this paper

Jung, Sw., Kang, My., Kwon, Hc. (2006). Hybrid Approach to Extracting Information from Web-Tables. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_11

DOI: https://doi.org/10.1007/11940098_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer Science, Computer Science (R0)
