Rule Learning for Feature Values Extraction from HTML Product Information Sheets

Bădică, Costin; Bădică, Amelia

doi:10.1007/978-3-540-30504-0_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3323)

Included in the following conference series:

International Workshop on Rules and Rule Markup Languages for the Semantic Web

Abstract

The Web is now a huge information repository with a rich semantic structure that is, however, addressed primarily to human understanding rather than to automated processing by a computer. Collecting product information from the Web and organizing it in a form suitable for automated machine processing is a primary task of software shopping agents and has received considerable attention in recent years. In this paper we assume that product information is represented as a set of feature-value pairs contained in an HTML product information sheet, which is usually formatted using HTML tables. The paper presents a technique for learning rules that extract product information from such sheets. The technique exploits the fact that the Web pages representing the product information of a given producer are generated on the fly from the producer's database and therefore exhibit a uniform structure. Consequently, once a human user has manually performed the extraction task for a few information items, a general-purpose inductive learner (we used FOIL in our experiments) can learn extraction rules that are then applied to the current and to other product information sheets to extract further items automatically. The input to the learning algorithm is a relational description of the HTML document tree that defines the types of the HTML tree nodes and the relationships between them. The approach is demonstrated with appropriate examples, experimental results, and software tools.
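To make the idea of a relational description of the HTML document tree more concrete, the following minimal Python sketch turns a small table into facts about node types and node relationships, and then applies one extraction rule to them. Everything specific here is an assumption made for illustration and is not taken from the paper: the predicate names `type`, `child`, `next`, and `content`, the sample sheet, and the rule itself, which is written by hand as a stand-in for the kind of rule an inductive learner such as FOIL might produce.

```python
from html.parser import HTMLParser


class TreeToFacts(HTMLParser):
    """Turn an HTML fragment into relational facts about its document tree.

    Predicate names (type, child, next, content) are illustrative assumptions.
    """

    def __init__(self):
        super().__init__()
        self.facts = set()     # 3-tuples such as ("child", parent_id, child_id)
        self.stack = [0]       # ids of the currently open nodes; 0 is a virtual root
        self.last_child = {}   # parent id -> id of its most recent child
        self.next_id = 1

    def _add_node(self, node_type):
        node, parent = self.next_id, self.stack[-1]
        self.next_id += 1
        self.facts.add(("type", node, node_type))
        self.facts.add(("child", parent, node))
        if parent in self.last_child:
            # Link the previous sibling to the new node.
            self.facts.add(("next", self.last_child[parent], node))
        self.last_child[parent] = node
        return node

    def handle_starttag(self, tag, attrs):
        self.stack.append(self._add_node(tag))

    def handle_endtag(self, tag):
        # Assumes well-formed markup: every start tag has a matching end tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            node = self._add_node("text")
            self.facts.add(("content", node, text))


def extract_value(facts, feature):
    """Hand-written rule: the value is the text inside the cell that
    immediately follows the td cell whose text equals `feature`."""
    parent = {c: p for (rel, p, c) in facts if rel == "child"}
    following = {a: b for (rel, a, b) in facts if rel == "next"}
    node_type = {n: t for (rel, n, t) in facts if rel == "type"}
    content = {n: s for (rel, n, s) in facts if rel == "content"}
    for label_text, label in content.items():
        if label != feature:
            continue
        label_cell = parent[label_text]
        value_cell = following.get(label_cell)
        if node_type.get(label_cell) != "td" or value_cell is None:
            continue
        for value_text, value in content.items():
            if parent[value_text] == value_cell:
                return value
    return None


# Hypothetical product information sheet with two feature-value pairs.
SHEET = """
<table>
  <tr><td>Power</td><td>150 hp</td></tr>
  <tr><td>Weight</td><td>3200 kg</td></tr>
</table>
"""

parser = TreeToFacts()
parser.feed(SHEET)
parser.close()
print(extract_value(parser.facts, "Weight"))  # 3200 kg
```

Because the rule refers only to node types and structural relations, not to positions in the raw text, the same rule would apply unchanged to any other sheet generated from the same template, which is the uniformity the technique relies on.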

The research described here was partly supported with funding from Syncro Soft (http://www.oxygenxml.com).

Author information

Authors and Affiliations

Software Engineering Department, University of Craiova, Bvd.Decebal 107, Craiova, 200440, Romania
Costin Bădică
Business Information Systems Department, University of Craiova, A.I.Cuza 13, Craiova, RO-1100, Romania
Amelia Bădică

Editor information

Editors and Affiliations

Computer Science Department of University of Crete, Greece
Grigoris Antoniou
Institute for Information Technology – e-Business, National Research Council of Canada, E3B 9W4, Fredericton, NB, Canada
Harold Boley

About this paper

Cite this paper

Bădică, C., Bădică, A. (2004). Rule Learning for Feature Values Extraction from HTML Product Information Sheets. In: Antoniou, G., Boley, H. (eds) Rules and Rule Markup Languages for the Semantic Web. RuleML 2004. Lecture Notes in Computer Science, vol 3323. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30504-0_4

DOI: https://doi.org/10.1007/978-3-540-30504-0_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23842-3
Online ISBN: 978-3-540-30504-0
eBook Packages: Springer Book Archive
