A New Path Generalization Algorithm for HTML Wrapper Induction

  • Chapter
Advances in Web Intelligence and Data Mining

Part of the book series: Studies in Computational Intelligence (SCI, volume 23)

Summary

Recently it was shown that Inductive Logic Programming can be successfully applied to data extraction from HTML. However, the approach suffers from two problems: high computational complexity with respect to the number of nodes of the target document and to the arity of the extracted tuples. In this note we address the first problem by proposing an efficient path generalization algorithm for learning rules to extract single information items. The presentation is supplemented with a description of a sample experiment.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Bădică, C., Bădică, A., Popescu, E. (2006). A New Path Generalization Algorithm for HTML Wrapper Induction. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_2

  • DOI: https://doi.org/10.1007/3-540-33880-2_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33879-6

  • Online ISBN: 978-3-540-33880-2

  • eBook Packages: Engineering, Engineering (R0)
