Logic Wrappers and XSLT Transformations for Tuples Extraction from HTML

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3671)

Abstract

It was recently shown that existing general-purpose inductive logic programming systems are useful for learning wrappers, known as L-wrappers, that extract data from HTML documents. Here we propose a formalization of L-wrappers and their patterns, covering their syntax, semantics, and related properties and operations. We outline a mapping of the patterns to a subset of XSLT that has a formal semantics and demonstrate it with an example. The mapping shows how the theory can be applied to obtain efficient wrappers for information extraction from HTML.
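To make the notion of tuple extraction from HTML concrete, the following is a minimal sketch in Python. It is not the L-wrapper formalism or the XSLT mapping proposed in the paper; it only illustrates the general task of pulling relational tuples out of a document tree. The sample markup, the two-column row pattern, and the names DOC and extract_tuples are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment (an assumption, not taken from the paper).
DOC = """
<html><body><table>
  <tr><td>Printer A</td><td>120</td></tr>
  <tr><td>Printer B</td><td>95</td></tr>
</table></body></html>
"""

def extract_tuples(xhtml):
    """Return one (name, price) tuple per table row that has exactly two cells."""
    root = ET.fromstring(xhtml.strip())
    tuples = []
    # Visit every <tr> node in document order.
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Hand-written pattern: a row matches only if it has exactly two <td> children.
        if len(cells) == 2:
            name = (cells[0].text or "").strip()
            price = (cells[1].text or "").strip()
            tuples.append((name, price))
    return tuples

TUPLES = extract_tuples(DOC)
print(TUPLES)
```

In this sketch the extraction pattern is written by hand as a fixed tree condition. The abstract describes a different route: the patterns are learned by an inductive logic programming system and then mapped to XSLT.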

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bădică, C., Bădică, A. (2005). Logic Wrappers and XSLT Transformations for Tuples Extraction from HTML. In: Bressan, S., et al. Database and XML Technologies. XSym 2005. Lecture Notes in Computer Science, vol 3671. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11547273_13

  • DOI: https://doi.org/10.1007/11547273_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28583-0

  • Online ISBN: 978-3-540-31968-9

  • eBook Packages: Computer Science, Computer Science (R0)
