Logic Wrappers and XSLT Transformations for Tuples Extraction from HTML

Bădică, Costin; Bădică, Amelia

doi:10.1007/11547273_13


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3671)

Abstract

It was recently shown that existing general-purpose inductive logic programming systems are useful for learning wrappers (known as L-wrappers) that extract data from HTML documents. Here we propose a formalization of L-wrappers and their patterns, covering their syntax, semantics, related properties, and operations. A mapping of the patterns to a subset of XSLT that has a formal semantics is outlined and demonstrated by an example. The mapping shows how the theory can be applied to obtain efficient wrappers for information extraction from HTML.
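
As a rough illustration of what "tuples extraction" from an HTML tree means, the sketch below applies a hand-written tree pattern to a small, well-formed HTML fragment and returns one tuple per matching row. It is not the L-wrapper formalism or the XSLT mapping from the paper. The fragment, the field meanings (name, price), and the function name `extract_tuples` are all assumptions made for this example, which uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML fragment (not taken from the paper).
HTML = """
<html><body>
  <table>
    <tr><td>Printer A</td><td>120</td></tr>
    <tr><td>Printer B</td><td>95</td></tr>
  </table>
</body></html>
"""


def extract_tuples(html_text):
    """Return (name, price) tuples, one per table row with two or more cells."""
    root = ET.fromstring(html_text)
    tuples = []
    # Assumed pattern: a 'tr' node whose first and second 'td' children
    # hold the two fields of the tuple.
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2:
            name = (cells[0].text or "").strip()
            price = (cells[1].text or "").strip()
            tuples.append((name, price))
    return tuples


print(extract_tuples(HTML))
```

Real-world HTML is often not well-formed XML, so a pattern like this would normally run on a document that has first been cleaned into XHTML; that preprocessing step is outside this sketch.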

Author information

Authors and Affiliations

Software Engineering Department, University of Craiova, Bvd.Decebal 107, Craiova, RO, 200440, Romania
Costin Bădică
Business Information Systems Department, University of Craiova, A.I.Cuza 13, Craiova, RO, 200585, Romania
Amelia Bădică

Editor information

Editors and Affiliations

School of Computing, National University of Singapore
Stéphane Bressan
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza L. Da Vinci, 32, I20133, Milano, Italy
Stefano Ceri
Department of Computer Science, ETH Zurich, Switzerland
Ela Hunt
Department of Computer and Information Science, University of Pennsylvania, 19104, Philadelphia, PA, USA
Zachary G. Ives
LIRMM - UMR 5506 CNRS, Université Montpellier 2, 161 Rue Ada, F-34392, Montpellier Cedex 5
Zohra Bellahsène
Microsoft Corporation, One Microsoft Way, 98052, Redmond, WA, USA
Michael Rys
Universität Duisburg-Essen
Rainer Unland

Cite this paper

Bădică, C., Bădică, A. (2005). Logic Wrappers and XSLT Transformations for Tuples Extraction from HTML. In: Bressan, S., et al. Database and XML Technologies. XSym 2005. Lecture Notes in Computer Science, vol 3671. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11547273_13

DOI: https://doi.org/10.1007/11547273_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28583-0
Online ISBN: 978-3-540-31968-9
eBook Packages: Computer Science, Computer Science (R0)