Abstract
Recently it was shown that existing general-purpose inductive logic programming systems are useful for learning wrappers (known as L-wrappers) that extract data from HTML documents. Here we propose a formalization of L-wrappers and their patterns, covering their syntax, semantics, and related properties and operations. A mapping of the patterns to a subset of XSLT that has a formal semantics is outlined and demonstrated by an example. The mapping shows how the theory can be applied to obtain efficient wrappers for information extraction from HTML.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bădică, C., Bădică, A. (2005). Logic Wrappers and XSLT Transformations for Tuples Extraction from HTML. In: Bressan, S., et al. Database and XML Technologies. XSym 2005. Lecture Notes in Computer Science, vol 3671. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11547273_13
DOI: https://doi.org/10.1007/11547273_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28583-0
Online ISBN: 978-3-540-31968-9
eBook Packages: Computer Science, Computer Science (R0)