Abstract
A new data model for filtering semi-structured texts is presented. Given positive and negative examples of HTML pages, labelled by a labelling function, each page is divided into a set of paths using an XML parser. A path is a sequence of element nodes and text nodes in which a text node appears only at the tail. The label of an element node is called a tag, and the label of a text node is called a text. The goal of a mining algorithm is to find an interesting pattern, called an association path, which is a pair of a tag-sequence t and a word-sequence w represented by a word-association pattern [1]. An association path (t,w) agrees with a labelling function on a path p if the following holds: t is a subsequence of the tag-sequence of p and w matches the text of p if and only if p is in a positive example. The importance of an association path α is measured by its agreement with the labelling function on the given data, i.e., the number of paths on which α agrees with the labelling function. We present a mining algorithm for this problem and demonstrate the efficiency of the model through experiments.
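The definitions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' mining algorithm: it only extracts paths from a page and counts the agreement of one candidate association path. All names (`PathExtractor`, `extract_paths`, `is_subsequence`, `matches_words`, `agreement`) are hypothetical. Two simplifying assumptions are made: the word-association pattern is approximated as a word sequence whose consecutive matches lie at most `k` tokens apart, and every path inherits the label of the page it came from.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathExtractor(HTMLParser):
    """Collect one (tag_sequence, text) pair per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open element tags, root first
        self.paths = []   # extracted (tags, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append((tuple(self.stack), text))


def extract_paths(html):
    parser = PathExtractor()
    parser.feed(html)
    parser.close()
    return parser.paths


def is_subsequence(pattern, sequence):
    """True if pattern can be obtained from sequence by deleting items."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def matches_words(words, text, k):
    """Assumed simplification of a word-association pattern: the words occur
    in order, each at most k tokens after the previously matched word."""
    if not words:
        return True
    tokens = text.lower().split()
    reachable = {i for i, tok in enumerate(tokens) if tok == words[0]}
    for word in words[1:]:
        reachable = {
            j for j, tok in enumerate(tokens)
            if tok == word and any(0 < j - i <= k for i in reachable)
        }
    return bool(reachable)


def agreement(alpha, labelled_paths):
    """Number of paths on which the association path alpha = (t, w, k)
    agrees with the labels: alpha matches the path iff the path is positive."""
    t, w, k = alpha
    return sum(
        (is_subsequence(t, tags) and matches_words(w, text, k)) == positive
        for (tags, text), positive in labelled_paths
    )


# Usage: one positive and one negative page (made-up examples).
pos = extract_paths("<html><body><p>stock price rises</p></body></html>")
neg = extract_paths(
    "<html><body><h1>stock price</h1><p>weather report</p></body></html>"
)
data = [(p, True) for p in pos] + [(p, False) for p in neg]
score = agreement((("body", "p"), ("stock", "price"), 1), data)
```

A mining algorithm in this setting would search the space of candidate pairs (t,w) for one that maximizes this count; the sketch only evaluates a single candidate.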
References
Shimozono, S., Arimura, H., and Arikawa, S. Efficient discovery of optimal word-association patterns in large text databases. New Generation Computing 18:49–60, 2000.
Arora, S. Polynomial-time approximation schemes for Euclidean TSP and other geometric problems. Proc. 37th IEEE Symposium on Foundations of Computer Science, 2–12, 1996.
Abiteboul, S., Buneman, P., and Suciu, D. Data on the Web: From relations to semistructured data and XML, Morgan Kaufmann, San Francisco, CA, 2000.
Angluin, D. Queries and concept learning. Machine Learning 2:319–342, 1988.
Buneman, P., Davidson, S., Hillebrand, G., and Suciu, D. A query language and optimization techniques for unstructured data. University of Pennsylvania, Computer and Information Science Department, Technical Report MS-CIS 96-09, 1996.
Cohen, W. W. and Fan, W. Learning Page-Independent Heuristics for Extracting Data from Web Pages, Proc. WWW-99. 1999.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence 118:69–113, 2000.
Freitag, D. Information extraction from HTML: Application of a general machine learning approach. Proc. the 15th National Conference on Artificial Intelligence, 517–523, 1998.
Grieser, G., Jantke, K. P., Lange, S., and Thomas, B. A unifying approach to HTML wrapper representation and learning, Proc. the 3rd International Conference, DS2000, Lecture Notes in Artificial Intelligence 1967:50–64, 2000.
Hammer, J., Garcia-Molina, H., Cho, J., and Crespo, A. Extracting semistructured information from the Web. Proc. Workshop on Management of Semistructured Data, 18–25, 1997.
Hsu, C.-N. Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. Proc. 1998 Workshop on AI and Information Integration, 66–73, 1998.
Kamada, T. Compact HTML for small information appliances. W3C NOTE 09-Feb-1998. http://www.w3.org/TR/1998/NOTE-compactHTML-19980209, 1998.
Kushmerick, N. Wrapper induction: efficiency and expressiveness. Artificial Intelligence 118:15–68, 2000.
Lin, S., and Kernighan, B. W. An effective heuristic algorithm for the travelling salesman problem. Operations Research 21:498–516, 1973.
Muslea, I., Minton, S., and Knoblock, C. A. Wrapper induction for semistructured, web-based information sources. Proc. Conference on Automated Learning and Discovery, 1998.
Sakamoto, H., Arimura, H., and Arikawa, S. Identification of tree translation rules from examples. Proc. the 5th International Colloquium on Grammatical Inference, LNAI 1891:241–255, 2000.
Thomas, B. Anti-unification based learning of T-Wrappers for information extraction, Proc. AAAI Workshop on Machine Learning for IE, 15–20, AAAI, 1999.
Valiant, L. G. A theory of the learnable. Comm. ACM 27:1134–1142, 1984.
Wang, J. T., Chirn, G. W., Marr, T. G., Shapiro, B., Shasha, D., and Zhang, K. Combinatorial pattern discovery for scientific data: Some preliminary results. Proc. SIGMOD’94, 115–125, 1994.
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Taniguchi, K., Sakamoto, H., Arimura, H., Shimozono, S., Arikawa, S. (2001). Mining Semi-structured Data by Path Expressions. In: Jantke, K.P., Shinohara, A. (eds) Discovery Science. DS 2001. Lecture Notes in Computer Science, vol 2226. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45650-3_32
DOI: https://doi.org/10.1007/3-540-45650-3_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42956-2
Online ISBN: 978-3-540-45650-6