Characteristic Sets of Strings Common to Semi-structured Documents

Ikeda, Daisuke

doi:10.1007/3-540-46846-3_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1721)

Included in the following conference series:

International Conference on Discovery Science

Abstract

We consider the Maximum Agreement Problem: given positive and negative documents, find a characteristic set that matches many of the positive documents but rejects many of the negative ones. A characteristic set is a sequence (x_1, ..., x_d) of strings such that each x_i is a suffix of x_{i+1} and all the x_i's appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user-defined macros. For example, (“set”, “characteristic set”, “</title> characteristic set”) is a characteristic set extracted from an HTML file. An algorithm that solves the Maximum Agreement Problem, however, does not output useless characteristic sets, such as those made only of HTML tags, since such characteristic sets may match most of the positive documents but also match most of the negative ones. We present an algorithm that, given an integer d, the number of strings in a characteristic set, solves the Maximum Agreement Problem in O(n^2 h^d) time, where n is the total length of the documents and h is the height of the suffix tree of the documents.
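
To make the definition concrete, the following is a minimal Python sketch of what it means for a characteristic set to match a document, and of one plausible agreement score. It is a brute-force illustration, not the algorithm of the paper and not its O(n^2 h^d) method. The function names (is_suffix_chain, occurrences, matches, agreement) are hypothetical. Two points are assumptions not stated in the abstract: the strings may occur in the document in any order, and agreement is counted as the number of positive documents matched plus the number of negative documents rejected.

```python
from typing import List, Sequence, Tuple


def is_suffix_chain(xs: Sequence[str]) -> bool:
    """True if each string is a suffix of the next one in the sequence."""
    return all(b.endswith(a) for a, b in zip(xs, xs[1:]))


def occurrences(doc: str, x: str) -> List[Tuple[int, int]]:
    """All half-open (start, end) spans where x occurs in doc."""
    spans = []
    start = doc.find(x)
    while start != -1:
        spans.append((start, start + len(x)))
        start = doc.find(x, start + 1)
    return spans


def matches(xs: Sequence[str], doc: str) -> bool:
    """True if xs is a suffix chain whose strings all occur in doc
    at pairwise non-overlapping positions (assumption: any order)."""
    if not xs or any(not x for x in xs) or not is_suffix_chain(xs):
        return False
    # Place the longest strings first; they have the fewest occurrences.
    ordered = sorted(xs, key=len, reverse=True)

    def place(i: int, used: List[Tuple[int, int]]) -> bool:
        if i == len(ordered):
            return True
        for s, e in occurrences(doc, ordered[i]):
            # Keep this span only if it is disjoint from every chosen span.
            if all(e <= us or ue <= s for us, ue in used):
                if place(i + 1, used + [(s, e)]):
                    return True
        return False

    return place(0, [])


def agreement(xs: Sequence[str],
              positives: Sequence[str],
              negatives: Sequence[str]) -> int:
    """Assumed score: positives matched plus negatives rejected."""
    hits = sum(matches(xs, d) for d in positives)
    rejections = sum(not matches(xs, d) for d in negatives)
    return hits + rejections


if __name__ == "__main__":
    xs = ("set", "characteristic set", "</title> characteristic set")
    pos = ["<title>characteristic set</title> characteristic set and another set"]
    neg = ["<title>characteristic set</title> characteristic set", "<p>set</p>"]
    print(matches(xs, pos[0]), agreement(xs, pos, neg))
```

In the first negative document every string of the example set occurs, but the only occurrences of “set” lie inside the occurrences of the longer strings, so the non-overlap condition fails and the document is rejected.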

Author information

Authors and Affiliations

Computer Center, Kyushu University, Fukuoka, 812-8581, Japan
Daisuke Ikeda

Editor information

Editors and Affiliations

Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka, 812-8581, Japan
Setsuo Arikawa
Graduate School of Media and Governance, Keio University, 5322 Endoh, Fujisawa-shi, Kanagawa, 252-8520, Japan
Koichi Furukawa

About this paper

Cite this paper

Ikeda, D. (1999). Characteristic Sets of Strings Common to Semi-structured Documents. In: Arikawa, S., Furukawa, K. (eds) Discovery Science. DS 1999. Lecture Notes in Computer Science, vol 1721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46846-3_13

DOI: https://doi.org/10.1007/3-540-46846-3_13
Published: 22 October 1999
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66713-1
Online ISBN: 978-3-540-46846-2
eBook Packages: Springer Book Archive
