Skip to main content

Characteristic Sets of Strings Common to Semi-structured Documents

  • Conference paper
  • First Online:
Discovery Science (DS 1999)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 1721))

Included in the following conference series:

Abstract

We consider Maximum Agreement Problem which is, given positive and negative documents, to find a characteristic set that matches many of positive documents but rejects many of negative ones. A characteristic set is a sequence (x 1,...,x d) of strings such that each x i is a suffix of x i+1 and all x i’s appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user’s de_ned macros. For example, (“set”, “characteristic set”, “</title> characteristic set”) is a characteristic set extracted from an HTML file. But, an algorithm that solves Maximum Agreement Problem does not output useless characteristic sets, such as those made of only tags of HTML, since such characteristic sets may match most of positive documents but also match most of negative ones. We present an algorithm that, given an integer d which is the number of strings in a characteristic set, solves Maximum Agreement Problem in O(n 2 h d) time, where n is the total length of documents and h is the height of the su_x tree of the documents.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. S. Abiteboul, Querying Semi-Structured Data. In Proc. of the 6th International Conference on Database Theory, (1997) 1–18.

    Google Scholar 

  2. A. V. Aho, D. S. Hirschberg and J. D. Ullman, Bounds on the Complexity of the Longest Common Subsequences Problem, J. ACM, 23(1), pp. 1–12, 1976.

    Article  MATH  MathSciNet  Google Scholar 

  3. H. Ahonen, O. Heinonen, M. Klemettinen, and A. I. Verkamo, Mining in the Phrasal Frontier. In Proc. of the first European Symposium on Principles of Data Mining and Knowledge Discovery (1997), 343–350.

    Google Scholar 

  4. R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases. In Proc. of the 1993 ACM SIGMOD Conference on Management of Data (1993), 207–216.

    Google Scholar 

  5. R. Agrawal and R. Srikant, Mining Sequential Patterns. In Proc. of the 11th Int. Conference on Data Engineering (1995), 3–14.

    Google Scholar 

  6. H. Arimura and S. Shimozono, Maximizing Agreement with a Classification by Bounded or Unbounded Number of Associated Words. In Proc. of the 9th International Symposium on Algorithms and Computation (1998).

    Google Scholar 

  7. H. Arimura, A. Wataki, R. Fujino, and S. Arikawa, A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases. In Proc. of the 9th Int. Workshop on Algorithmic Learning Theory, LNAI 1501, 247–261 (1998).

    Google Scholar 

  8. R. J. Bayardo, Efficiently Mining Long Patterns from Databases. In Proc. of the 1998 ACM SIGMOD Conference on Management of Data (1998), 85–93.

    Google Scholar 

  9. L. Devroye, W. Szpankowski, and B. Rais, A Note on the Height of Suffix Trees, SIAM J. Comput. Vol. 21, No. 1, pp. 48–53, February 1992.

    Article  MATH  MathSciNet  Google Scholar 

  10. R. Feldman and I. Dagan, Knowledge Discovery in Textual Databases (KDT). In Proc. of the first Int. Conference on Knowledge Discovery and Data Mining (1995), 112–117.

    Google Scholar 

  11. M. J. Kearns, R. E. Shapire, L. M. Sellie, Toward Efficient Agnostic Learning. Machine Learning, 17, pp. 115–141 1994.

    MATH  Google Scholar 

  12. H. Mannila and H. Toivonne, Discovering Generalized Episodes Using Minimal Occurrences, In Porc. of the second Int. Conference on Knowledge Discovery and Data Mining (1996), 146–151.

    Google Scholar 

  13. E. M. McCreight, A Space-Economical Suffix Tree Construction Algorithm. J. ACM, 23(2), pp. 262–272, 1976.

    Article  MATH  MathSciNet  Google Scholar 

  14. M. Nakanishi, M. Hashidume, M. Ito, A. Hashimoto, A Linear-Time Algorithm for Computing Characteristic Strings. In Proc. of the 5th International Symposium on Algorithms and Computation (1994), 315–323.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ikeda, D. (1999). Characteristic Sets of Strings Common to Semi-structured Documents. In: Arikawa, S., Furukawa, K. (eds) Discovery Science. DS 1999. Lecture Notes in Computer Science(), vol 1721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46846-3_13

Download citation

  • DOI: https://doi.org/10.1007/3-540-46846-3_13

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66713-1

  • Online ISBN: 978-3-540-46846-2

  • eBook Packages: Springer Book Archive

Publish with us

Policies and ethics