Abstract

Given the popularity of Web news services, we propose a topic mining framework that supports the identification of meaningful topics (themes) from news stream data. News articles are retrieved from Web news services and processed by data mining tools to produce useful higher-level knowledge, which is stored in a content description database. By exploiting the knowledge in this database, an information delivery agent can answer a user request without interacting with a Web news service directly. A key challenge in news repository management is the high rate of document updates: since a single Web news service publishes several hundred news articles every day, it is essential to develop incremental data mining tools that can cope with such dynamic environments. To this end, we present an incremental hierarchical document clustering algorithm that uses a neighborhood search. The novelty of the proposed algorithm lies in exploiting locality information to reduce the amount of computation while still producing high-quality clusters. Other components of topic mining (e.g., learning topic ontologies) can be performed on the resulting document hierarchy. Experimental results show that the proposed incremental clustering produces high-quality clusters, and that the topic ontology provides an interpretation of the data at different levels of abstraction.
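The idea of restricting work to a document's neighborhood can be illustrated with a minimal sketch. This is not the algorithm presented in the paper: it is a flat (non-hierarchical), single-pass clusterer written under the assumption that a new article's "neighborhood" is the set of earlier articles sharing at least one term with it, found through an inverted index. The class name `IncrementalClusterer`, the `threshold` parameter, and the whitespace tokenizer are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class IncrementalClusterer:
    """Illustrative single-pass clusterer (an assumption, not the paper's method)."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold      # hypothetical similarity cutoff
        self.docs = []                  # doc id -> term counts
        self.labels = []                # doc id -> cluster id
        self.index = defaultdict(set)   # term -> ids of docs containing it
        self.n_clusters = 0

    def add(self, text):
        """Insert one article and return the cluster id it was assigned."""
        vec = Counter(text.lower().split())

        # Neighborhood search: only documents that share a term with the
        # new article are compared, instead of the whole collection.
        candidates = set()
        for term in vec:
            candidates |= self.index[term]

        best_id, best_sim = None, 0.0
        for doc_id in candidates:
            sim = cosine(vec, self.docs[doc_id])
            if sim > best_sim:
                best_id, best_sim = doc_id, sim

        if best_id is not None and best_sim >= self.threshold:
            label = self.labels[best_id]    # join the nearest neighbor's cluster
        else:
            label = self.n_clusters         # no close neighbor: open a new cluster
            self.n_clusters += 1

        new_id = len(self.docs)
        self.docs.append(vec)
        self.labels.append(label)
        for term in vec:
            self.index[term].add(new_id)
        return label
```

Each call to `add` touches only the candidate set and the one cluster that receives the article, so existing assignments are never recomputed. A hierarchical variant would additionally have to decide when clusters merge or split, which this sketch leaves out.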

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Seokkyung Chung (1)
  • Dennis McLeod (1)
  1. Department of Computer Science, and Integrated Media System Center, University of Southern California, Los Angeles
