HotMiner: Discovering Hot Topics from Dirty Text

  • Malú Castellanos


For companies with websites that contain millions of documents available to their customers, it is critical to identify their customers’ hottest information needs along with the associated documents. This information lets companies reduce costs and become more competitive and responsive to their customers’ needs. In particular, technical support centers could drastically lower the number of support engineers they need by knowing the topics of their customers’ hot problems (i.e., hot topics) and making them available on their websites, along with links to the corresponding solution documents, so that customers can efficiently find the right documents to solve their problems themselves. In this chapter we present a novel approach to discovering hot topics of customers’ problems by mining the logs of customer support centers. Our technique for search log mining discovers hot topics that match the user’s perspective, which often differs from the topics derived by document content categorization methods. Our techniques for mining case logs include extracting relevant sentences from cases to form case excerpts, which are more suitable for hot topics mining. In contrast to most text mining work, our approach deals with dirty text containing typos, ad hoc abbreviations, special symbols, incorrect use of English grammar, cryptic tables, and ambiguous and missing punctuation. It includes a variety of techniques that either directly handle some of these anomalies or are robust in spite of them. In particular, we have developed a postfiltering technique to deal with the effects of noisy clickstreams due to random clicking behavior, a Thesaurus Assistant to help generate a thesaurus of “dirty” variations of words that is used to normalize the terminology, and a Sentence Identifier capable of eliminating code and tables.
The techniques that compose our approach have been implemented as a toolbox, HotMiner, which has been used in experiments on logs from Hewlett-Packard’s Customer Support Center.
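As an illustration of what a thesaurus of “dirty” word variations might look like in practice, the following minimal sketch groups observed tokens under the closest canonical term and then uses that mapping to normalize text. It is not HotMiner's implementation: the use of Levenshtein edit distance as the similarity measure, the threshold `max_dist`, and the function names `edit_distance`, `suggest_variants`, and `normalize` are all assumptions made for this example.

```python
from typing import Dict, Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest_variants(canonical: Iterable[str],
                     observed: Iterable[str],
                     max_dist: int = 2) -> Dict[str, List[str]]:
    """Map each canonical term to the observed tokens within max_dist edits.

    Hypothetical helper: a token is attached to its nearest canonical term
    only; ties are broken alphabetically so the result is deterministic.
    """
    thesaurus: Dict[str, List[str]] = {term: [] for term in canonical}
    for token in set(observed):
        if token in thesaurus:
            continue  # already a canonical spelling
        best = min(thesaurus, key=lambda t: (edit_distance(token, t), t))
        if edit_distance(token, best) <= max_dist:
            thesaurus[best].append(token)
    return {term: sorted(variants) for term, variants in thesaurus.items()}


def normalize(tokens: Iterable[str],
              thesaurus: Dict[str, List[str]]) -> List[str]:
    """Replace every known dirty variant with its canonical term."""
    lookup = {v: term for term, variants in thesaurus.items() for v in variants}
    return [lookup.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    thesaurus = suggest_variants(
        canonical=["printer", "driver"],
        observed=["printr", "pritner", "drivr", "install", "printer"],
    )
    print(thesaurus)
    print(normalize(["printr", "drivr", "install"], thesaurus))
```

In an assistant-style workflow, suggestions like these would presumably be reviewed by a person before entering the thesaurus, since a pure distance threshold can conflate distinct short words.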


Noun Phrase; Edit Distance; Case Document; Content View; Sentence Boundary
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer Science+Business Media New York 2004
