World Wide Web, Volume 21, Issue 2, pp 289–310

Finding maximal ranges with unique topics in a text database

  • Zhihui Yang
  • Huixin Ma
  • Zhenying He
  • X. Sean Wang

Abstract

Recent years have witnessed the rapid growth of text data, and with it the increasing importance of in-depth analysis of text data for various applications. Text data are often organized in a database with documents labeled by attributes such as time and location. Different documents manifest different topics. The topics of the documents may change along the attributes of the documents, and such changes have been the subject of past research. However, previous analysis techniques, such as topic detection and tracking, topic lifetime, and burstiness, all focus on the topic behavior of the documents in a given attribute range without contrasting it with that of the documents in the overall range. This paper introduces the concept of unique topics, referring to those topics that appear frequently only within a small range of documents but not in the whole range. These unique topics may reflect characteristics of the documents in this small range that are not found outside it. The paper aims to provide an efficient pruning-based algorithm that, for a user-given set of keywords and a user-given attribute, finds the maximal ranges along the given attribute and their unique topics that are highly related to the given keyword set. Thorough experiments show that the algorithm is effective in various scenarios.
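
To make the notion of a maximal range with a unique topic concrete, the following is a minimal brute-force sketch in Python. It is not the pruning-based algorithm of the paper, whose details are not given in the abstract. It assumes that the documents are already sorted along the chosen attribute, that each document is represented as a set of topic labels, and that "frequent" means a simple fraction threshold. The function name `maximal_unique_ranges` and the parameters `local_min`, `global_max`, and `min_len` are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def maximal_unique_ranges(docs, topic, local_min=0.6, global_max=0.3, min_len=2):
    """Return maximal inclusive index ranges (i, j) in which `topic` is frequent
    locally although it is not frequent over the whole collection.

    Illustrative brute force only; thresholds and definitions are assumptions.
    """
    n = len(docs)
    hits = [1 if topic in d else 0 for d in docs]
    # A topic that is frequent over the whole range cannot be unique to a sub-range.
    if n == 0 or sum(hits) / n > global_max:
        return []
    # prefix[k] = number of the first k documents that contain the topic.
    prefix = [0] + list(accumulate(hits))
    # Enumerate every range of at least min_len documents and keep the frequent ones.
    qualifying = [
        (i, j)
        for i in range(n)
        for j in range(i + min_len - 1, n)
        if (prefix[j + 1] - prefix[i]) / (j - i + 1) >= local_min
    ]
    # A range is maximal if no other qualifying range contains it.
    return [
        (i, j)
        for (i, j) in qualifying
        if not any(a <= i and j <= b and (a, b) != (i, j) for (a, b) in qualifying)
    ]


# Ten documents ordered along the attribute; topic "x" occurs only in documents 3 to 5.
docs = [set(), set(), set(), {"x"}, {"x"}, {"x"}, set(), set(), set(), set()]
print(maximal_unique_ranges(docs, "x", local_min=0.7, global_max=0.4))
# → [(2, 5), (3, 6)]
```

The sketch examines every range, which is quadratic in the number of documents for a single topic. A pruning-based method such as the one the paper targets would avoid enumerating all of these candidates.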

Keywords

Text database · Unique topics mining · Maximal ranges

Notes

Acknowledgments

We thank Yaoliang Chen for his useful comments, and Chenghao Guo and Kaiwen Zhou for their enthusiastic help during the data collection process. We also thank the anonymous reviewers for their invaluable feedback and suggestions that have greatly improved this work. This work was partially supported by the NSFC (No. 61370080, No. 61170007) and the Shanghai Innovation Action Project (Grant No. 16DZ1100200), as well as by respective grants from EMC and SAP.

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China
