Abstract
Analyzing texts from social media poses many challenges, including shortness, dynamism, and huge volume. Short texts provide so little information that statistical models often fail to work on them. In this paper, we present a very simple approach, bag-of-biterms (BoB), that helps statistical models such as Hierarchical Dirichlet Processes (HDP) work well with short texts. By using both terms (words) and biterms to represent documents, BoB provides significant benefits: (1) it naturally lengthens the representation and thus reduces the ill effects of shortness; (2) it makes posterior inference more tractable in a large class of probabilistic models, including HDP; (3) it requires no modification of existing models/methods, and thus can be easily employed in a wide class of statistical models. To evaluate these benefits, we consider Online HDP, which can deal with dynamic and massive text collections, and we experiment on three large corpora of short texts crawled from Twitter, Yahoo Q&A, and New York Times. Extensive experiments show that BoB helps HDP work significantly better in both predictiveness and quality.
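The BoB representation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and biterms are formed from all unordered pairs of tokens in a document, with \((w_i, w_j)\) and \((w_j, w_i)\) merged by sorting the pair.

```python
from itertools import combinations

def bag_of_biterms(tokens):
    """Represent a short document by its words plus all unordered
    word pairs (biterms), which lengthens the representation."""
    # keep the original terms (bag-of-words part)
    features = list(tokens)
    # add every unordered pair of token positions; merging (w_i, w_j)
    # and (w_j, w_i) into one biterm by sorting the pair
    for a, b in combinations(tokens, 2):
        features.append(tuple(sorted((a, b))))
    return features

doc = ["short", "text", "topic"]
print(bag_of_biterms(doc))
# a 3-word document yields 3 terms plus 3 biterms, i.e. 6 features
```

A document of length \(n\) thus yields \(n + \binom{n}{2}\) features, which is what lets statistical models see far more co-occurrence evidence per short document.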
Acknowledgments
This work was partially supported by Vietnam National Foundation for Science and Technology Development (NAFOSTED Project No. 102.05-2014.28), and by AOARD (US Air Force) and ITC-PAC (US Army) under agreement number FA2386-15-1-4011.
Appendix: Conversion of topic-over-biterms (distribution over biterms) to topic-over-words (distribution over words)
In BoB, after training the model we obtain topics, each of which is a multinomial distribution over biterms, and we want to convert each topic into a distribution over words. Let \(\varvec{\phi }_k\) be the distribution over biterms of topic k. By the law of total probability:
\( p(w_i \mid z=k) = \sum _{j=1}^{V} p(w_i, w_j \mid z=k) = \sum _{j=1}^{V} p(b_{ij} \mid z=k) = \sum _{j=1}^{V} \phi _{k b_{ij}}. \)
As discussed in Sect. 4.2, when implementing BoB we can merge \(b_{ij}\) and \(b_{ji}\) into a single biterm \(b_{ij}\) with \(i<j\). Because they occur identically in every document, after training we expect \(p(b_{ij}\mid z=k)\) to equal \(p(b_{ji}\mid z=k)\). Grouping these biterms into one, the conversion for this implementation becomes: \( p(w_i\mid z=k) = \phi _{kb_{ii}} + \frac{1}{2}\sum _{b \ni w_i,\, b \ne b_{ii}}\phi _{kb}, \) where the sum runs over merged biterms containing \(w_i\) other than the self-pair \(b_{ii}\), and the factor \(\frac{1}{2}\) compensates for the doubled mass of each merged biterm.
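The conversion formula above can be checked with a short sketch. This is an illustrative implementation under assumed conventions, not the authors' code: a topic is represented as a dict mapping each merged biterm, a pair \((i, j)\) with \(i \le j\), to its probability.

```python
def topic_over_words(phi_k):
    """Convert a topic's distribution over merged biterms into a
    distribution over words, following the appendix formula:
    p(w_i | z=k) = phi_k[(i, i)]
                 + 0.5 * sum of phi_k[b] over biterms b != (i, i) containing w_i.
    """
    vocab = sorted({w for pair in phi_k for w in pair})
    p = {}
    for w in vocab:
        self_pair = phi_k.get((w, w), 0.0)
        # each merged cross biterm carries the mass of both orderings,
        # hence the factor 1/2
        cross = sum(prob for pair, prob in phi_k.items()
                    if w in pair and pair != (w, w))
        p[w] = self_pair + 0.5 * cross
    return p

# toy topic over a 2-word vocabulary: biterms (0,0), (0,1), (1,1)
phi_k = {(0, 0): 0.2, (0, 1): 0.5, (1, 1): 0.3}
print(topic_over_words(phi_k))
```

Note that the result is automatically normalized: each cross biterm contains two words and contributes weight \(\frac{1}{2}\) to each, so the word probabilities sum to one whenever \(\varvec{\phi }_k\) does.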
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Mai, K., Mai, S., Nguyen, A., Van Linh, N., Than, K. (2016). Enabling Hierarchical Dirichlet Processes to Work Better for Short Texts at Large Scale. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science(), vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_34