Abstract
The proliferation of texts on the Web presents great challenges for knowledge discovery in text collections. Clustering is a powerful tool for organizing information and recognizing its structure. Most text clustering techniques are designed to deal with either long or short texts. However, many real-life collections contain both long and short texts, that is, mixed length texts. Current text clustering techniques are unsatisfactory for such collections because they do not account for the sparseness and high dimensionality of mixed length texts. In this paper, we propose a novel approach, Length-Aware Dual Latent Dirichlet Allocation (ADLDA), which clusters mixed length texts by obtaining auxiliary knowledge from long texts for short texts and from short texts for long texts in the collection. The degree of mutual assistance depends on the ratio of long texts to short texts in the corpus. Experimental results on real datasets show that our approach achieves superior performance over other state-of-the-art text clustering approaches for mixed length texts.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Chen, X., Zhang, Y., Yin, Y., Li, C., Xing, C. (2014). A LDA-Based Algorithm for Length-Aware Text Clustering. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_45
DOI: https://doi.org/10.1007/978-3-319-11116-2_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11115-5
Online ISBN: 978-3-319-11116-2
eBook Packages: Computer Science, Computer Science (R0)