Abstract
With the rapid development of the internet and communication technology, huge amounts of data have accumulated. Short texts such as paper abstracts and emails are common in such data. Clustering these short documents is useful for revealing the structure of the data or for supporting other data mining applications. However, almost all current clustering algorithms become very inefficient or even unusable when handling very large (hundreds of GB), high-dimensional text data. It is also difficult to achieve acceptable clustering accuracy because key words appear only a few times in short documents. In this paper, we propose a frequent-term-based parallel clustering algorithm that can be used to cluster short documents in very large text databases. A novel semantic classification method is also used to improve clustering accuracy. Our experimental study shows that our algorithm is more accurate and efficient than other clustering algorithms when clustering large-scale short documents. Furthermore, our algorithm scales well and can be used to process even very large data sets.
This project is sponsored by the National 863 High Technology Development Foundation (No. 2004AA112020, No. 2003AA115210 and No. 2003AA111020).
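To make the idea of frequent-term-based clustering concrete, the following is a minimal single-machine sketch. It is not the parallel algorithm proposed in the paper and omits the semantic classification step. It assumes whitespace tokenization, single-term candidates only, and a greedy selection rule chosen for illustration. The function name `frequent_term_clusters` and the parameter `min_support` are hypothetical.

```python
from collections import defaultdict


def frequent_term_clusters(docs, min_support=2):
    """Group documents by shared frequent terms (simplified illustration).

    Assumptions: whitespace tokenization, single terms only, greedy selection.
    Returns (clusters, unassigned) where clusters maps a term to document indices.
    """
    # cover[t] = set of indices of the documents that contain term t
    cover = defaultdict(set)
    for i, text in enumerate(docs):
        for term in set(text.lower().split()):
            cover[term].add(i)

    # A term is "frequent" if it occurs in at least min_support documents.
    candidates = {t: ids for t, ids in cover.items() if len(ids) >= min_support}

    clusters = {}
    unassigned = set(range(len(docs)))
    while unassigned and candidates:
        # Pick the frequent term that covers the most still-unassigned documents;
        # sorting first makes ties resolve alphabetically.
        term = max(sorted(candidates), key=lambda t: len(candidates[t] & unassigned))
        members = candidates.pop(term) & unassigned
        if not members:
            break  # no remaining frequent term covers an unassigned document
        clusters[term] = sorted(members)
        unassigned -= members
    return clusters, sorted(unassigned)


docs = [
    "parallel clustering algorithm",
    "parallel text clustering",
    "semantic kernel",
    "semantic space model",
    "email",
]
clusters, leftover = frequent_term_clusters(docs, min_support=2)
print(clusters)
print(leftover)
```

In this sketch each cluster is labeled by the frequent term that defines it, and documents containing no frequent term are returned separately. Short documents often fall into that leftover set because their key words occur so rarely, which is the accuracy problem the abstract attributes to short texts.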
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, Y., Jia, Y., Yang, S. (2006). Short Documents Clustering in Very Large Text Databases. In: Feng, L., Wang, G., Zeng, C., Huang, R. (eds) Web Information Systems – WISE 2006 Workshops. WISE 2006. Lecture Notes in Computer Science, vol 4256. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11906070_8
DOI: https://doi.org/10.1007/11906070_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-47663-4
Online ISBN: 978-3-540-47664-1
eBook Packages: Computer Science, Computer Science (R0)