Short Documents Clustering in Very Large Text Databases

Wang, Yongheng; Jia, Yan; Yang, Shuqiang

doi:10.1007/11906070_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4256)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

With the rapid development of the internet and communication technology, huge amounts of data are being accumulated. Short texts such as paper abstracts and emails are common in such data. Clustering these short documents is useful for revealing the structure of the data and for supporting other data mining applications. However, almost all current clustering algorithms become very inefficient or even unusable when handling very large (hundreds of GB), high-dimensional text data. It is also difficult to achieve acceptable clustering accuracy, since key words appear only a few times in short documents. In this paper, we propose a frequent-term-based parallel clustering algorithm that can be used to cluster short documents in very large text databases. A novel semantic classification method is also used to improve clustering accuracy. Our experimental study shows that our algorithm is more accurate and efficient than other clustering algorithms when clustering large-scale short documents. Furthermore, our algorithm scales well and can be used to process very large data sets.
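
The abstract does not spell out the algorithm, so the following is only a minimal single-machine sketch of the general idea behind frequent-term-based clustering, not the parallel method or the semantic classification step proposed in the paper. It assumes whitespace tokenization and a greedy selection rule; the function name `frequent_term_clusters` and the `min_support` parameter are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def frequent_term_clusters(docs, min_support=2):
    """Greedy sketch: each cluster is the set of documents sharing one frequent term."""
    # Map each term to the set of document indices containing it (its "cover").
    cover = defaultdict(set)
    for i, text in enumerate(docs):
        for term in set(text.lower().split()):
            cover[term].add(i)

    # A term is "frequent" if it occurs in at least min_support documents.
    frequent = {t: ids for t, ids in cover.items() if len(ids) >= min_support}

    unassigned = set(range(len(docs)))
    clusters = {}
    while unassigned and frequent:
        # Pick the term covering the most unassigned documents
        # (ties go to the alphabetically first term).
        best = max(sorted(frequent), key=lambda t: len(frequent[t] & unassigned))
        members = frequent.pop(best) & unassigned
        if len(members) < min_support:
            break  # no remaining term covers enough unassigned documents
        clusters[best] = sorted(members)
        unassigned -= members
    return clusters, sorted(unassigned)


docs = [
    "parallel clustering of text",
    "clustering short text",
    "email spam filter",
    "spam email detection",
    "unrelated",
]
clusters, leftover = frequent_term_clusters(docs)
```

In this toy run the documents group under the terms "clustering" and "email", and the last document stays unassigned. Because short documents repeat few key words, a raw term-overlap rule like this one leaves many documents unassigned, which is the accuracy problem the abstract raises.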

This project is sponsored by the National 863 High Technology Development Foundation (No. 2004AA112020, No. 2003AA115210 and No. 2003AA111020).

Author information

Authors and Affiliations

Computer School, National University of Defense Technology, Changsha, China
Yongheng Wang, Yan Jia & Shuqiang Yang

Editor information

Editors and Affiliations

Dept. of Computer Science & Technology, Tsinghua University, Beijing, China
Ling Feng
Northeastern University, 110004, Shenyang, Liaoning, China
Guoren Wang
State Key Lab of Software Engineering, Wuhan University, 430072, Wuhan, P.R. China
Cheng Zeng
School of Information Management, Wuhan University, 430072, Wuhan, China
Ruhua Huang

About this paper

Cite this paper

Wang, Y., Jia, Y., Yang, S. (2006). Short Documents Clustering in Very Large Text Databases. In: Feng, L., Wang, G., Zeng, C., Huang, R. (eds) Web Information Systems – WISE 2006 Workshops. WISE 2006. Lecture Notes in Computer Science, vol 4256. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11906070_8

DOI: https://doi.org/10.1007/11906070_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-47663-4
Online ISBN: 978-3-540-47664-1
eBook Packages: Computer Science, Computer Science (R0)
