Short Documents Clustering in Very Large Text Databases

  • Conference paper
Web Information Systems – WISE 2006 Workshops (WISE 2006)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4256))

Abstract

With the rapid development of the internet and communication technology, huge amounts of data have accumulated. Short texts such as paper abstracts and emails are common in such data. Clustering these short documents is useful for revealing the structure of the data and for supporting other data mining applications. However, almost all current clustering algorithms become very inefficient or even unusable when handling very large (hundreds of GB), high-dimensional text data. It is also difficult to achieve acceptable clustering accuracy because keywords appear only a few times in short documents. In this paper, we propose a frequent-term-based parallel clustering algorithm that can cluster short documents in very large text databases. A novel semantic classification method is also used to improve clustering accuracy. Our experimental study shows that the algorithm is more accurate and efficient than other clustering algorithms on large-scale short documents. Furthermore, the algorithm scales well and can process even very large data sets.
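
As a rough illustration of the general idea behind frequent-term-based clustering (not the algorithm proposed in this paper, whose details are not given in the abstract), the following minimal Python sketch treats every term set that occurs in at least `min_support` documents as a cluster candidate, then greedily picks the candidates that cover the most still-unassigned documents. The function names, the whitespace tokenizer, the greedy selection rule, and the sample documents are all assumptions made for this sketch; it is sequential and omits the parallel and semantic components described above.

```python
from collections import defaultdict
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to the indices of the documents containing it.

    Hypothetical helper: tokenization is plain lowercase whitespace splitting.
    """
    cover = defaultdict(set)
    for i, doc in enumerate(docs):
        terms = sorted(set(doc.lower().split()))
        for size in range(1, max_size + 1):
            for combo in combinations(terms, size):
                cover[frozenset(combo)].add(i)
    # Keep only term sets supported by at least min_support documents.
    return {ts: ids for ts, ids in cover.items() if len(ids) >= min_support}


def greedy_clusters(docs, min_support=2, max_size=2):
    """Greedily pick term sets that cover the most unassigned documents.

    Returns (clusters, leftover), where each cluster is a pair
    (sorted terms, sorted document indices) and leftover lists the
    documents that share no frequent term set with any other document.
    """
    candidates = frequent_term_sets(docs, min_support, max_size)
    unassigned = set(range(len(docs)))
    clusters = []
    while unassigned and candidates:
        def score(item):
            terms, ids = item
            # Prefer wider coverage, then larger term sets; sort terms for determinism.
            return (len(ids & unassigned), len(terms), sorted(terms))

        terms, ids = max(candidates.items(), key=score)
        members = ids & unassigned
        if not members:
            break  # no remaining candidate covers an unassigned document
        clusters.append((sorted(terms), sorted(members)))
        unassigned -= members
        del candidates[terms]
    return clusters, sorted(unassigned)


if __name__ == "__main__":
    sample = [
        "parallel text clustering",
        "text clustering accuracy",
        "email spam filter",
        "spam filter rules",
        "unrelated note",
    ]
    for terms, members in greedy_clusters(sample)[0]:
        print(terms, members)
```

Each cluster is described by the terms its members share, which is one reason frequent term sets are attractive for short documents. Enumerating all term combinations per document is only practical here because the documents are short and `max_size` is small; a very large database would need a dedicated frequent-itemset miner instead.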

This project is sponsored by the national 863 high technology development foundation (No. 2004AA112020, No. 2003AA115210 and No. 2003AA111020).

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, Y., Jia, Y., Yang, S. (2006). Short Documents Clustering in Very Large Text Databases. In: Feng, L., Wang, G., Zeng, C., Huang, R. (eds) Web Information Systems – WISE 2006 Workshops. WISE 2006. Lecture Notes in Computer Science, vol 4256. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11906070_8

  • DOI: https://doi.org/10.1007/11906070_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-47663-4

  • Online ISBN: 978-3-540-47664-1

  • eBook Packages: Computer Science, Computer Science (R0)
