
Leveraging Network Structure for Incremental Document Clustering

  • Conference paper
Web Technologies and Applications (APWeb 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7235)

Abstract

Recent studies have shown that link-based clustering methods can significantly improve the performance of content-based clustering. However, most previous algorithms were developed for fixed data sets and are not applicable to dynamic environments such as data warehouses and online digital libraries.

In this paper, we introduce a novel approach that leverages network structure for incremental clustering. Under this framework, both link and content information are used to determine the host cluster of a new document. Combining the two types of information leads to promising clustering performance. Furthermore, the status of core members is used to quickly determine whether to split or merge a new cluster. This filtering step eliminates unnecessary and time-consuming checks of textual similarity over the whole corpus, and thus greatly speeds up the entire procedure. We evaluate the proposed approach on several real-world publication data sets and conduct an extensive comparison with both classic content-based and recent link-based algorithms. The experimental results demonstrate the effectiveness and efficiency of our method.
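
The abstract does not state how link and content evidence are combined, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes documents and cluster centroids are sparse term-weight dictionaries, that the link score is the fraction of the new document's links falling into a cluster, and that the two scores are mixed linearly. The names `choose_host_cluster`, `alpha`, `clusters`, and `assignment` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def choose_host_cluster(doc_terms, doc_links, clusters, assignment, alpha=0.5):
    """Pick a host cluster for a new document (illustrative only).

    doc_terms:  term -> weight for the new document
    doc_links:  ids of already clustered documents linked to the new one
    clusters:   cluster id -> centroid (term -> weight)
    assignment: document id -> cluster id
    alpha:      hypothetical mixing weight between link and content scores
    """
    # Count how many of the new document's links land in each cluster.
    link_counts = Counter(assignment[d] for d in doc_links if d in assignment)
    total = sum(link_counts.values())

    # Assumption: only clusters reached by a link are candidates, so text is
    # never compared against the whole corpus. With no links at all, fall
    # back to comparing against every cluster centroid.
    candidates = link_counts if total else clusters

    best, best_score = None, -1.0
    for cid in candidates:
        link_score = link_counts[cid] / total if total else 0.0
        content_score = cosine(doc_terms, clusters[cid])
        score = alpha * link_score + (1 - alpha) * content_score
        if score > best_score:
            best, best_score = cid, score
    return best


clusters = {
    "db": {"query": 1.0, "index": 1.0},
    "ml": {"neural": 1.0, "training": 1.0},
}
assignment = {"p1": "db", "p2": "db", "p3": "ml"}

# Links are split evenly, so the content score breaks the tie.
host = choose_host_cluster({"query": 1.0, "join": 1.0}, ["p1", "p3"],
                           clusters, assignment)
print(host)
```

Restricting the candidates to linked clusters is one way to reflect the abstract's point that link structure can avoid corpus-wide text comparisons. The core-member test for splitting or merging clusters is not sketched here because the abstract gives no detail on it.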

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Qian, T., Si, J., Li, Q., Yu, Q. (2012). Leveraging Network Structure for Incremental Document Clustering. In: Sheng, Q.Z., Wang, G., Jensen, C.S., Xu, G. (eds) Web Technologies and Applications. APWeb 2012. Lecture Notes in Computer Science, vol 7235. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29253-8_29

  • DOI: https://doi.org/10.1007/978-3-642-29253-8_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29252-1

  • Online ISBN: 978-3-642-29253-8

  • eBook Packages: Computer Science, Computer Science (R0)
