Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques

  • Regular Paper
Journal of Computer Science and Technology

Abstract

As high-quality descriptors of web page semantics, social annotations (tags) have been used for web document clustering and have achieved promising results. However, most web pages have few tags (fewer than 10), and this sparsity seriously limits the use of tags for clustering. In this work, we propose a user-related tag expansion method to overcome this problem. The method incorporates additional useful tags into the original tag document by using user tagging data as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may change. To tackle this problem, we have designed a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that 1) our user-related tag expansion method can be effectively applied to over 90% of tagged web documents; 2) Folk-LDA can alleviate topic drift in expansion, especially for topic-specific documents; 3) the proposed tag-based clustering methods significantly outperform the word-based methods, which indicates that tags could be a better resource for the clustering task.
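To make the idea of user-related tag expansion concrete, the following is a minimal Python sketch and not the method of the paper. The abstract does not specify how candidate tags are selected or ranked, so everything here is an assumption for illustration: the function name `expand_tags`, the `(user, document, tag)` triple format, and the ranking by raw frequency. The sketch collects the users who tagged a document, gathers the tags those users applied to other documents, and returns them separately from the original tags, loosely echoing the abstract's point that original and expanded tags are treated as separate observations.

```python
from collections import Counter


def expand_tags(doc, bookmarks, top_k=5):
    """Return (original_tags, expanded_tags) for `doc`.

    `bookmarks` is an iterable of (user, document, tag) triples.
    Assumption: candidates are ranked by how often the document's
    own taggers used them elsewhere; the paper may rank differently.
    """
    bookmarks = list(bookmarks)

    # Tags already assigned to the target document, in first-seen order.
    original = []
    for _user, d, tag in bookmarks:
        if d == doc and tag not in original:
            original.append(tag)

    # Users who tagged the target document.
    users = {user for user, d, _tag in bookmarks if d == doc}

    # Count tags those users applied to other documents,
    # skipping tags the document already has.
    candidates = Counter(
        tag
        for user, d, tag in bookmarks
        if user in users and d != doc and tag not in original
    )

    expanded = [tag for tag, _count in candidates.most_common(top_k)]
    return original, expanded
```

Returning the two lists separately, rather than merging them into one bag of tags, leaves a downstream model free to weight expanded tags differently from original ones, which is one simple way to limit topic drift.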



Author information

Corresponding author

Correspondence to Peng Li.

Additional information

This work is supported by the National Natural Science Foundation of China under Grant No. 61070111.


About this article

Cite this article

Li, P., Wang, B. & Jin, W. Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques. J. Comput. Sci. Technol. 27, 554–566 (2012). https://doi.org/10.1007/s11390-012-1243-y


  • DOI: https://doi.org/10.1007/s11390-012-1243-y
