Abstract
As high-quality descriptors of web page semantics, social annotations (tags) have been used for web document clustering with promising results. However, most web pages have few tags (fewer than 10), and this sparsity seriously limits the usefulness of tags for clustering. In this work, we propose a user-related tag expansion method to overcome this problem: it incorporates additional useful tags into the original tag document by using user tagging data as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may change. To tackle this problem, we design a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that 1) our user-related tag expansion method can be effectively applied to over 90% of tagged web documents; 2) Folk-LDA can alleviate topic drift in expansion, especially for topic-specific documents; 3) the proposed tag-based clustering methods significantly outperform the word-based methods, which indicates that tags could be a better resource for the clustering task.
Additional information
This work is supported by the National Natural Science Foundation of China under Grant No. 61070111.
About this article
Cite this article
Li, P., Wang, B. & Jin, W. Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques. J. Comput. Sci. Technol. 27, 554–566 (2012). https://doi.org/10.1007/s11390-012-1243-y
DOI: https://doi.org/10.1007/s11390-012-1243-y