A Statistics-Based Semantic Relation Analysis Approach for Document Clustering

Cheng, Xin; Miao, Duoqian; Wang, Lei

doi:10.1007/978-3-319-11740-9_31


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8818)

Abstract

Document clustering is a widely researched topic in machine learning, and a number of approaches have been proposed to represent and cluster documents. A recent trend in document clustering research is to incorporate semantic information into the document representation. In this paper, we introduce a novel technique for capturing robust and reliable semantic information from term-term co-occurrence statistics. First, we propose a novel method to evaluate the explicit semantic relation between terms from their co-occurrence information. Second, the underlying (implicit) semantic relation between terms is captured through their interaction with other terms. Finally, these two complementary semantic relations are integrated to capture the complete semantic information in the original documents. Experimental results show that clustering performance improves significantly when the document representation is enriched with this semantic information.
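The abstract does not give the formulas behind the three steps, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes document-level co-occurrence scored with a Jaccard overlap for the explicit relation, cosine similarity between two terms' explicit-relation profiles over all third terms for the underlying relation, and a linear blend with a hypothetical weight `alpha` for the integration. All function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def explicit_relation(docs):
    """Assumed explicit relation: Jaccard overlap of the sets of
    documents in which two terms occur."""
    postings = {}
    for doc_id, tokens in enumerate(docs):
        for term in set(tokens):
            postings.setdefault(term, set()).add(doc_id)
    vocab = sorted(postings)
    rel = {t: {u: 0.0 for u in vocab} for t in vocab}
    for t in vocab:
        rel[t][t] = 1.0
    for t, u in combinations(vocab, 2):
        shared = len(postings[t] & postings[u])
        union = len(postings[t] | postings[u])
        rel[t][u] = rel[u][t] = shared / union
    return vocab, rel


def implicit_relation(vocab, explicit):
    """Assumed underlying relation: cosine similarity of two terms'
    explicit-relation profiles, measured over all third terms."""
    rel = {t: {u: 0.0 for u in vocab} for t in vocab}
    for t in vocab:
        rel[t][t] = 1.0
    for t, u in combinations(vocab, 2):
        others = [v for v in vocab if v != t and v != u]
        dot = sum(explicit[t][v] * explicit[u][v] for v in others)
        norm_t = sqrt(sum(explicit[t][v] ** 2 for v in others))
        norm_u = sqrt(sum(explicit[u][v] ** 2 for v in others))
        if norm_t > 0 and norm_u > 0:
            rel[t][u] = rel[u][t] = dot / (norm_t * norm_u)
    return rel


def combine(vocab, explicit, implicit, alpha=0.5):
    """Assumed integration: linear blend; alpha is a hypothetical weight."""
    return {
        t: {u: alpha * explicit[t][u] + (1 - alpha) * implicit[t][u]
            for u in vocab}
        for t in vocab
    }


def enrich(tokens, vocab, relation):
    """Spread raw term counts over related terms (count vector times
    relation matrix) to obtain an enriched document vector."""
    counts = {t: tokens.count(t) for t in vocab}
    return [sum(counts[t] * relation[t][u] for t in vocab) for u in vocab]


# Toy corpus: "car" and "automobile" never share a document,
# but both co-occur with "engine" and "road".
docs = [
    ["car", "engine", "road"],
    ["automobile", "engine", "road"],
    ["cluster", "document", "term"],
]
vocab, exp_rel = explicit_relation(docs)
imp_rel = implicit_relation(vocab, exp_rel)
relation = combine(vocab, exp_rel, imp_rel)
enriched = enrich(docs[0], vocab, relation)
```

In this toy corpus the explicit relation between "car" and "automobile" is zero because they never co-occur, while the underlying relation is high because both interact with the same third terms. The enriched vector for the first document therefore carries non-zero weight on "automobile", which is the kind of effect the abstract attributes to combining the two relations.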



Author information

Authors and Affiliations

Department of Computer Science and Technology, Tongji University, Shanghai, China
Xin Cheng, Duoqian Miao & Lei Wang


Corresponding author

Correspondence to Xin Cheng.

Editor information

Editors and Affiliations

Tongji University, Shanghai, China
Duoqian Miao
Department of Electrical and Computer En, University of Alberta, Edmonton, Alberta, Canada
Witold Pedrycz
University of Warsaw, Warsaw, Poland
Dominik Ślȩzak
University of Applied Sciences, München, Germany
Georg Peters
Tianjin University, Tianjin, China
Qinghua Hu
Tongji University, Shanghai, China
Ruizhi Wang


About this paper

Cite this paper

Cheng, X., Miao, D., Wang, L. (2014). A Statistics-Based Semantic Relation Analysis Approach for Document Clustering. In: Miao, D., Pedrycz, W., Ślȩzak, D., Peters, G., Hu, Q., Wang, R. (eds) Rough Sets and Knowledge Technology. RSKT 2014. Lecture Notes in Computer Science, vol 8818. Springer, Cham. https://doi.org/10.1007/978-3-319-11740-9_31


DOI: https://doi.org/10.1007/978-3-319-11740-9_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11739-3
Online ISBN: 978-3-319-11740-9
eBook Packages: Computer Science, Computer Science (R0)
