A Statistics-Based Semantic Relation Analysis Approach for Document Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8818)

Abstract

Document clustering is a widely researched topic in machine learning. A number of approaches have been proposed to represent and cluster documents. One recent trend in document clustering research is to incorporate semantic information into the document representation. In this paper, we introduce a novel technique for capturing robust and reliable semantic information from term-term co-occurrence statistics. First, we propose a novel method to evaluate the explicit semantic relation between terms from their co-occurrence information. Then, the underlying semantic relation between terms is captured through their interaction with other terms. Lastly, these two complementary semantic relations are integrated to capture the complete semantic information from the original documents. Experimental results show that clustering performance improves significantly when the document representation is enriched with this semantic information.
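
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the three steps it names, under assumptions chosen for illustration: the explicit relation is taken to be a Jaccard-style document co-occurrence score, the underlying relation is taken to be the cosine similarity between two terms' explicit-relation profiles over all other terms, and the two are combined by a linear weight `alpha`. The function names, the Jaccard and cosine choices, and `alpha` are all hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def explicit_relation(docs):
    """Jaccard-style co-occurrence score for each term pair (assumed measure)."""
    vocab = sorted({t for d in docs for t in d})
    df = {t: 0 for t in vocab}   # number of documents containing each term
    co = {}                      # number of documents containing both terms
    for d in docs:
        terms = sorted(set(d))
        for t in terms:
            df[t] += 1
        for a, b in combinations(terms, 2):
            co[(a, b)] = co.get((a, b), 0) + 1
    rel = {t: {u: 0.0 for u in vocab} for t in vocab}
    for t in vocab:
        rel[t][t] = 1.0
    for (a, b), n in co.items():
        rel[a][b] = rel[b][a] = n / (df[a] + df[b] - n)
    return vocab, rel


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def implicit_relation(vocab, rel):
    """Compare two terms through their explicit relations to all OTHER terms."""
    imp = {t: {} for t in vocab}
    for t in vocab:
        for u in vocab:
            if t == u:
                imp[t][u] = 1.0
                continue
            others = [w for w in vocab if w not in (t, u)]
            imp[t][u] = cosine([rel[t][w] for w in others],
                               [rel[u][w] for w in others])
    return imp


def combined_relation(docs, alpha=0.5):
    """Linear mix of explicit and implicit relations (alpha is an assumed weight)."""
    vocab, exp = explicit_relation(docs)
    imp = implicit_relation(vocab, exp)
    mixed = {t: {u: alpha * exp[t][u] + (1 - alpha) * imp[t][u] for u in vocab}
             for t in vocab}
    return vocab, mixed


def enrich(doc, vocab, rel):
    """Spread each term's frequency onto related terms to form a document vector."""
    tf = {t: doc.count(t) for t in vocab}
    return [sum(tf[u] * rel[u][t] for u in vocab) for t in vocab]


docs = [["car", "engine"], ["automobile", "engine"],
        ["car", "wheel"], ["automobile", "wheel"],
        ["banana", "fruit"]]
vocab, rel = combined_relation(docs)
vector = enrich(["car"], vocab, rel)
```

In this toy corpus "car" and "automobile" never appear in the same document, so their explicit score is zero, yet they share the same neighbours ("engine", "wheel") and therefore receive a non-zero combined score. A document containing only "car" then gets some weight on "automobile" in its enriched vector, which is the kind of effect the abstract attributes to combining the two relations.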

Author information

Corresponding author

Correspondence to Xin Cheng.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Cheng, X., Miao, D., Wang, L. (2014). A Statistics-Based Semantic Relation Analysis Approach for Document Clustering. In: Miao, D., Pedrycz, W., Ślȩzak, D., Peters, G., Hu, Q., Wang, R. (eds) Rough Sets and Knowledge Technology. RSKT 2014. Lecture Notes in Computer Science, vol 8818. Springer, Cham. https://doi.org/10.1007/978-3-319-11740-9_31

  • DOI: https://doi.org/10.1007/978-3-319-11740-9_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11739-3

  • Online ISBN: 978-3-319-11740-9

  • eBook Packages: Computer Science, Computer Science (R0)
