A Survey of Chinese Text Similarity Computation

  • Conference paper
Information Retrieval Technology (AIRS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Abstract

Chinese text has no natural delimiter between words. Moreover, Chinese is a semotactic language whose complicated structures focus on semantics. These differences from Western languages make Chinese word segmentation more difficult and pose additional challenges for Chinese natural language understanding. Computing Chinese text similarity with high precision and recall at low cost is therefore an important but challenging task, and researchers have studied it for a long time. In this paper, we examine existing Chinese text similarity measures, including measures based on statistics and measures based on semantics. Our work provides insights into the advantages and disadvantages of each method, including the tradeoffs between effectiveness and efficiency. Directions for future work are also discussed.
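
As a rough illustration of what a statistics-based measure can look like when no word delimiter is available, the sketch below computes cosine similarity over overlapping character bigrams, which sidesteps word segmentation entirely. It is not taken from the paper: the choice of bigrams, the raw-count weighting, and the function names `char_bigrams` and `cosine_similarity` are all assumptions made for this example.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Count overlapping character bigrams, ignoring whitespace.

    Hypothetical helper: bigrams stand in for words so that no
    segmentation step is needed.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two bigram count vectors."""
    vec_a = char_bigrams(text_a)
    vec_b = char_bigrams(text_b)
    # Dot product over the bigrams of text_a; a Counter returns 0 for
    # bigrams that do not occur in text_b.
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Texts shorter than two characters have no bigrams.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Two of the four bigrams in each sentence are shared.
    print(cosine_similarity("我喜欢学习", "我喜欢音乐"))
```

A measure of this kind is cheap to compute but sees only surface overlap; it cannot tell that two differently worded sentences mean the same thing, which is the gap that semantics-based measures aim to close.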

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, X., Ju, S., Wu, S. (2008). A Survey of Chinese Text Similarity Computation. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_69

  • DOI: https://doi.org/10.1007/978-3-540-68636-1_69

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68633-0

  • Online ISBN: 978-3-540-68636-1

  • eBook Packages: Computer Science, Computer Science (R0)
