Using Word Clusters to Detect Similar Web Documents

  • Jonathan Koberstein
  • Yiu-Kai Ng
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4092)


It is relatively easy to detect exact matches in Web documents; however, detecting similar content in distinct Web documents with different words and sentence structures is a much more difficult task. A reliable tool for determining the degree of similarity between any two Web documents could help filter or retain Web documents with similar content. Most methods for detecting similarity between documents rely on some kind of textual fingerprinting or on searching for exactly matched substrings. This may not be sufficient, as changing the sentence structure or replacing words with synonyms can cause sentences with similar or identical content to be treated as different. In this paper, we develop a sentence-based Fuzzy Set Information Retrieval (IR) approach that uses word clusters, which capture the similarity between different words, to discover similar documents. Our approach has the advantage of detecting documents with similar, but not necessarily identical, sentences based on fuzzy-word sets. We consider three fuzzy-word clustering techniques, the correlation cluster, the association cluster, and the metric cluster, each of which generates word-to-word correlation values. Experimental results show that, by adopting the metric cluster, our similarity detection approach achieves a high accuracy rate in detecting similar documents and improves on previous Fuzzy Set IR approaches based solely on the correlation cluster.
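
To make the idea of word-to-word correlation values and fuzzy-word sets more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the textbook fuzzy set IR model, in which the correlation factor between two words is n_ab / (n_a + n_b - n_ab), computed from document co-occurrence counts, and a word's degree of membership in a sentence is 1 minus the product of (1 - c) over the sentence's words. All function names and the toy corpus are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def correlation_factors(docs):
    """Build word-to-word correlation factors from a list of word sets.

    Assumed formula (textbook fuzzy set IR model, not taken from the paper):
    c(a, b) = n_ab / (n_a + n_b - n_ab), where n_a is the number of documents
    containing a and n_ab the number containing both a and b.
    """
    doc_count = defaultdict(int)
    pair_count = defaultdict(int)
    for words in docs:
        for w in words:
            doc_count[w] += 1
        for a, b in combinations(sorted(words), 2):
            pair_count[(a, b)] += 1

    corr = {}
    for (a, b), n_ab in pair_count.items():
        c = n_ab / (doc_count[a] + doc_count[b] - n_ab)
        corr[(a, b)] = c
        corr[(b, a)] = c  # the factor is symmetric
    return corr


def word_sentence_membership(word, sentence, corr):
    """Degree to which `word` belongs to the fuzzy set of `sentence`."""
    prod = 1.0
    for w in sentence:
        # A word is fully correlated with itself; unseen pairs count as 0.
        c = 1.0 if w == word else corr.get((word, w), 0.0)
        prod *= 1.0 - c
    return 1.0 - prod


def sentence_similarity(s1, s2, corr):
    """Average membership of the words of s1 in s2 (asymmetric by design)."""
    if not s1:
        return 0.0
    return sum(word_sentence_membership(w, s2, corr) for w in s1) / len(s1)


# Toy corpus: each document is reduced to its set of (stemmed) words.
docs = [{"car", "auto"}, {"car", "auto", "road"}, {"road", "map"}]
corr = correlation_factors(docs)

print(sentence_similarity(["car"], ["auto"], corr))
print(sentence_similarity(["map"], ["car"], corr))
```

In this toy corpus "car" and "auto" always co-occur, so a sentence containing only one of them is still treated as fully similar to a sentence containing only the other, whereas "map" and "car" never co-occur and contribute nothing. An association or metric cluster would replace `correlation_factors` with a different way of computing the same word-to-word table.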


Information Retrieval; Correlation Factor; Similar Document; Correlation Cluster; Distinct Word
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jonathan Koberstein (1)
  • Yiu-Kai Ng (1)
  1. Computer Science Department, Brigham Young University, Provo, USA
