Using Word Clusters to Detect Similar Web Documents

Koberstein, Jonathan; Ng, Yiu-Kai

doi:10.1007/11811220_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4092)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

It is relatively easy to detect exact matches in Web documents; detecting similar content in distinct Web documents that use different words and sentence structures is a much more difficult task. A reliable tool for determining the degree of similarity between any two Web documents could help filter or retain Web documents with similar content. Most methods for detecting similarity between documents rely on some kind of textual fingerprinting or on searching for exactly matching substrings. This may not be sufficient, because changing the sentence structure or replacing words with synonyms can cause sentences with similar or identical content to be treated as different. In this paper, we develop a sentence-based Fuzzy Set Information Retrieval (IR) approach that discovers similar documents by using word clusters, which capture the similarity between different words. Our approach has the advantage of detecting documents with similar, but not necessarily identical, sentences based on fuzzy-word sets. We consider three fuzzy-word clustering techniques, the correlation cluster, the association cluster, and the metric cluster, each of which generates word-to-word correlation values. Experimental results show that, by adopting the metric cluster, our similarity detection approach achieves a high accuracy rate in detecting similar documents and improves on previous Fuzzy Set IR approaches based solely on the correlation cluster.
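
The abstract does not give formulas, so the following Python sketch only illustrates the general idea of a sentence-based fuzzy-set comparison driven by a distance-based (metric-style) word cluster. Everything in it is an assumption made for illustration: the function names are hypothetical, the correlation of two words is taken to be their average inverse distance when they co-occur in a document, the membership of a word in a sentence is taken to be 1 minus the product of (1 - correlation) over the sentence's words, and sentence similarity is the mean membership. The paper's actual definitions and normalization may differ.

```python
from collections import defaultdict
from itertools import combinations


def build_metric_cluster(docs):
    """Return {(u, v): correlation} with u < v, from tokenized documents.

    Assumption: correlation = average of 1/distance over every pair of
    positions at which the two distinct words co-occur in a document.
    """
    total = defaultdict(float)
    pairs = defaultdict(int)
    for words in docs:
        for i, j in combinations(range(len(words)), 2):
            u, v = words[i], words[j]
            if u == v:
                continue
            key = (u, v) if u < v else (v, u)
            total[key] += 1.0 / (j - i)  # j > i, so the distance is >= 1
            pairs[key] += 1
    return {key: total[key] / pairs[key] for key in total}


def correlation(cluster, u, v):
    """Symmetric lookup; a word is fully correlated with itself."""
    if u == v:
        return 1.0
    key = (u, v) if u < v else (v, u)
    return cluster.get(key, 0.0)


def membership(cluster, word, sentence):
    """Fuzzy degree to which `word` belongs to `sentence` (a list of words)."""
    product = 1.0
    for other in sentence:
        product *= 1.0 - correlation(cluster, word, other)
    return 1.0 - product


def sentence_similarity(cluster, s1, s2):
    """Mean membership of the words of s1 in s2 (not symmetric in general)."""
    if not s1:
        return 0.0
    return sum(membership(cluster, w, s2) for w in s1) / len(s1)


if __name__ == "__main__":
    docs = [["fast", "car", "engine"], ["quick", "automobile", "engine"]]
    cluster = build_metric_cluster(docs)
    print(sentence_similarity(cluster, ["fast", "car"], ["car", "engine"]))
```

Under these assumptions, a sentence compared with itself scores 1.0, and two sentences whose words never co-occur in the training documents score 0.0. Because the measure is asymmetric, a document-level comparison would typically evaluate it in both directions before deciding that two sentences match.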

Author information

Authors and Affiliations

Computer Science Department, Brigham Young University, Provo, UT, 84602, USA
Jonathan Koberstein & Yiu-Kai Ng

Editor information

Editors and Affiliations

IRIT, UPS, F-31062 Toulouse Cédex 9, France
Jérôme Lang
Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Fangzhen Lin
Guangxi Normal University, Guilin, China
Ju Wang

About this paper

Cite this paper

Koberstein, J., Ng, Y.-K. (2006). Using Word Clusters to Detect Similar Web Documents. In: Lang, J., Lin, F., Wang, J. (eds) Knowledge Science, Engineering and Management. KSEM 2006. Lecture Notes in Computer Science, vol 4092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11811220_19

DOI: https://doi.org/10.1007/11811220_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37033-8
Online ISBN: 978-3-540-37035-2
eBook Packages: Computer Science; Computer Science (R0)
