A Meta Search Approach to Find Similarity between Web Pages Using Different Similarity Measures

Singh, Jaskirat; Kumar, Mukesh

doi:10.1007/978-3-642-18440-6_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 125)

Included in the following conference series:

International Conference on Advances in Computing, Communication and Control

Abstract

Search engines are online services used to locate information on the World Wide Web. As the web grows rapidly, the number of pages that are similar to one another also increases, so a system that can discover similar web pages is useful. In this paper, a meta search approach is applied to information retrieval: pages are retrieved from the result lists of different search engines, and the content of those pages is analyzed to determine the similarity between them. Web pages are represented in a vector space, in which each web document is a vector and the terms present in that page are its components. Similarity is computed using three measures: cosine similarity, the Jaccard coefficient, and the Dice coefficient. A comparative analysis of these measures is carried out to determine which performs better in terms of both precision and recall.
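
As an illustration of the three measures named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes raw term-frequency weights, a simple lowercase alphanumeric tokenizer, and the vector-space (extended) forms of the Jaccard and Dice coefficients; the abstract does not state the term weighting or tokenization actually used, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a page's text to a sparse term-frequency vector (term -> count).

    Assumption: raw counts and a naive tokenizer; the paper may weight
    terms differently (for example with tf-idf).
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def dot(a, b):
    """Dot product of two sparse vectors stored as term -> weight maps."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def cosine(a, b):
    """Cosine similarity: a.b / (|a| * |b|)."""
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    """Extended Jaccard coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


def dice(a, b):
    """Dice coefficient: 2 * a.b / (|a|^2 + |b|^2)."""
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0


if __name__ == "__main__":
    page_a = term_vector("web search engine ranking")
    page_b = term_vector("web search similarity measure")
    for name, measure in (("cosine", cosine), ("jaccard", jaccard), ("dice", dice)):
        print(name, round(measure(page_a, page_b), 3))
```

With binary (0/1) term weights these formulas reduce to the familiar set-based Jaccard and Dice definitions, so the same functions cover both variants.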

Author information

Authors and Affiliations

University Institute of Engineering and Technology, Punjab University, Chandigarh, India
Jaskirat Singh & Mukesh Kumar

Editor information

Editors and Affiliations

Fr. Conceicao Rodrigues College of Engineering Fr. Agnel Ashram, 400050, Bandstand, Bandra (West) Mumbai, India
Srija Unnikrishnan, Sunil Surve & Deepak Bhoir

About this paper

Cite this paper

Singh, J., Kumar, M. (2011). A Meta Search Approach to Find Similarity between Web Pages Using Different Similarity Measures. In: Unnikrishnan, S., Surve, S., Bhoir, D. (eds) Advances in Computing, Communication and Control. ICAC3 2011. Communications in Computer and Information Science, vol 125. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-18440-6_19

DOI: https://doi.org/10.1007/978-3-642-18440-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-18439-0
Online ISBN: 978-3-642-18440-6
eBook Packages: Computer Science, Computer Science (R0)
