Abstract
Search engines are online services used to locate information on the World Wide Web. As the web grows rapidly, the number of pages that are similar to one another also increases, so a system that can discover similar web pages is useful. In this paper, a meta search approach is applied for information retrieval: pages are retrieved from the result lists of different search engines, and the content of those pages is analyzed to find the similarity between them. Web pages are represented in a vector space, in which each web document is a vector and the terms present in that page are its components. Similarity is computed using three similarity measures: Cosine Similarity, Jaccard's Coefficient and Dice Coefficient. A comparative analysis of these measures is carried out to determine which performs better in terms of precision and recall.
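As an illustration of the three measures named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, raw term-frequency weights for the cosine measure, and the set-based (binary term presence) forms of the Jaccard and Dice coefficients. The function names and the two sample strings are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector (term -> count) from whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Dot product of the two term vectors divided by the product of their norms."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Shared terms divided by all distinct terms (set-based form, an assumption)."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def dice_coefficient(a, b):
    """Twice the shared terms divided by the sum of both term-set sizes."""
    terms_a, terms_b = set(a), set(b)
    total = len(terms_a) + len(terms_b)
    if total == 0:
        return 0.0
    return 2 * len(terms_a & terms_b) / total


# Hypothetical page contents, used only to exercise the functions.
page_1 = term_vector("web search engine ranking")
page_2 = term_vector("web search similarity measure")

cos = cosine_similarity(page_1, page_2)
jac = jaccard_coefficient(page_1, page_2)
dice = dice_coefficient(page_1, page_2)
print(f"cosine={cos:.3f} jaccard={jac:.3f} dice={dice:.3f}")
```

The two sample pages share two of their four terms each, so the cosine and Dice values come out to 0.5 and the Jaccard value to 2/6. A real system would likely add stop-word removal, stemming and a weighting scheme such as TF-IDF, none of which the abstract specifies.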
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Singh, J., Kumar, M. (2011). A Meta Search Approach to Find Similarity between Web Pages Using Different Similarity Measures. In: Unnikrishnan, S., Surve, S., Bhoir, D. (eds) Advances in Computing, Communication and Control. ICAC3 2011. Communications in Computer and Information Science, vol 125. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-18440-6_19
DOI: https://doi.org/10.1007/978-3-642-18440-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-18439-0
Online ISBN: 978-3-642-18440-6
eBook Packages: Computer Science, Computer Science (R0)