A Meta Search Approach to Find Similarity between Web Pages Using Different Similarity Measures

  • Conference paper
Advances in Computing, Communication and Control (ICAC3 2011)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 125)

Abstract

Search engines are online services used to locate information on the World Wide Web. As the web grows rapidly, the number of pages that are similar to each other is also increasing, so a system that can discover similar web pages is useful. In this paper, a meta search approach is applied to information retrieval: pages are retrieved from the result lists of different search engines, and the content of those pages is analyzed so that the system can find the similarity between them. Web pages are represented in a vector space, in which each web document is a vector and the terms present in that page are its components. Similarity is computed using different similarity measures, namely cosine similarity, the Jaccard coefficient, and the Dice coefficient. A comparative analysis of these measures is carried out to find out which performs better in terms of both precision and recall.
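The three measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the raw term-frequency weighting, the set-based forms of the Jaccard and Dice coefficients, and all function names (`term_vector`, `cosine`, `jaccard`, `dice`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a page as a term-frequency vector (assumed: lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard coefficient on term sets: |A ∩ B| / |A ∪ B| (term frequencies are ignored)."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient on term sets: 2 |A ∩ B| / (|A| + |B|) (term frequencies are ignored)."""
    terms_a, terms_b = set(a), set(b)
    total = len(terms_a) + len(terms_b)
    return 2 * len(terms_a & terms_b) / total if total else 0.0


# Two toy "pages" that share two of their three terms.
page_a = term_vector("web search engine")
page_b = term_vector("web search pages")

print(cosine(page_a, page_b))
print(jaccard(page_a, page_b))
print(dice(page_a, page_b))
```

For the same pair of pages the Dice coefficient is never lower than the Jaccard coefficient, which is one reason a comparison by precision and recall at a fixed similarity threshold can favour one measure over another.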




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Singh, J., Kumar, M. (2011). A Meta Search Approach to Find Similarity between Web Pages Using Different Similarity Measures. In: Unnikrishnan, S., Surve, S., Bhoir, D. (eds) Advances in Computing, Communication and Control. ICAC3 2011. Communications in Computer and Information Science, vol 125. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-18440-6_19


  • DOI: https://doi.org/10.1007/978-3-642-18440-6_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-18439-0

  • Online ISBN: 978-3-642-18440-6

  • eBook Packages: Computer Science, Computer Science (R0)
