Latent Semantic Analysis Evaluation of Conceptual Dependency Driven Focused Crawling

  • Conference paper
Multimedia Communications, Services and Security (MCSS 2012)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 287)

Abstract

In this paper we study a focused crawler driven by the deep semantic analysis provided by Conceptual Dependency (CD) theory. We test in practice the application of CD scripts as an approach to defining topics (queries) in a focused crawler, and its robustness in evaluating real text structures extracted from HTML documents. To benchmark its efficiency against classical approaches, we complement human evaluation with an evaluation of the result set based on its internal similarity, computed using Latent Semantic Analysis (LSA). The measurements lead us to conclude that CD theory is well suited to evaluating the similarity of HTML documents given a specific query, as it achieves high precision as measured through human evaluation. At the same time, we observe the drawbacks of LSA used in the same context.
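
The abstract does not spell out how the internal similarity of a result set is computed, so the following is only a minimal sketch of one common reading: project the documents into a low-rank LSA space with a truncated SVD and average the pairwise cosine similarities. The function name `lsa_internal_similarity`, the whitespace tokenizer, the raw term counts, and the rank `k` are all illustrative assumptions, not the authors' setup.

```python
import numpy as np

def lsa_internal_similarity(docs, k=2):
    """Mean pairwise cosine similarity of docs in a k-dimensional LSA space.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(docs) < 2:
        raise ValueError("need at least two documents")

    # Term-document count matrix (rows: terms, columns: documents).
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for t in toks:
            A[index[t], j] += 1.0

    # Truncated SVD: keep only the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T  # one row per document

    # Normalise rows, guarding against all-zero vectors.
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = doc_vecs / norms
    sims = unit @ unit.T

    # Average over distinct document pairs (upper triangle, no diagonal).
    iu = np.triu_indices(len(docs), k=1)
    return float(sims[iu].mean())
```

Under these assumptions, a result set of near-duplicate pages scores close to 1, while pages that share no vocabulary score close to 0 when the full rank is kept. A high score therefore signals lexical coherence of the result set rather than relevance to the query, which is one plausible way to read the drawbacks of LSA that the abstract mentions.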




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dorosz, K., Korzycki, M. (2012). Latent Semantic Analysis Evaluation of Conceptual Dependency Driven Focused Crawling. In: Dziech, A., Czyżewski, A. (eds) Multimedia Communications, Services and Security. MCSS 2012. Communications in Computer and Information Science, vol 287. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30721-8_8

  • DOI: https://doi.org/10.1007/978-3-642-30721-8_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-30720-1

  • Online ISBN: 978-3-642-30721-8

  • eBook Packages: Computer Science, Computer Science (R0)
