Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Indexing the Web

  • Edleno Silva de Moura
  • Marco Antonio Cristo
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_1145


Web indexing


The process of collecting, parsing, and storing data to provide fast and accurate retrieval of content available on the web. The result of this process is a structure called an index, which maps the collected data (for instance, words, phrases, concepts, or sound fragments) to the web locations where content associated with that data can be found (for instance, pages containing these words, phrases, or concepts, or music containing the sound fragments). Depending on the data collected, several indices may be created. The process can be manual or automatic. Manually generated indices include web directories, back-of-book-style indices, and metadata. Automatically generated indices are normally associated with the infrastructure of search engines.
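
The mapping described above, from collected data to web locations, can be illustrated with a small inverted index over words. The following is only a minimal sketch under stated assumptions: the page URLs and contents are hypothetical, the helper names (tokenize, build_index, lookup) are illustrative and do not come from the entry, and a real search engine index would also handle crawling, ranking, compression, and non-textual data.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose content contains it."""
    index = defaultdict(set)
    for url, content in pages.items():
        for word in tokenize(content):
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the sorted URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    # Copy the first posting set so intersecting does not mutate the index.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Hypothetical pages standing in for collected web content.
pages = {
    "http://example.org/a": "Search engines index the web",
    "http://example.org/b": "A web directory is a manual index",
    "http://example.org/c": "Music with sound fragments",
}
index = build_index(pages)
```

Under these assumptions, lookup(index, "web index") returns the two pages that contain both words, while a query for a word that was never collected returns an empty list.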

Historical Background

One of the first efforts to index web content was developed by an MIT student, Matthew Gray, who created a program to estimate the size of the web. This program, called World Wide Web...


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Edleno Silva de Moura (1, email author)
  • Marco Antonio Cristo (2)
  1. Federal University of Amazonas, Manaus, Brazil
  2. FUCAPI, Manaus, Brazil