Indexing the Web

  • Reference work entry
Encyclopedia of Database Systems

Synonyms

Web indexing

Definition

The process of collecting, parsing, and storing data to provide fast and accurate retrieval of content available on the web. The result of this process is a structure called an index, which maps the collected data (for instance, words, phrases, concepts, or sound fragments) to the web locations where content associated with that data can be found (for instance, pages containing these words, phrases, or concepts, or music containing the sound fragments). Depending on the data collected, several indices may be created. The process can be manual or automatic. Manually generated indices include web directories, back-of-book-style indices, and metadata. Automatically generated indices are normally associated with the infrastructure of search engines.
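The mapping described above can be illustrated with a minimal sketch of an automatically generated word index. This is only one possible reading of the definition, not the method of any particular search engine: the function names `build_index` and `lookup`, the sample URLs, and the simple word tokenization are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it.

    `pages` is a hypothetical dict of {url: page_text}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Start from the first word's URLs and intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, for illustration only.
pages = {
    "http://example.org/a": "Search engines index the web",
    "http://example.org/b": "A web directory is a manual index",
}
index = build_index(pages)
print(sorted(lookup(index, "web index")))
```

Here the index maps words to page locations; an index over phrases, concepts, or sound fragments would follow the same shape with a different extraction step in place of the word tokenizer.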

Historical Background

One of the first efforts to index web content was developed by an MIT student, Matthew Gray, who created a program to estimate the size of the web. This program, called World Wide Web...

Author information

Corresponding author

Correspondence to Edleno Silva de Moura.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Moura, E.S.d., Cristo, M.A. (2018). Indexing the Web. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_1145
