
Scalability Challenges in Web Search Engines

  • Chapter
Advanced Topics in Information Retrieval

Part of the book series: The Information Retrieval Series (INRE, volume 33)

Abstract

The continuous growth of the Web and of user bases forces web search engine companies to make costly investments in very large compute infrastructures. Scaling these infrastructures requires careful performance optimization in every major component of the search engine. Herein, we aim to provide fairly comprehensive coverage of the literature on scalability challenges in large-scale web search engines. We present the identified challenges through an architectural classification, starting from a simple single-node search system and moving towards a hypothetical multi-site web search architecture. We also discuss a number of open research problems and provide recommendations to researchers in the field.


Notes

  1. The size of the indexed Web (visited on February 1, 2010), http://www.worldwidewebsize.com/.

References

  • Agrawal R, Gollapudi S, Halverson A, Ieong S (2009) Diversifying search results. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 5–14

    Chapter  Google Scholar 

  • Altingovde I, Ozcan R, Ulusoy O (2009) A cost-aware strategy for query result caching in web search engines. In: Boughanem M, Berrut C, Mothe J, Soule-Dupuy C (eds) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 5478. Springer, Berlin/Heidelberg, pp 628–636

    Google Scholar 

  • Anh V, Moffat A (2004) Index compression using fixed binary codewords. In: Proceedings of the Australasian Database Conference. Australian Computer Society and Inc, Darlinghurst, pp 61–67

    Google Scholar 

  • Anh V, Moffat A (2006a) Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering 18(6):857–861

    Article  Google Scholar 

  • Anh V, Moffat A (2006b) Pruned query evaluation using pre-computed impacts. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 372–379

    Google Scholar 

  • Anh V, de Kretser O, Moffat A (2001) Vector-space ranking with effective early termination. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 35–42

    Google Scholar 

  • Arasu A, Cho J, Garcia-Molina H, Paepcke A, Raghavan S (2001) Searching the Web. ACM Transactions on Internet Technology 1(1):2–43

    Article  Google Scholar 

  • Badue C, Baeza-Yates R, Ribeiro-Neto BA, Ziviani N (2001) Distributed query processing using partitioned inverted files. In: Proceedings of the International Symposium on String Processing and Information Retrieval, pp 10–20

    Chapter  Google Scholar 

  • Badue C, Baeza-Yates R, Ribeiro-Neto BA, Ziviani A, Ziviani N (2007) Analyzing imbalance among homogeneous index servers in a web search system. Information Processing and Management 43(3):592–608

    Article  Google Scholar 

  • Baeza-Yates R, Ribeiro-Neto B (2010) Modern Information Retrieval, 2nd edn. Addison-Wesley, Reading, MA

    Google Scholar 

  • Baeza-Yates R, Saint-Jean F (2003) A three level search engine index based in query log distribution. In: Nascimento M, de Moura E, Oliveira A (eds) Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 2857. Springer, Berlin/Heidelberg, pp 56–65

    Chapter  Google Scholar 

  • Baeza-Yates R, Junqueira F, Plachouras V, Witschel H (2007a) Admission policies for caches of search engine results. In: Ziviani N, Baeza-Yates R (eds) Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 4726. Springer, Berlin/Heidelberg, pp 74–85

    Chapter  Google Scholar 

  • Baeza-Yates R, Gionis A, Junqueira F, Murdock V, Plachouras V, Silvestri F (2007b) The impact of caching on search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 183–190

    Google Scholar 

  • Baeza-Yates R, Castillo C, Junqueira F, Plachouras V, Silvestri F (2007c) Challenges in distributed information retrieval. In: Proceedings of the International Conference on Data Engineering. IEEE CS, New York, NY, pp 6–20

    Google Scholar 

  • Baeza-Yates R, Gionis A, Junqueira F, Plachouras V, Telloli L (2009a) On the feasibility of multi-site web search engines. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 425–434

    Chapter  Google Scholar 

  • Baeza-Yates R, Murdock V, Hauff C (2009b) Efficiency trade-offs in two-tier web search systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 163–170

    Google Scholar 

  • Barroso L, Hölzle U (2009) The Datacenter as a Computer. Synthesis Lectures on Computer Architecture. Morgan & Claypool

    Google Scholar 

  • Barroso L, Dean J, Hölzle U (2003) Web search for a planet: The Google cluster architecture. IEEE Micro 23(2):22–28

    Article  Google Scholar 

  • Bharat K, Broder AZ (1999) Mirror, mirror on the Web: A study of host pairs with replicated content. In: Proceedings of the International Conference on the World Wide Web. Elsevier/North-Holland, New York, NY, pp 1579–1590

    Google Scholar 

  • Bharat K, Broder A, Dean J, Henzinger M (2000) A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science 51(12):1114–1122

    Article  Google Scholar 

  • Blanco R, Barreiro A (2006) TSP and cluster-based solutions to the reassignment of document identifiers. Journal of Information Retrieval 9(4):499–517

    Article  Google Scholar 

  • Blanco R, Bortnikov E, Junqueira F, Lempel R, Telloli L, Zaragoza H (2010) Caching search engine results over incremental indices. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 82–89

    Google Scholar 

  • Blandford D, Blelloch G (2002) Index compression through document reordering. In: Proceedings of the Data Compression Conference. IEEE Computer Society, Washington, DC, pp 342–351

    Google Scholar 

  • Boldi P, Codenotti B, Santini M, Vigna S (2004) UbiCrawler: a scalable fully distributed web crawler. Software: Practice and Experience 34(8):711–726

    Article  Google Scholar 

  • Boldi P, Bonchi F, Castillo C, Donato D, Gionis A, Vigna S (2008) The query-flow graph: Model and applications. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 609–618

    Google Scholar 

  • Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1–7):107–117

    Article  Google Scholar 

  • Broder A, Glassman S, Manasse M, Zweig G (1997) Syntactic clustering of the Web. Computer Networks and ISDN Systems 29:1157–1166

    Article  Google Scholar 

  • Broder A, Carmel D, Herscovici M, Soffer A, Zien J (2003a) Efficient query evaluation using a two-level retrieval process. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 426–434

    Google Scholar 

  • Broder A, Najork M, Wiener J (2003b) Efficient URL caching for World Wide Web crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 679–689

    Google Scholar 

  • Brown E (1995) Fast evaluation of structured queries for information retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 30–38

    Google Scholar 

  • Buckley C, Lewit A (1985) Optimization of inverted vector searches. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 97–110

    Google Scholar 

  • Büttcher S, Clarke C (2005) Indexing time vs query time: trade-offs in dynamic information retrieval systems. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 317–318

    Google Scholar 

  • Büttcher S, Clarke C, Lushman B (2006a) Hybrid index maintenance for growing text collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 356–363

    Google Scholar 

  • Büttcher S, Clarke C, Lushman B (2006b) Term proximity scoring for ad-hoc retrieval on very large text collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 621–622

    Google Scholar 

  • Cacheda F, Carneiro V, Plachouras V, Ounis I (2007) Performance analysis of distributed information retrieval architectures using an improved network simulation model. Information Processing and Management 43(1):204–224

    Article  Google Scholar 

  • Cahoon B, McKinley K, Lu Z (2000) Evaluating the performance of distributed architectures for information retrieval using a variety of workloads. ACM Transactions on Information Systems 18(1):1–43

    Article  Google Scholar 

  • Callan J, Lu Z, Croft W (1995b) Searching distributed collections with inference networks. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 21–28

    Google Scholar 

  • Cambazoglu B, Aykanat C (2006) Performance of query processing implementations in ranking-based text retrieval systems using inverted indices. Information Processing and Management 42(4):875–898

    Article  Google Scholar 

  • Cambazoglu B, Turk A, Aykanat C (2004) Data-parallel web crawling models. In: Proceedings of the Symposium on Computer and Information Sciences. Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp 801–809

    Google Scholar 

  • Cambazoglu B, Plachouras V, Junqueira F, Telloli L (2008) On the feasibility of geographically distributed web crawling. In: Proceedings of the International Conference on Scalable Information Systems. ICST (Institute for Computer Sciences and Social-Informatics and Telecommunications Engineering), ICST, Brussels, pp 1–10

    Google Scholar 

  • Cambazoglu B, Plachouras V, Baeza-Yates R (2009) Quantifying performance and quality gains in distributed web search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 411–418

    Google Scholar 

  • Cambazoglu B, Zaragoza H, Chapelle O, Chen J, Liao C, Zheng Z, Degenhardt J (2010a) Early exit optimizations for additive machine learned ranking systems. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 411–420

    Chapter  Google Scholar 

  • Cambazoglu B, Varol E, Kayaaslan E, Aykanat C, Baeza-Yates R (2010b) Query forwarding in geographically distributed search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 90–97

    Google Scholar 

  • Cambazoglu B, Junqueira F, Plachouras V, Banachowski S, Cui B, Lim S, Bridge B (2010c) A refreshing perspective of search engine caching. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 181–190

    Chapter  Google Scholar 

  • Carmel D, Cohen D, Fagin R, Farchi E, Herscovici M, Maarek Y, Soffer A (2001) Static index pruning for information retrieval systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 43–50

    Google Scholar 

  • Castillo C (2003) Cooperation schemes between a web server and a web search engine. In: Proceedings of the Latin American Conference on World Wide Web. IEEE CS, New York, NY, pp 212–213

    Google Scholar 

  • Chakrabarti S, van den Berg M, Dom B (1999) Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks and ISDN Systems 31(11–16):1623–1640

    Google Scholar 

  • Cho J, Garcia-Molina H (2000) The evolution of the Web and implications for an incremental crawler. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 200–209

    Google Scholar 

  • Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 124–135

    Google Scholar 

  • Cho J, Garcia-Molina H (2003) Effective page refresh policies for web crawlers. ACM Transactions on Database Systems 28(4):390–426

    Article  Google Scholar 

  • Cho J, Garcia-Molina H, Page L (1998) Efficient crawling through URL ordering. Computer Networks and ISDN Systems 30(1–7):161–172

    Article  Google Scholar 

  • Cho J, Shivakumar N, Garcia-Molina H (2000) Finding replicated web collections. ACM SIGMOD Record 29(2):355–366

    Article  Google Scholar 

  • Chowdhury A, Pass G (2003) Operational requirements for scalable search systems. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 435–442

    Google Scholar 

  • Chowdhury A, Frieder O, Grossman D, McCabe M (2002) Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems 20(2):171–191

    Article  Google Scholar 

  • Chung C, Clarke CA (2002) Topic-oriented collaborative crawling. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 34–42

    Google Scholar 

  • Clarke CA, Cormack G, Burkowski F (1994) Fast inverted indexes with on-line update. Tech Rep CS-94-40, University of Waterloo

    Google Scholar 

  • Clarke CA, Agichtein E, Dumais S, White R (2007) The influence of caption features on clickthrough patterns in web search. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 135–142

    Google Scholar 

  • Cooper J, Coden A, Brown E (2002) Detecting similar documents using salient terms. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 245–251

    Google Scholar 

  • Cutting D, Pedersen J (1990) Optimization for dynamic inverted index maintenance. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 405–411

    Google Scholar 

  • Dasgupta A, Ghosh A, Kumar R, Olston C, Pandey S, Tomkins A (2007) The discoverability of the Web. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 421–430

    Chapter  Google Scholar 

  • de Kretser O, Moffat A, Shimmin T, Zobel J (1998) Methodologies for distributed information retrieval. In: Proceedings of the International Conference on Distributed Computing Systems. IEEE Computer Society, Washington, DC, p 66

    Google Scholar 

  • Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1)):107–113

    Article  Google Scholar 

  • Diligenti M, Coetzee F, Lawrence S, Giles C, Gori M (2000) Focused crawling using context graphs. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 527–534

    Google Scholar 

  • Ding S, Attenberg J, Suel T (2010) Scalable techniques for document identifier assignment in inverted indexes. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 311–320

    Chapter  Google Scholar 

  • D’Souza D, Thom J, Zobel J (2004) Collection selection for managed distributed document databases. Information Processing and Management 40(3):527–546

    Article  Google Scholar 

  • Edwards J, McCurley K, Tomlin J (2001) An adaptive model for optimizing performance of an incremental web crawler. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 106–113

    Google Scholar 

  • Eichmann D (1995) Ethical web agents. Computer Networks and ISDN Systems 28(1–2):127–136

    Article  Google Scholar 

  • Exposto J, Macedo J, Pina A, Alves A, Rufino J (2005) Geographical partition for distributed web crawling. In: Proceedings of the Workshop on Geographic Information Retrieval. ACM Press, New York, NY, pp 55–60

    Chapter  Google Scholar 

  • Exposto J, Macedo J, Pina A, Alves A, Rufino J (2008) Efficient partitioning strategies for distributed web crawling. In: Proceedings of the International Conference on Information Networking: Towards Ubiquitous Networking and Services. Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp 544–553

    Chapter  Google Scholar 

  • Fagni T, Perego R, Silvestri F, Orlando S (2006) Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems 24(1):51–78

    Article  Google Scholar 

  • Fetterly D, Manasse M, Najork M, Wiener J (2004) A large-scale study of the evolution of web pages. Software: Practice and Experience 34(2):213–237

    Article  Google Scholar 

  • Fetterly D, Craswell N, Vinay V (2009) The impact of crawl policy on web search effectiveness. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 580–587

    Google Scholar 

  • Fox E, Lee W (1991) FAST-INV: A fast algorithm for building large inverted files. Tech Rep 91–10, Virginia Polytechnic Institute and State University

    Google Scholar 

  • Gan Q, Suel T (2009) Improved techniques for result caching in web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 431–440

    Google Scholar 

  • Gao W, Lee H, Miao Y (2006) Geographically focused collaborative crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 287–296

    Chapter  Google Scholar 

  • Gravano L, Garcia-Molina H (1995) Generalizing GlOSS to vector-space databases and broker hierarchies. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 78–89

    Google Scholar 

  • Gyöngyi Z, Garcia-Molina H (2005a) Link spam alliances. In: Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, pp 517–528

    Google Scholar 

  • Gyöngyi Z, Garcia-Molina H (2005b) Web spam taxonomy. http://airweb.cse.lehigh.edu/2005/gyongyi.pdf, visited on February, 2011

  • Gyöngyi Z, Garcia-Molina H, Pedersen J (2004) Combating web spam with TrustRank. In: Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, pp 576–587

    Google Scholar 

  • Harman D, Candela G (1990) Retrieving records from a gigabyte of text on a mini-computer using statistical ranking. Journal of the American Society for Information Science 41(8):581–589

    Article  Google Scholar 

  • Harman D, Baeza-Yates R, Fox E, Lee W (1992) Inverted files. In: Baeza-Yates WBFR (ed) Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Upper Saddle River, NJ, pp 28–43

    Google Scholar 

  • Hawking D (1997) Scalable text retrieval for large digital libraries. In: Proceedings of the European Conference on Digital Libraries. Springer, London, pp 127–145

    Google Scholar 

  • Heinz S, Zobel J (2003) Efficient single-pass index construction for text databases. Journal of the American Society for Information Science 54(8):713–729

    Google Scholar 

  • Henzinger M (2006) Finding near-duplicate web pages: A large-scale evaluation of algorithms. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 284–291

    Google Scholar 

  • Heydon A, Najork M (1999) Mercator: a scalable, extensible web crawler. World Wide Web 2(4):219–229

    Article  Google Scholar 

  • Hirai J, Raghavan S, Garcia-Molina H, Paepcke A (2000) WebBase: a repository of web pages. In: Proceedings of the International Conference on the World Wide Web. North-Holland, Amsterdam, pp 277–293

    Google Scholar 

  • Jeh G, Widom J (2003) Scaling personalized web search. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 271–279

    Google Scholar 

  • Jeong BS, Omiecinski E (1995) Inverted file partitioning schemes in multiple disk systems. IEEE Transactions on Parallel and Distributed Systems 6(2):142–153

    Article  Google Scholar 

  • Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, pp 133–142

    Google Scholar 

  • Jónsson B, Franklin M, Srivastava D (1998) Interaction of query evaluation and buffer management for information retrieval. ACM SIGMOD Record 27(2):118–129

    Article  Google Scholar 

  • Kayaaslan E, Cambazoglu B, Aykanat C (2010) Document replication strategies for geographically distributed Web search engines. To be submitted

    Google Scholar 

  • Kulkarni A, Callan J (2010) Topic-based index partitions for efficient and effective selective search. http://www.lsdsir.org/, visited on February, 2011

  • Larkey L, Connell M, Callan J (2000) Collection selection and results merging with topically organized US patents and TREC data. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 282–289

    Google Scholar 

  • Lawrence S, Giles C (2000) Accessibility of information on the Web. Intelligence 11(1):32–39

    Article  Google Scholar 

  • Lee HT, Leonard D, Wang X, Loguinov D (2008) IRLbot: Scaling to 6 billion pages and beyond. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 427–436

    Google Scholar 

  • Lempel R, Moran S (2003) Predictive caching and prefetching of query results in search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 19–28

    Google Scholar 

  • Lester N, Zobel J, Williams H (2004) In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In: Proceedings of the Australasian Database Conference. Australian Computer Society, Darlinghurst, pp 15–23

    Google Scholar 

  • Lester N, Moffat A, Zobel J (2008) Efficient online index construction for text databases. ACM Transactions on Database Systems 33(3):1–33

    Article  Google Scholar 

  • Lewandowskii D (2008) A three-year study on the freshness of web search engine databases. Journal of Information Science 34(6):817–831

    Article  Google Scholar 

  • Liu X, Croft W (2004) Cluster-based retrieval using language models. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 186–193

    Google Scholar 

  • Liu F, Yu C, Meng W (2002) Personalized web search by mapping user queries to categories. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 558–565

    Google Scholar 

  • Long X, Suel T (2005) Three-level caching for efficient query processing in large web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 257–266

    Chapter  Google Scholar 

  • Lu Z, McKinley K (1999) Partial replica selection based on relevance for information retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 97–104

    Google Scholar 

  • Lu Z, McKinley K (2000) Partial collection replication versus caching for information retrieval systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 248–255

    Google Scholar 

  • Lucchese C, Orlando S, Perego R, Silvestri F (2007) Mining query logs to optimize index partitioning in parallel web search engines. In: Proceedings of the International Conference on Scalable Information Systems. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium, pp 1–9

    Google Scholar 

  • MacFarlane A, McCann J, Robertson S (2000) Parallel search using partitioned inverted files. In: Proceedings of the International Symposium on String Processing Information Retrieval. IEEE Computer Society, Washington, DC, pp 209–220

    Chapter  Google Scholar 

  • Markatos E (2001) On caching search engine query results. Computer Communications 24(2):137–143

    Article  Google Scholar 

  • Melnik S, Raghavan S, Yang B, Garcia-Molina H (2001) Building a distributed full-text index for the Web. ACM Transactions on Information Systems 19(3):217–241

    Article  Google Scholar 

  • Moffat A, Bell TH (1995) In situ generation of compressed inverted files. Journal of the American Society for Information Science 46(7):537–550

    Article  Google Scholar 

  • Moffat A, Stuiver L (2000) Binary interpolative coding for effective index compression. Journal of Information Retrieval 3(1):25–47

    Article  Google Scholar 

  • Moffat A, Zobel J (1996) Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14(4):349–379

    Article  Google Scholar 

  • Moffat A, Webber W, Zobel J, Baeza-Yates R (2007) A pipelined architecture for distributed text query evaluation. Journal of Information Retrieval 10(3):205–231

    Article  Google Scholar 

  • Najork M, Wiener J (2001) Breadth-first crawling yields high-quality pages. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 114–118

    Google Scholar 

  • Ntoulas A, Cho J (2007) Pruning policies for two-tiered inverted index with correctness guarantee. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 191–198

    Google Scholar 

  • Ntoulas A, Cho J, Olston C (2004) What’s new on the Web?: The evolution of the Web from a search engine perspective. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 1–12

    Google Scholar 

  • Olston C, Pandey S (2008) Recrawl scheduling based on information longevity. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 437–446

    Google Scholar 

  • Ozcan R, Altingovde I, Ulusoy O (2008) Static query result caching revisited. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 1169–1170

    Google Scholar 

  • Page L, Brin S, Motwani R, Winograd T (1999) The PageRank citation ranking: Bringing order to the Web. http://ilpubs.stanford.edu:8090/422/, visited on February, 2011

  • Pandey S, Olston C (2005) User-centric web crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 401–411

    Chapter  Google Scholar 

  • Pandey S, Olston C (2008) Crawl ordering by search impact. In: Proceedings of the ACM Conference on web Search and Data Mining. ACM Press, New York, NY, pp 3–14

    Chapter  Google Scholar 

  • Persin M (1994) Document filtering for fast ranking. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 339–348

    Google Scholar 

  • Pitkow J, Schütze H, Cass T, Cooley R, Turnbull D, Edmonds A, Adar E, Breuel T (2002) Personalized search. Communications of the ACM 45(9):50–55

    Google Scholar 

  • Puppin D, Silvestri F, Perego R, Baeza-Yates R (2010) Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load. ACM Transactions on Information Systems 28(2):1–36

    Article  Google Scholar 

  • Radoslavov P, Govindan R, Estrin D (2002) Topology-informed Internet replica placement. Computer Communications 25(4):384–392

    Article  Google Scholar 

  • Rafiei D, Bharat K, Shukla A (2010) Diversifying web search results. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 781–790

    Chapter  Google Scholar 

  • Raghavan S, Garcia-Molina H (2001) Crawling the hidden Web. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 129–138

    Google Scholar 

  • Rasolofo Y, Savoy J (2003) Term proximity scoring for keyword-based retrieval systems. In: Sebastiani F (ed) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 2633. Springer, Berlin/Heidelberg, pp 79. doi:10.1007/3-540-36618-0_15, visited on December, 2010

    Google Scholar 

  • Ribeiro-Neto B, Barbosa R (1998) Query performance for tightly coupled distributed digital libraries. In: Proceedings of the ACM Conference on Digital Libraries. ACM Press, New York, NY, pp 182–190

    Chapter  Google Scholar 

  • Ribeiro-Neto B, Kitajima J, Navarro G, Sant’Ana C, Ziviani N (1998) Parallel generation of inverted files for distributed text collections. In: Proceedings of the Conference of the Chilean Computer Science Society. IEEE Computer Society, Washington, DC, pp 149–157

    Google Scholar 

  • Ribeiro-Neto B, Moura E, Neubert M, Ziviani N (1999) Efficient distributed algorithms to build inverted files. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 105–112

    Google Scholar 

  • Risvik K, Aasheim Y, Lidal M (2003) Multi-tier architecture for web search engines. In: Proceedings of the Latin American Conference on World Wide Web. IEEE Computer Society, Washington, DC, p 132

    Google Scholar 

  • Saraiva P, Silva de Moura E, Ziviani N, Meira W, Fonseca R, Riberio-Neto B (2001) Rank-preserving two-level caching for scalable search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 51–58

    Google Scholar 

  • Sarigiannis C, Plachouras V, Baeza-Yates R (2009) A study of the impact of index updates on distributed query processing for web search. In: Proceedings of the European Conference on Information Retrieval. Springer, Berlin/Heidelberg, pp 595–602

    Google Scholar 

  • Schenkel R, Broschart A, Hwang S, Theobald M, Weikum G (2007) Efficient text proximity search. In: Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 4726. Springer, Berlin/Heidelberg, pp 287–299

    Chapter  Google Scholar 

  • Scholer F, Williams H, Yiannis J, Zobel J (2002) Compression of inverted indexes for fast query evaluation. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 222–229

  • Schurman E, Brutlag J (2009) Performance related changes and their user impact. http://velocityconference.blip.tv/file/2279751/, visited on February, 2011

  • Shieh WY, Chung CP (2005) A statistics-based approach to incrementally update inverted files. Information Processing and Management 41(2):275–288

  • Shieh WY, Chen TF, Shann J, Chung CP (2003) Inverted file compression through document identifier reassignment. Information Processing and Management 39(1):117–131

  • Shkapenyuk V, Suel T (2002) Design and implementation of a high-performance distributed web crawler. In: Proceedings of the International Conference on Data Engineering. IEEE Computer Society, Washington, DC, p 357

  • Si L, Jin R, Callan J, Ogilvie P (2002a) A language modeling framework for resource selection and results merging. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 391–397

  • Silvestri F (2007) Sorting out the document identifier assignment problem. In: Amati G, Carpineto C, Romano G (eds) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 4425. Springer, Berlin/Heidelberg, pp 101–112

  • Silvestri F, Orlando S, Perego R (2004) Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 305–312

  • Skobeltsyn G, Junqueira F, Plachouras V, Baeza-Yates R (2008) ResIn: a combination of results caching and index pruning for high-performance web search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 131–138

  • Strohman T, Turtle H, Croft W (2005) Optimization strategies for complex queries. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 219–225

  • Sun JT, Zeng HJ, Liu H, Lu Y, Chen Z (2005) CubeSVD: a novel approach to personalized web search. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 382–390

  • Tan B, Shen X, Zhai C (2006) Mining long-term search history to improve search accuracy. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, pp 718–723

  • Teevan J, Dumais S, Horvitz E (2005) Personalizing search via automated analysis of interests and activities. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 449–456

  • Tomasic A, Garcia-Molina H (1993) Caching and database scaling in distributed shared-nothing information retrieval systems. ACM SIGMOD Record 22(2):129–138

  • Tomasic A, Garcia-Molina H, Shoens K (1994) Incremental updates of inverted lists for text document retrieval. In: Proceedings of the ACM Conference on Management of Data. ACM Press, New York, NY, pp 289–300

  • Tomasic A, Gravano L, Lue C, Schwarz P, Haas L (1997) Data structures for efficient broker implementation. ACM Transactions on Information Systems 15(3):223–253

  • Tonellotto N, Macdonald C, Ounis I (2010) Efficient dynamic pruning with proximity support. http://www.lsdsir.org/wp-content/uploads/2010/05/lsdsir10-5.pdf, visited on February, 2011

  • Turpin A, Tsegay Y, Hawking D, Williams H (2007) Fast generation of result snippets in web search. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 127–134

  • Turtle H, Flood J (1995) Query evaluation: Strategies and optimizations. Information Processing and Management 31(6):831–850

  • Varadarajan R, Hristidis V (2006) A system for query-specific document summarization. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 622–631

  • Wang L, Lin J, Metzler D (2010) Learning to efficiently rank. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 138–145

  • Witten I, Moffat A, Bell T (1999) Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco, CA

  • Wolf J, Squillante M, Yu P, Sethuraman J, Ozsen L (2002) Optimal crawling strategies for web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 136–147

  • Wong WP, Lee D (1993) Implementations of partial document ranking using inverted files. Information Processing and Management 29(5):647–669

  • Xu J, Callan J (1998) Effective retrieval with distributed collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 112–120

  • Xu J, Croft W (1999) Cluster-based language models for distributed retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 254–261

  • Yan H, Ding S, Suel T (2009a) Compressing term positions in web indexes. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 147–154

  • Yan H, Ding S, Suel T (2009b) Inverted index compression and query processing with optimized document ordering. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 401–410

  • Yu F, Xie Y, Ke Q (2010) SBotMiner: Large scale search bot detection. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 421–430

  • Yuwono B, Lee D (1997) Server ranking for distributed text retrieval systems on the Internet. In: Proceedings of the International Conference on Database Systems for Advanced Applications. World Scientific, Singapore, pp 41–50

  • Zeinalipour-Yazti D, Dikaiakos M (2002) Design and implementation of a distributed crawler and filtering processor. In: Proceedings of the International Workshop on Next Generation Information Technologies and Systems. Springer, London, pp 58–74

  • Zhang J, Long X, Suel T (2008) Performance of compressed inverted list caching in search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 387–396

  • Zobel J, Moffat A (2006) Inverted files for text search engines. ACM Computing Surveys 38(2):6

Author information

Corresponding author

Correspondence to Berkant Barla Cambazoglu.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Cambazoglu, B.B., Baeza-Yates, R. (2011). Scalability Challenges in Web Search Engines. In: Melucci, M., Baeza-Yates, R. (eds) Advanced Topics in Information Retrieval. The Information Retrieval Series, vol 33. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20946-8_2

  • DOI: https://doi.org/10.1007/978-3-642-20946-8_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20945-1

  • Online ISBN: 978-3-642-20946-8

  • eBook Packages: Computer Science (R0)
