Scalability Challenges in Web Search Engines

Cambazoglu, Berkant Barla; Baeza-Yates, Ricardo

doi:10.1007/978-3-642-20946-8_2

Part of the book series: The Information Retrieval Series (INRE, volume 33)

Abstract

Continuous growth of the Web and of user bases forces web search engine companies to make costly investments in very large compute infrastructures. The scalability of these infrastructures requires careful performance optimizations in every major component of the search engine. Herein, we aim to provide fairly comprehensive coverage of the literature on scalability challenges in large-scale web search engines. We present the identified challenges through an architectural classification, starting from a simple single-node search system and moving towards a hypothetical multi-site web search architecture. We also discuss a number of open research problems and provide recommendations to researchers in the field.

Notes

1. The size of the indexed Web (visited on February 1, 2010), http://www.worldwidewebsize.com/.

References

Agrawal R, Gollapudi S, Halverson A, Ieong S (2009) Diversifying search results. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 5–14
Chapter Google Scholar
Altingovde I, Ozcan R, Ulusoy O (2009) A cost-aware strategy for query result caching in web search engines. In: Boughanem M, Berrut C, Mothe J, Soule-Dupuy C (eds) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 5478. Springer, Berlin/Heidelberg, pp 628–636
Google Scholar
Anh V, Moffat A (2004) Index compression using fixed binary codewords. In: Proceedings of the Australasian Database Conference. Australian Computer Society and Inc, Darlinghurst, pp 61–67
Google Scholar
Anh V, Moffat A (2006a) Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering 18(6):857–861
Article Google Scholar
Anh V, Moffat A (2006b) Pruned query evaluation using pre-computed impacts. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 372–379
Google Scholar
Anh V, de Kretser O, Moffat A (2001) Vector-space ranking with effective early termination. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 35–42
Google Scholar
Arasu A, Cho J, Garcia-Molina H, Paepcke A, Raghavan S (2001) Searching the Web. ACM Transactions on Internet Technology 1(1):2–43
Article Google Scholar
Badue C, Baeza-Yates R, Ribeiro-Neto BA, Ziviani N (2001) Distributed query processing using partitioned inverted files. In: Proceedings of the International Symposium on String Processing and Information Retrieval, pp 10–20
Chapter Google Scholar
Badue C, Baeza-Yates R, Ribeiro-Neto BA, Ziviani A, Ziviani N (2007) Analyzing imbalance among homogeneous index servers in a web search system. Information Processing and Management 43(3):592–608
Article Google Scholar
Baeza-Yates R, Ribeiro-Neto B (2010) Modern Information Retrieval, 2nd edn. Addison-Wesley, Reading, MA
Google Scholar
Baeza-Yates R, Saint-Jean F (2003) A three level search engine index based in query log distribution. In: Nascimento M, de Moura E, Oliveira A (eds) Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 2857. Springer, Berlin/Heidelberg, pp 56–65
Chapter Google Scholar
Baeza-Yates R, Junqueira F, Plachouras V, Witschel H (2007a) Admission policies for caches of search engine results. In: Ziviani N, Baeza-Yates R (eds) Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 4726. Springer, Berlin/Heidelberg, pp 74–85
Chapter Google Scholar
Baeza-Yates R, Gionis A, Junqueira F, Murdock V, Plachouras V, Silvestri F (2007b) The impact of caching on search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 183–190
Google Scholar
Baeza-Yates R, Castillo C, Junqueira F, Plachouras V, Silvestri F (2007c) Challenges in distributed information retrieval. In: Proceedings of the International Conference on Data Engineering. IEEE CS, New York, NY, pp 6–20
Google Scholar
Baeza-Yates R, Gionis A, Junqueira F, Plachouras V, Telloli L (2009a) On the feasibility of multi-site web search engines. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 425–434
Chapter Google Scholar
Baeza-Yates R, Murdock V, Hauff C (2009b) Efficiency trade-offs in two-tier web search systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 163–170
Google Scholar
Barroso L, Hölzle U (2009) The Datacenter as a Computer. Synthesis Lectures on Computer Architecture. Morgan & Claypool
Google Scholar
Barroso L, Dean J, Hölzle U (2003) Web search for a planet: The Google cluster architecture. IEEE Micro 23(2):22–28
Article Google Scholar
Bharat K, Broder AZ (1999) Mirror, mirror on the Web: A study of host pairs with replicated content. In: Proceedings of the International Conference on the World Wide Web. Elsevier/North-Holland, New York, NY, pp 1579–1590
Google Scholar
Bharat K, Broder A, Dean J, Henzinger M (2000) A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science 51(12):1114–1122
Article Google Scholar
Blanco R, Barreiro A (2006) TSP and cluster-based solutions to the reassignment of document identifiers. Journal of Information Retrieval 9(4):499–517
Article Google Scholar
Blanco R, Bortnikov E, Junqueira F, Lempel R, Telloli L, Zaragoza H (2010) Caching search engine results over incremental indices. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 82–89
Google Scholar
Blandford D, Blelloch G (2002) Index compression through document reordering. In: Proceedings of the Data Compression Conference. IEEE Computer Society, Washington, DC, pp 342–351
Google Scholar
Boldi P, Codenotti B, Santini M, Vigna S (2004) UbiCrawler: a scalable fully distributed web crawler. Software: Practice and Experience 34(8):711–726
Article Google Scholar
Boldi P, Bonchi F, Castillo C, Donato D, Gionis A, Vigna S (2008) The query-flow graph: Model and applications. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 609–618
Google Scholar
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1–7):107–117
Article Google Scholar
Broder A, Glassman S, Manasse M, Zweig G (1997) Syntactic clustering of the Web. Computer Networks and ISDN Systems 29:1157–1166
Article Google Scholar
Broder A, Carmel D, Herscovici M, Soffer A, Zien J (2003a) Efficient query evaluation using a two-level retrieval process. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 426–434
Google Scholar
Broder A, Najork M, Wiener J (2003b) Efficient URL caching for World Wide Web crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 679–689
Google Scholar
Brown E (1995) Fast evaluation of structured queries for information retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 30–38
Google Scholar
Buckley C, Lewit A (1985) Optimization of inverted vector searches. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 97–110
Google Scholar
Büttcher S, Clarke C (2005) Indexing time vs query time: trade-offs in dynamic information retrieval systems. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 317–318
Google Scholar
Büttcher S, Clarke C, Lushman B (2006a) Hybrid index maintenance for growing text collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 356–363
Google Scholar
Büttcher S, Clarke C, Lushman B (2006b) Term proximity scoring for ad-hoc retrieval on very large text collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 621–622
Google Scholar
Cacheda F, Carneiro V, Plachouras V, Ounis I (2007) Performance analysis of distributed information retrieval architectures using an improved network simulation model. Information Processing and Management 43(1):204–224
Article Google Scholar
Cahoon B, McKinley K, Lu Z (2000) Evaluating the performance of distributed architectures for information retrieval using a variety of workloads. ACM Transactions on Information Systems 18(1):1–43
Article Google Scholar
Callan J, Lu Z, Croft W (1995b) Searching distributed collections with inference networks. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 21–28
Google Scholar
Cambazoglu B, Aykanat C (2006) Performance of query processing implementations in ranking-based text retrieval systems using inverted indices. Information Processing and Management 42(4):875–898
Article Google Scholar
Cambazoglu B, Turk A, Aykanat C (2004) Data-parallel web crawling models. In: Proceedings of the Symposium on Computer and Information Sciences. Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp 801–809
Google Scholar
Cambazoglu B, Plachouras V, Junqueira F, Telloli L (2008) On the feasibility of geographically distributed web crawling. In: Proceedings of the International Conference on Scalable Information Systems. ICST (Institute for Computer Sciences and Social-Informatics and Telecommunications Engineering), ICST, Brussels, pp 1–10
Google Scholar
Cambazoglu B, Plachouras V, Baeza-Yates R (2009) Quantifying performance and quality gains in distributed web search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 411–418
Google Scholar
Cambazoglu B, Zaragoza H, Chapelle O, Chen J, Liao C, Zheng Z, Degenhardt J (2010a) Early exit optimizations for additive machine learned ranking systems. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 411–420
Chapter Google Scholar
Cambazoglu B, Varol E, Kayaaslan E, Aykanat C, Baeza-Yates R (2010b) Query forwarding in geographically distributed search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 90–97
Google Scholar
Cambazoglu B, Junqueira F, Plachouras V, Banachowski S, Cui B, Lim S, Bridge B (2010c) A refreshing perspective of search engine caching. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 181–190
Chapter Google Scholar
Carmel D, Cohen D, Fagin R, Farchi E, Herscovici M, Maarek Y, Soffer A (2001) Static index pruning for information retrieval systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 43–50
Google Scholar
Castillo C (2003) Cooperation schemes between a web server and a web search engine. In: Proceedings of the Latin American Conference on World Wide Web. IEEE CS, New York, NY, pp 212–213
Google Scholar
Chakrabarti S, van den Berg M, Dom B (1999) Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks and ISDN Systems 31(11–16):1623–1640
Google Scholar
Cho J, Garcia-Molina H (2000) The evolution of the Web and implications for an incremental crawler. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 200–209
Google Scholar
Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 124–135
Google Scholar
Cho J, Garcia-Molina H (2003) Effective page refresh policies for web crawlers. ACM Transactions on Database Systems 28(4):390–426
Article Google Scholar
Cho J, Garcia-Molina H, Page L (1998) Efficient crawling through URL ordering. Computer Networks and ISDN Systems 30(1–7):161–172
Article Google Scholar
Cho J, Shivakumar N, Garcia-Molina H (2000) Finding replicated web collections. ACM SIGMOD Record 29(2):355–366
Article Google Scholar
Chowdhury A, Pass G (2003) Operational requirements for scalable search systems. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 435–442
Google Scholar
Chowdhury A, Frieder O, Grossman D, McCabe M (2002) Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems 20(2):171–191
Article Google Scholar
Chung C, Clarke CA (2002) Topic-oriented collaborative crawling. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 34–42
Google Scholar
Clarke CA, Cormack G, Burkowski F (1994) Fast inverted indexes with on-line update. Tech Rep CS-94-40, University of Waterloo
Google Scholar
Clarke CA, Agichtein E, Dumais S, White R (2007) The influence of caption features on clickthrough patterns in web search. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 135–142
Google Scholar
Cooper J, Coden A, Brown E (2002) Detecting similar documents using salient terms. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 245–251
Google Scholar
Cutting D, Pedersen J (1990) Optimization for dynamic inverted index maintenance. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 405–411
Google Scholar
Dasgupta A, Ghosh A, Kumar R, Olston C, Pandey S, Tomkins A (2007) The discoverability of the Web. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 421–430
Chapter Google Scholar
de Kretser O, Moffat A, Shimmin T, Zobel J (1998) Methodologies for distributed information retrieval. In: Proceedings of the International Conference on Distributed Computing Systems. IEEE Computer Society, Washington, DC, p 66
Google Scholar
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1)):107–113
Article Google Scholar
Diligenti M, Coetzee F, Lawrence S, Giles C, Gori M (2000) Focused crawling using context graphs. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 527–534
Google Scholar
Ding S, Attenberg J, Suel T (2010) Scalable techniques for document identifier assignment in inverted indexes. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 311–320
Chapter Google Scholar
D’Souza D, Thom J, Zobel J (2004) Collection selection for managed distributed document databases. Information Processing and Management 40(3):527–546
Article Google Scholar
Edwards J, McCurley K, Tomlin J (2001) An adaptive model for optimizing performance of an incremental web crawler. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 106–113
Google Scholar
Eichmann D (1995) Ethical web agents. Computer Networks and ISDN Systems 28(1–2):127–136
Article Google Scholar
Exposto J, Macedo J, Pina A, Alves A, Rufino J (2005) Geographical partition for distributed web crawling. In: Proceedings of the Workshop on Geographic Information Retrieval. ACM Press, New York, NY, pp 55–60
Chapter Google Scholar
Exposto J, Macedo J, Pina A, Alves A, Rufino J (2008) Efficient partitioning strategies for distributed web crawling. In: Proceedings of the International Conference on Information Networking: Towards Ubiquitous Networking and Services. Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp 544–553
Chapter Google Scholar
Fagni T, Perego R, Silvestri F, Orlando S (2006) Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems 24(1):51–78
Article Google Scholar
Fetterly D, Manasse M, Najork M, Wiener J (2004) A large-scale study of the evolution of web pages. Software: Practice and Experience 34(2):213–237
Article Google Scholar
Fetterly D, Craswell N, Vinay V (2009) The impact of crawl policy on web search effectiveness. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 580–587
Google Scholar
Fox E, Lee W (1991) FAST-INV: A fast algorithm for building large inverted files. Tech Rep 91–10, Virginia Polytechnic Institute and State University
Google Scholar
Gan Q, Suel T (2009) Improved techniques for result caching in web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 431–440
Google Scholar
Gao W, Lee H, Miao Y (2006) Geographically focused collaborative crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 287–296
Chapter Google Scholar
Gravano L, Garcia-Molina H (1995) Generalizing GlOSS to vector-space databases and broker hierarchies. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 78–89
Google Scholar
Gyöngyi Z, Garcia-Molina H (2005a) Link spam alliances. In: Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, pp 517–528
Google Scholar
Gyöngyi Z, Garcia-Molina H (2005b) Web spam taxonomy. http://airweb.cse.lehigh.edu/2005/gyongyi.pdf, visited on February, 2011
Gyöngyi Z, Garcia-Molina H, Pedersen J (2004) Combating web spam with TrustRank. In: Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, pp 576–587
Google Scholar
Harman D, Candela G (1990) Retrieving records from a gigabyte of text on a mini-computer using statistical ranking. Journal of the American Society for Information Science 41(8):581–589
Article Google Scholar
Harman D, Baeza-Yates R, Fox E, Lee W (1992) Inverted files. In: Baeza-Yates WBFR (ed) Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Upper Saddle River, NJ, pp 28–43
Google Scholar
Hawking D (1997) Scalable text retrieval for large digital libraries. In: Proceedings of the European Conference on Digital Libraries. Springer, London, pp 127–145
Google Scholar
Heinz S, Zobel J (2003) Efficient single-pass index construction for text databases. Journal of the American Society for Information Science 54(8):713–729
Google Scholar
Henzinger M (2006) Finding near-duplicate web pages: A large-scale evaluation of algorithms. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 284–291
Google Scholar
Heydon A, Najork M (1999) Mercator: a scalable, extensible web crawler. World Wide Web 2(4):219–229
Article Google Scholar
Hirai J, Raghavan S, Garcia-Molina H, Paepcke A (2000) WebBase: a repository of web pages. In: Proceedings of the International Conference on the World Wide Web. North-Holland, Amsterdam, pp 277–293
Google Scholar
Jeh G, Widom J (2003) Scaling personalized web search. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 271–279
Google Scholar
Jeong BS, Omiecinski E (1995) Inverted file partitioning schemes in multiple disk systems. IEEE Transactions on Parallel and Distributed Systems 6(2):142–153
Article Google Scholar
Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, pp 133–142
Google Scholar
Jónsson B, Franklin M, Srivastava D (1998) Interaction of query evaluation and buffer management for information retrieval. ACM SIGMOD Record 27(2):118–129
Article Google Scholar
Kayaaslan E, Cambazoglu B, Aykanat C (2010) Document replication strategies for geographically distributed Web search engines. To be submitted
Google Scholar
Kulkarni A, Callan J (2010) Topic-based index partitions for efficient and effective selective search. http://www.lsdsir.org/, visited on February, 2011
Larkey L, Connell M, Callan J (2000) Collection selection and results merging with topically organized US patents and TREC data. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 282–289
Google Scholar
Lawrence S, Giles C (2000) Accessibility of information on the Web. Intelligence 11(1):32–39
Article Google Scholar
Lee HT, Leonard D, Wang X, Loguinov D (2008) IRLbot: Scaling to 6 billion pages and beyond. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 427–436
Google Scholar
Lempel R, Moran S (2003) Predictive caching and prefetching of query results in search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 19–28
Google Scholar
Lester N, Zobel J, Williams H (2004) In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In: Proceedings of the Australasian Database Conference. Australian Computer Society, Darlinghurst, pp 15–23
Google Scholar
Lester N, Moffat A, Zobel J (2008) Efficient online index construction for text databases. ACM Transactions on Database Systems 33(3):1–33
Article Google Scholar
Lewandowskii D (2008) A three-year study on the freshness of web search engine databases. Journal of Information Science 34(6):817–831
Article Google Scholar
Liu X, Croft W (2004) Cluster-based retrieval using language models. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 186–193
Google Scholar
Liu F, Yu C, Meng W (2002) Personalized web search by mapping user queries to categories. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 558–565
Google Scholar
Long X, Suel T (2005) Three-level caching for efficient query processing in large web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 257–266
Chapter Google Scholar
Lu Z, McKinley K (1999) Partial replica selection based on relevance for information retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 97–104
Google Scholar
Lu Z, McKinley K (2000) Partial collection replication versus caching for information retrieval systems. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 248–255
Google Scholar
Lucchese C, Orlando S, Perego R, Silvestri F (2007) Mining query logs to optimize index partitioning in parallel web search engines. In: Proceedings of the International Conference on Scalable Information Systems. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium, pp 1–9
Google Scholar
MacFarlane A, McCann J, Robertson S (2000) Parallel search using partitioned inverted files. In: Proceedings of the International Symposium on String Processing Information Retrieval. IEEE Computer Society, Washington, DC, pp 209–220
Chapter Google Scholar
Markatos E (2001) On caching search engine query results. Computer Communications 24(2):137–143
Article Google Scholar
Melnik S, Raghavan S, Yang B, Garcia-Molina H (2001) Building a distributed full-text index for the Web. ACM Transactions on Information Systems 19(3):217–241
Article Google Scholar
Moffat A, Bell TH (1995) In situ generation of compressed inverted files. Journal of the American Society for Information Science 46(7):537–550
Article Google Scholar
Moffat A, Stuiver L (2000) Binary interpolative coding for effective index compression. Journal of Information Retrieval 3(1):25–47
Article Google Scholar
Moffat A, Zobel J (1996) Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14(4):349–379
Article Google Scholar
Moffat A, Webber W, Zobel J, Baeza-Yates R (2007) A pipelined architecture for distributed text query evaluation. Journal of Information Retrieval 10(3):205–231
Article Google Scholar
Najork M, Wiener J (2001) Breadth-first crawling yields high-quality pages. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 114–118
Google Scholar
Ntoulas A, Cho J (2007) Pruning policies for two-tiered inverted index with correctness guarantee. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 191–198
Google Scholar
Ntoulas A, Cho J, Olston C (2004) What’s new on the Web?: The evolution of the Web from a search engine perspective. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 1–12
Google Scholar
Olston C, Pandey S (2008) Recrawl scheduling based on information longevity. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 437–446
Google Scholar
Ozcan R, Altingovde I, Ulusoy O (2008) Static query result caching revisited. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 1169–1170
Google Scholar
Page L, Brin S, Motwani R, Winograd T (1999) The PageRank citation ranking: Bringing order to the Web. http://ilpubs.stanford.edu:8090/422/, visited on February, 2011
Pandey S, Olston C (2005) User-centric web crawling. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 401–411
Chapter Google Scholar
Pandey S, Olston C (2008) Crawl ordering by search impact. In: Proceedings of the ACM Conference on web Search and Data Mining. ACM Press, New York, NY, pp 3–14
Chapter Google Scholar
Persin M (1994) Document filtering for fast ranking. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 339–348
Google Scholar
Pitkow J, Schütze H, Cass T, Cooley R, Turnbull D, Edmonds A, Adar E, Breuel T (2002) Personalized search. Communications of the ACM 45(9):50–55
Google Scholar
Puppin D, Silvestri F, Perego R, Baeza-Yates R (2010) Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load. ACM Transactions on Information Systems 28(2):1–36
Article Google Scholar
Radoslavov P, Govindan R, Estrin D (2002) Topology-informed Internet replica placement. Computer Communications 25(4):384–392
Article Google Scholar
Rafiei D, Bharat K, Shukla A (2010) Diversifying web search results. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 781–790
Chapter Google Scholar
Raghavan S, Garcia-Molina H (2001) Crawling the hidden Web. In: Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, pp 129–138
Google Scholar
Rasolofo Y, Savoy J (2003) Term proximity scoring for keyword-based retrieval systems. In: Sebastiani F (ed) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 2633. Springer, Berlin/Heidelberg, pp 79. doi:10.1007/3-540-36618-0_15, visited on December, 2010
Google Scholar
Ribeiro-Neto B, Barbosa R (1998) Query performance for tightly coupled distributed digital libraries. In: Proceedings of the ACM Conference on Digital Libraries. ACM Press, New York, NY, pp 182–190
Chapter Google Scholar
Ribeiro-Neto B, Kitajima J, Navarro G, Sant’Ana C, Ziviani N (1998) Parallel generation of inverted files for distributed text collections. In: Proceedings of the Conference of the Chilean Computer Science Society. IEEE Computer Society, Washington, DC, pp 149–157
Google Scholar
Ribeiro-Neto B, Moura E, Neubert M, Ziviani N (1999) Efficient distributed algorithms to build inverted files. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 105–112
Google Scholar
Risvik K, Aasheim Y, Lidal M (2003) Multi-tier architecture for web search engines. In: Proceedings of the Latin American Conference on World Wide Web. IEEE Computer Society, Washington, DC, p 132
Google Scholar
Saraiva P, Silva de Moura E, Ziviani N, Meira W, Fonseca R, Riberio-Neto B (2001) Rank-preserving two-level caching for scalable search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 51–58
Google Scholar
Sarigiannis C, Plachouras V, Baeza-Yates R (2009) A study of the impact of index updates on distributed query processing for web search. In: Proceedings of the European Conference on Information Retrieval. Springer, Berlin/Heidelberg, pp 595–602
Google Scholar
Schenkel R, Broschart A, Hwang S, Theobald M, Weikum G (2007) Efficient text proximity search. In: Proceedings of the International Symposium on String Processing Information Retrieval. Lecture Notes in Computer Science, vol 4726. Springer, Berlin/Heidelberg, pp 287–299
Chapter Google Scholar
Scholer F, Williams H, Yiannis J, Zobel J (2002) Compression of inverted indexes for fast query evaluation. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 222–229
Google Scholar
Schurman E, Brutlag J (2009) Performance related changes and their user impact. http://velocityconference.blip.tv/file/2279751/, visited on February, 2011
Shieh WY, Chung CP (2005) A statistics-based approach to incrementally update inverted files. Information Processing and Management 41(2):275–288
Google Scholar
Shieh WY, Chen TF, Shann J, Chung CP (2003) Inverted file compression through document identifier reassignment. Information Processing and Management 39(1):117–131
Article MATH Google Scholar
Shkapenyuk V, Suel T (2002) Design and implementation of a high-performance distributed web crawler. In: Proceedings of the International Conference on Data Engineering. IEEE Computer Society, Washington, DC, p 357
Google Scholar
Si L, Jin R, Callan J, Ogilvie P (2002a) A language modeling framework for resource selection and results merging. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 391–397
Google Scholar
Silvestri F (2007) Sorting out the document identifier assignment problem. In: Amati G, Carpineto C, Romano G (eds) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 4425. Springer, Berlin/Heidelberg, pp 101–112
Google Scholar
Silvestri F, Orlando S, Perego R (2004) Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 305–312
Google Scholar
Skobeltsyn G, Junqueira F, Plachouras V, Baeza-Yates R (2008) ResIn: a combination of results caching and index pruning for high-performance web search engines. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 131–138
Strohman T, Turtle H, Croft W (2005) Optimization strategies for complex queries. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 219–225
Sun JT, Zeng HJ, Liu H, Lu Y, Chen Z (2005) CubeSVD: a novel approach to personalized web search. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 382–390
Tan B, Shen X, Zhai C (2006) Mining long-term search history to improve search accuracy. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, pp 718–723
Teevan J, Dumais S, Horvitz E (2005) Personalizing search via automated analysis of interests and activities. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 449–456
Tomasic A, Garcia-Molina H (1993) Caching and database scaling in distributed shared-nothing information retrieval systems. ACM SIGMOD Record 22(2):129–138
Tomasic A, Garcia-Molina H, Shoens K (1994) Incremental updates of inverted lists for text document retrieval. In: Proceedings of the ACM Conference on Management of Data. ACM Press, New York, NY, pp 289–300
Tomasic A, Gravano L, Lue C, Schwarz P, Haas L (1997) Data structures for efficient broker implementation. ACM Transactions on Information Systems 15(3):223–253
Tonellotto N, Macdonald C, Ounis I (2010) Efficient dynamic pruning with proximity support. http://www.lsdsir.org/wp-content/uploads/2010/05/lsdsir10-5.pdf, visited on February, 2011
Turpin A, Tsegay Y, Hawking D, Williams H (2007) Fast generation of result snippets in web search. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 127–134
Turtle H, Flood J (1995) Query evaluation: Strategies and optimizations. Information Processing and Management 31(6):831–850
Varadarajan R, Hristidis V (2006) A system for query-specific document summarization. In: Proceedings of the ACM Conference on Information and Knowledge Management. ACM Press, New York, NY, pp 622–631
Wang L, Lin J, Metzler D (2010) Learning to efficiently rank. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 138–145
Witten I, Moffat A, Bell T (1999) Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco, CA
Wolf J, Squillante M, Yu P, Sethuraman J, Ozsen L (2002) Optimal crawling strategies for web search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 136–147
Wong WP, Lee D (1993) Implementations of partial document ranking using inverted files. Information Processing and Management 29(5):647–669
Xu J, Callan J (1998) Effective retrieval with distributed collections. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 112–120
Xu J, Croft W (1999) Cluster-based language models for distributed retrieval. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 254–261
Yan H, Ding S, Suel T (2009a) Compressing term positions in web indexes. In: Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 147–154
Yan H, Ding S, Suel T (2009b) Inverted index compression and query processing with optimized document ordering. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 401–410
Yu F, Xie Y, Ke Q (2010) SBotMiner: large scale search bot detection. In: Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 421–430
Yuwono B, Lee D (1997) Server ranking for distributed text retrieval systems on the Internet. In: Proceedings of the International Conference on Database Systems for Advanced Applications. World Scientific, Singapore, pp 41–50
Zeinalipour-Yazti D, Dikaiakos M (2002) Design and implementation of a distributed crawler and filtering processor. In: Proceedings of the International Workshop on Next Generation Information Technologies and Systems. Springer, London, pp 58–74
Zhang J, Long X, Suel T (2008) Performance of compressed inverted list caching in search engines. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 387–396
Zobel J, Moffat A (2006) Inverted files for text search engines. ACM Computing Surveys 38(2):6

Author information

Authors and Affiliations

Yahoo! Research, Diagonal 177, p9, 08018, Barcelona, Spain
Berkant Barla Cambazoglu & Ricardo Baeza-Yates

Authors

Berkant Barla Cambazoglu
Ricardo Baeza-Yates

Corresponding author

Correspondence to Berkant Barla Cambazoglu.

Editor information

Editors and Affiliations

Department of Information Engineering, University of Padua, Via G. Gradenigo, 6, Padua, 35131, Italy
Massimo Melucci
Yahoo! Research Barcelona, Ocata 1, Barcelona, 08003, Spain
Ricardo Baeza-Yates

About this chapter

Cite this chapter

Cambazoglu, B.B., Baeza-Yates, R. (2011). Scalability Challenges in Web Search Engines. In: Melucci, M., Baeza-Yates, R. (eds) Advanced Topics in Information Retrieval. The Information Retrieval Series, vol 33. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20946-8_2

DOI: https://doi.org/10.1007/978-3-642-20946-8_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20945-1
Online ISBN: 978-3-642-20946-8
eBook Packages: Computer Science, Computer Science (R0)