Information Retrieval and Search Engines

Abstract

Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.

Notes

  1.

    A lexicographically sorted order refers to the order in which terms occur in a dictionary.

  2.

    If all query terms must be included in the result, then the intersection of the inverted lists can be performed up front and accumulators are assigned only to document identifiers that lie in this intersection. There are many such index elimination tricks that one can use to speed up the process.

  3.

    One can set \(Q(\overline{X}) = 0\) and select \(G(\overline{X})\) to be the length of document \(\overline{X}\). Normalization with the query length is not necessary because it is constant across all documents and does not change the relative ranking.

  4.

    In all the previous discussions on machine learned information retrieval, the training data is not specific to a particular query. However, each set of values of the extracted features is query-specific and multiple queries are represented in the same training data. The importance of the query-specific values of the meta-features (e.g., zones, authorship, location) of the document is learned with feedback data.

  5.

    This was one of the earliest ideas proposed by Croft and Harper [119]. However, other alternatives are possible. Sometimes, a few relevant documents may be available, which can be used to estimate \(p_{j}^{(1)}\). The other idea is to allow \(p_{j}^{(1)}\) to rise with the number of documents \(n_{j}\) containing term \(t_{j}\). For example, one can use \(p_{j}^{(1)} = \frac{1} {3} + \frac{2\cdot n_{j}} {3\cdot n}\) [184].

  6.

    As discussed earlier, the square root or logarithm is frequently applied to term frequencies to reduce the impact of repeated words.

  7.

    Such values of \(k_{1}\) are recommended in TREC experiments.

  8.

    Browsers also use POST requests when additional information is needed by the Web server. For example, an item is usually bought with a POST request. However, such requests are not used by crawlers because they might inadvertently cause actions (such as buying) that were not desired by the crawler.

  9.

    http://www.dmoz.org.

  10.

    A formal mathematical treatment characterizes this in terms of the ergodicity of the underlying Markov chains. In ergodic Markov chains, a necessary requirement is that it is possible to reach any state from any other state using a sequence of one or more transitions. This condition is referred to as strong connectivity. An informal description is provided here to facilitate understanding.

  11.

    In some applications such as bibliographic networks, the edge \((i, j)\) may have a weight denoted by \(w_{ij}\). The transition probability \(p_{ij}\) is defined in such cases by \(\frac{w_{ij}} {\sum _{k\in Out(i)}w_{ik}}\).

  12.

    An alternative way to achieve this goal is to modify G by multiplying existing edge transition probabilities by the factor \((1-\alpha)\) and then adding \(\alpha/n\) to the transition probability between each pair of nodes in G. As a result, G will become a directed clique with bidirectional edges between each pair of nodes. Such strongly connected Markov chains have unique steady-state probabilities. The resulting graph can then be treated as a Markov chain without having to separately account for the teleportation component. This model is equivalent to that discussed in the chapter.

  13.

    The left eigenvector \(\overline{X}\) of P is a row vector satisfying \(\overline{X}P =\lambda \overline{X}\). The right eigenvector \(\overline{Y }\) is a column vector satisfying \(P\overline{Y } =\lambda \overline{Y }\). For asymmetric matrices, the left and right eigenvectors are not the same. However, the eigenvalues are always the same. The unqualified term “eigenvector” refers to the right eigenvector by default.

  14.

    http://www.dmoz.org.
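
Note 2 describes intersecting the inverted lists up front so that accumulators are assigned only to documents containing every query term. A minimal sketch of that idea in Python follows. It assumes each inverted list is a list of (document identifier, weight) pairs sorted by document identifier; the function names, the toy index, and the additive scoring are illustrative assumptions, not the chapter's implementation.

```python
from typing import Dict, List, Tuple

Posting = Tuple[int, float]  # (document identifier, term weight); hypothetical layout


def intersect_doc_ids(lists: List[List[Posting]]) -> List[int]:
    """Return the document identifiers present in every inverted list."""
    if not lists:
        return []
    # Start from the shortest list so the candidate set shrinks quickly.
    ordered = sorted(lists, key=len)
    candidates = [doc_id for doc_id, _ in ordered[0]]
    for postings in ordered[1:]:
        merged, i, j = [], 0, 0
        # Two-pointer merge over two lists sorted by document identifier.
        while i < len(candidates) and j < len(postings):
            if candidates[i] == postings[j][0]:
                merged.append(candidates[i])
                i += 1
                j += 1
            elif candidates[i] < postings[j][0]:
                i += 1
            else:
                j += 1
        candidates = merged
    return candidates


def score_conjunctive(index: Dict[str, List[Posting]], query: List[str]) -> Dict[int, float]:
    """Assign accumulators only to documents that contain all query terms."""
    lists = [index.get(term, []) for term in query]
    accumulators = {doc_id: 0.0 for doc_id in intersect_doc_ids(lists)}
    for postings in lists:
        for doc_id, weight in postings:
            if doc_id in accumulators:  # skip documents outside the intersection
                accumulators[doc_id] += weight
    return accumulators


# Toy index (made-up weights) used only to exercise the functions.
toy_index = {
    "web": [(1, 0.5), (2, 0.25), (4, 1.0)],
    "search": [(2, 0.5), (3, 0.75), (4, 0.5)],
}
scores = score_conjunctive(toy_index, ["web", "search"])
```

Only documents 2 and 4 receive accumulators here, because they are the only ones that appear in both lists; a query containing a term with an empty inverted list yields no accumulators at all.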

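Notes 11 to 13 can be read together: edge weights are normalized into transition probabilities, teleportation can be folded into the matrix by scaling each entry by \((1-\alpha)\) and adding \(\alpha/n\), and the steady-state probabilities form a left eigenvector of the resulting matrix with eigenvalue 1. The sketch below illustrates this in pure Python under stated assumptions: the three-node weighted graph, the value \(\alpha = 0.15\), the iteration count, and all function names are hypothetical, and every node is assumed to have at least one outgoing edge.

```python
from typing import Dict, List

Matrix = List[List[float]]


def transition_matrix(weights: Dict[int, Dict[int, float]], n: int) -> Matrix:
    """Set p_ij = w_ij / (sum of w_ik over out-neighbors k of i); assumes no dead ends."""
    P = [[0.0] * n for _ in range(n)]
    for i, out_edges in weights.items():
        total = sum(out_edges.values())
        for j, w in out_edges.items():
            P[i][j] = w / total
    return P


def add_teleportation(P: Matrix, alpha: float) -> Matrix:
    """Scale every entry by (1 - alpha) and add alpha / n, giving a dense matrix."""
    n = len(P)
    return [[(1.0 - alpha) * p + alpha / n for p in row] for row in P]


def steady_state(P: Matrix, iterations: int = 200) -> List[float]:
    """Power iteration on a row vector x, repeatedly computing x <- xP."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x


# Hypothetical weighted graph: node -> {out-neighbor: edge weight}.
toy_weights = {0: {1: 2.0, 2: 1.0}, 1: {2: 1.0}, 2: {0: 1.0}}
P = transition_matrix(toy_weights, 3)
M = add_teleportation(P, alpha=0.15)
pi = steady_state(M)  # approximately satisfies pi M = pi
```

Because every entry of `M` is positive, any state can be reached from any other in a single transition, which is the strong connectivity condition mentioned in note 10. The vector `pi` is a left eigenvector in the sense of note 13, since it multiplies the matrix from the left as a row vector.
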
Bibliography

  1. C. Aggarwal. Recommender systems: The textbook. Springer, 2016.

  2. V. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. ACM SIGIR Conference, pp. 35–42, 2001.

  3. V. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1), pp. 151–166, 2005.

  4. V. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. ACM SIGIR Conference, pp. 372–379, 2006.

  5. V. Anh and A. Moffat. Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering, 18(6), pp. 857–861, 2006.

  6. R. Baeza-Yates and B. Ribeiro-Neto. Modern information retrieval. ACM Press, 2011.

  7. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7), pp. 107–117, 1998.

  8. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. ICML Conference, pp. 89–96, 2005.

  9. S. Buttcher, C. Clarke, and G. V. Cormack. Information retrieval: Implementing and evaluating search engines. The MIT Press, 2010.

  10. J. Callan. Distributed information retrieval. Advances in Information Retrieval, Springer, pp. 127–150, 2000.

  11. Y. Cao, J. Xu, T. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting ranking SVM to document retrieval. ACM SIGIR Conference, pp. 186–193, 2006.

  12. Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. ICML Conference, pp. 129–136, 2007.

  13. D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. ACM SIGIR Conference, pp. 43–50, 2001.

  14. S. Chakrabarti. Mining the Web: Discovering knowledge from hypertext data. Morgan Kaufmann, 2003.

  15. S. Chakrabarti, M. Van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 31(11), pp. 1623–1640, 1999.

  16. J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks, 30(1–7), pp. 161–172, 1998.

  17. W. Cohen, R. Schapire, and Y. Singer. Learning to Order Things. Journal of Artificial Intelligence Research, 10, pp. 243–270, 1999.

  18. W. B. Croft and D. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4), pp. 285–295, 1979.

  19. W. B. Croft, D. Metzler, and T. Strohman. Search engines: Information retrieval in practice. Addison-Wesley Publishing Company, 2009.

  20. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), pp. 107–113, 2008.

  21. P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2), pp. 194–203, 1975.

  22. C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and its analytical performance evaluation. ACM Transactions on Information Systems, 2(4), pp. 267–288, 1984.

  23. W. Greiff. A theory of term weighting based on exploratory data analysis. ACM SIGIR Conference, pp. 11–19, 1998.

  24. D. Grossman and O. Frieder. Information retrieval: Algorithms and heuristics. Springer Science and Business Media, 2012.

  25. T. H. Haveliwala. Topic-sensitive pagerank. World Wide Web Conference, pp. 517–526, 2002.

  26. D. Hiemstra. A linguistically motivated probabilistic model of information retrieval. International Conference on Theory and Practice of Digital Libraries, pp. 569–584, 1998.

  27. S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. Journal of the American Society for Information Science and Technology, 54(8), pp. 713–729, 2003.

  28. T. Joachims. Optimizing search engines using clickthrough data. ACM KDD Conference, pp. 133–142, 2002.

  29. J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5), pp. 604–632, 1999.

  30. R. Lempel and S. Moran. Predictive caching and prefetching of query results in search engines. World Wide Web Conference, pp. 19–28, 2003.

  31. J. Leskovec, A. Rajaraman, and J. Ullman. Mining of massive datasets. Cambridge University Press, 2012.

  32. N. Lester, J. Zobel, and H. Williams. Efficient online index maintenance for contiguous inverted lists. Information Processing and Management, 42(4), pp. 916–933, 2006.

  33. B. Liu. Web data mining: exploring hyperlinks, contents, and usage data. Springer, New York, 2007.

  34. T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), pp. 225–331, 2009.

  35. X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. VLDB Conference, pp. 129–140, 2003.

  36. C. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cambridge University Press, Cambridge, 2008.

  37. S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. ACM Transactions on Information Systems, 19(3), pp. 217–241, 2001.

  38. D. Miller, T. Leek, and R. Schwartz. A Hidden Markov Model information retrieval system. ACM SIGIR Conference, pp. 214–221, 1999.

  39. A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4), pp. 349–379, 1996.

  40. A. Ntoulas and J. Cho. Pruning policies for two-tiered inverted index with correctness guarantee. ACM SIGIR Conference, pp. 191–198, 2007.

  41. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report, 1999–0120, Computer Science Department, Stanford University, 1998.

  42. J. Ponte and W. Croft. A language modeling approach to information retrieval. ACM SIGIR Conference, pp. 275–281, 1998.

  43. R. Rehurek and P. Sojka. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, 2010. https://radimrehurek.com/gensim/index.html

  44. B. Ribeiro-Neto, E. Moura, M. Neubert, and N. Ziviani. Efficient distributed algorithms to build inverted files. ACM SIGIR Conference, pp. 105–112, 1999.

  45. M. Richardson, A. Prakash, and E. Brill. Beyond PageRank: machine learning for static ranking. World Wide Web Conference, pp. 707–715, 2006.

  46. S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60, pp. 503–520, 2004.

  47. S. Robertson and K. Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), pp. 129–146, 1976.

  48. S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. ACM CIKM Conference, pp. 42–49, 2004.

  49. G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Technical Report 87–881, Cornell University, 1987. https://ecommons.cornell.edu/bitstream/handle/1813/6721/87-881.pdf?sequence=1

  50. G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11), pp. 613–620, 1975.

  51. H. Samet. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.

  52. P. Saraiva, E. Silva de Moura, N. Ziviani, W. Meira, R. Fonseca, and B. Ribeiro-Neto. Rank-preserving two-level caching for scalable search engines. ACM SIGIR Conference, pp. 51–58, 2001.

  53. F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. ACM SIGIR Conference, pp. 222–229, 2002.

  54. A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. ACM SIGIR Conference, pp. 21–29, 1996.

  55. K. Spärck Jones. A statistical interpretation of term specificity and its application in information retrieval. Journal of Documentation, 28(1), pp. 11–21, 1972.

  56. K. Spärck Jones, S. Walker, and S. Robertson. A probabilistic model of information retrieval: development and comparative experiments: Part 2. Information Processing and Management, 36(6), pp. 809–840, 2000.

  57. M. Taylor, H. Zaragoza, N. Craswell, S. Robertson, and C. Burges. Optimisation methods for ranking functions with multiple parameters. ACM CIKM Conference, pp. 585–593, 2006.

  58. C. J. van Rijsbergen. Information retrieval. Butterworths, London, 1979.

  59. H. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Transactions on Information Systems, 22(4), pp. 573–594, 2004.

  60. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and indexing documents and images. Morgan Kaufmann, 1999.

  61. Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. ACM SIGIR Conference, pp. 271–278, 2007.

  62. C. Zhai. Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies, 1(1), pp. 1–141, 2008.

  63. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2), pp. 179–214, 2004.

  64. J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. World Wide Web Conference, pp. 387–396, 2008.

  65. J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software: Practice and Experience, 25(3), pp. 331–345, 1995.

  66. J. Zobel and P. Dart. Phonetic string matching: Lessons from information retrieval. ACM SIGIR Conference, pp. 166–172, 1996.

  67. J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4), pp. 453–490, 1998.

  68. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys (CSUR), 38(2), 6, 2006.

  69. http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

  70. http://www.lemurproject.org

  71. https://nutch.apache.org/

  72. https://scrapy.org/

  73. https://webarchive.jira.com/wiki/display/Heritrix

  74. http://www.dataparksearch.org/

  75. http://lucene.apache.org/core/

  76. http://lucene.apache.org/solr/

  77. http://sphinxsearch.com/

  78. https://snap.stanford.edu/snap/description.html

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Information Retrieval and Search Engines. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_9

  • DOI: https://doi.org/10.1007/978-3-319-73531-3_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73530-6

  • Online ISBN: 978-3-319-73531-3

  • eBook Packages: Computer Science, Computer Science (R0)
