Information Retrieval and Search Engines

  • Charu C. Aggarwal
Chapter

Abstract

Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. Web search poses many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval uses only the content of documents to answer queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Charu C. Aggarwal
    1. IBM T. J. Watson Research Center, Yorktown Heights, USA
