Information Retrieval and Search Engines

Aggarwal, Charu C.

doi:10.1007/978-3-319-73531-3_9

Abstract

Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.

Notes

1.
A lexicographically sorted order refers to the order in which terms occur in a dictionary.
2.
If all query terms must be included in the result, then the intersection of the inverted lists can be performed up front and accumulators are assigned only to document identifiers that lie in this intersection. There are many such index elimination tricks that one can use to speed up the process.
3.
One can set \(Q(\overline{X}) = 0\) and select \(G(\overline{X})\) to be the length of document \(\overline{X}\). Normalization with the query length is not necessary because it is constant across all documents and does not change the relative ranking.
4.
In all the previous discussions on machine learned information retrieval, the training data is not specific to a particular query. However, each set of values of the extracted features is query-specific and multiple queries are represented in the same training data. The importance of the query-specific values of the meta-features (e.g., zones, authorship, location) of the document is learned with feedback data.
5.
This was one of the earliest ideas proposed by Croft and Harper [119]. However, other alternatives are possible. Sometimes, a few relevant documents may be available, which can be used to estimate \(p_{j}^{(1)}\). The other idea is to allow \(p_{j}^{(1)}\) to rise with the number of documents \(n_{j}\) containing term \(t_{j}\). For example, one can use \(p_{j}^{(1)} = \frac{1}{3} + \frac{2\cdot n_{j}}{3\cdot n}\) [184].
6.
As discussed earlier, the square root or logarithm is frequently applied to term frequencies to reduce the impact of repeated words.
7.
Such values of \(k_{1}\) are recommended in TREC experiments.
8.
Browsers also use POST requests when additional information is needed by the Web server. For example, an item is usually bought with a POST request. However, such requests are not used by crawlers because they might inadvertently cause actions (such as buying) that were not desired by the crawler.
9.
http://www.dmoz.org.
10.
A formal mathematical treatment characterizes this in terms of the ergodicity of the underlying Markov chains. In ergodic Markov chains, a necessary requirement is that it is possible to reach any state from any other state using a sequence of one or more transitions. This condition is referred to as strong connectivity. An informal description is provided here to facilitate understanding.
11.
In some applications such as bibliographic networks, the edge (i, j) may have a weight denoted by \(w_{ij}\). The transition probability \(p_{ij}\) is defined in such cases by \(\frac{w_{ij}}{\sum_{j\in Out(i)} w_{ij}}\).
12.
An alternative way to achieve this goal is to modify G by multiplying existing edge transition probabilities by the factor \((1-\alpha)\) and then adding \(\alpha/n\) to the transition probability between each pair of nodes in G. As a result, G becomes a directed clique with bidirectional edges between each pair of nodes. Such strongly connected Markov chains have unique steady-state probabilities. The resulting graph can then be treated as a Markov chain without having to separately account for the teleportation component. This model is equivalent to that discussed in the chapter.
13.
The left eigenvector \(\overline{X}\) of P is a row vector satisfying \(\overline{X}P =\lambda \overline{X}\). The right eigenvector \(\overline{Y }\) is a column vector satisfying \(P\overline{Y } =\lambda \overline{Y }\). For asymmetric matrices, the left and right eigenvectors are not the same. However, the eigenvalues are always the same. The unqualified term “eigenvector” refers to the right eigenvector by default.
14.
http://www.dmoz.org.
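
As an illustration of Note 2, the following is a minimal Python sketch of conjunctive query processing in which the inverted lists are intersected up front and accumulators are created only for document identifiers in the intersection. The index layout (a dictionary mapping each term to a list of (document identifier, weight) pairs sorted by identifier), the function names, and the additive scoring are assumptions made for illustration; they are not taken from the chapter.

```python
def intersect_sorted(a, b):
    """Merge-intersect two sorted lists of document identifiers."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def conjunctive_scores(index, query_terms):
    """Score only documents that contain every query term.

    index: hypothetical layout, term -> list of (doc_id, weight)
           postings sorted by doc_id.
    Returns a dict doc_id -> accumulated score.
    """
    lists = [index.get(term, []) for term in query_terms]
    # A missing or empty list means no document can contain all terms.
    if not lists or any(not postings for postings in lists):
        return {}

    # Start from the shortest list so the candidate set shrinks quickly.
    lists.sort(key=len)
    candidates = [doc_id for doc_id, _ in lists[0]]
    for postings in lists[1:]:
        candidates = intersect_sorted(candidates, [d for d, _ in postings])

    # Accumulators exist only for identifiers in the intersection.
    accumulators = {doc_id: 0.0 for doc_id in candidates}
    for postings in lists:
        for doc_id, weight in postings:
            if doc_id in accumulators:
                accumulators[doc_id] += weight  # assumed additive score
    return accumulators
```

Processing the shortest list first is one common design choice here: the candidate set can never grow during intersection, so a short first list bounds the work done on the longer ones.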
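
Notes 11 to 13 can be tied together in a small numerical sketch. The code below builds a row-stochastic transition matrix from weighted edges (Note 11), applies the \((1-\alpha)\) scaling and \(\alpha/n\) addition (Note 12), and finds the steady-state probabilities by repeatedly multiplying a row vector by the matrix, which converges to a left eigenvector with eigenvalue 1 (Note 13). The function names, the use of NumPy, the tolerance, and the value \(\alpha = 0.15\) in the usage lines are illustrative assumptions. Nodes without outgoing edges are not handled in this sketch and raise an error.

```python
import numpy as np


def transition_matrix(n, weighted_edges):
    """Row-stochastic P with p_ij = w_ij / (sum of w_ij over j in Out(i))."""
    W = np.zeros((n, n))
    for i, j, w in weighted_edges:
        W[i, j] += w
    out_weight = W.sum(axis=1, keepdims=True)
    if np.any(out_weight == 0):
        # Assumption of this sketch: every node has an outgoing edge.
        raise ValueError("node without outgoing edges")
    return W / out_weight


def add_teleportation(P, alpha):
    """Scale every entry by (1 - alpha), then add alpha / n to every entry."""
    n = P.shape[0]
    return (1.0 - alpha) * P + alpha / n


def steady_state(P, tol=1e-12, max_iter=10000):
    """Power iteration on a row vector x, so that x P = x at convergence."""
    n = P.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        x_next = x @ P
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x


# Usage on a hypothetical 3-node graph with edges (source, target, weight).
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0), (2, 1, 3.0)]
P = transition_matrix(3, edges)
G = add_teleportation(P, alpha=0.15)
pi = steady_state(G)
```

Because every entry of the modified matrix is positive, the chain is strongly connected, so the iteration converges to the same vector regardless of the starting distribution.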

Bibliography

C. Aggarwal. Recommender systems: The textbook. Springer, 2016.
V. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. ACM SIGIR Conference, pp. 35–42, 2001.
V. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1), pp. 151–166, 2005.
V. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. ACM SIGIR Conference, pp. 372–379, 2006.
V. Anh and A. Moffat. Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering, 18(6), pp. 857–861, 2006.
R. Baeza-Yates and B. Ribeiro-Neto. Modern information retrieval. ACM Press, 2011.
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7), pp. 107–117, 1998.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. ICML Conference, pp. 86–96, 2005.
S. Buttcher, C. Clarke, and G. V. Cormack. Information retrieval: Implementing and evaluating search engines. The MIT Press, 2010.
J. Callan. Distributed information retrieval. Advances in Information Retrieval, Springer, pp. 127–150, 2000.
Y. Cao, J. Xu, T. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting ranking SVM to document retrieval. ACM SIGIR Conference, pp. 186–193, 2006.
Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. ICML Conference, pp. 129–136, 2007.
D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. ACM SIGIR Conference, pp. 43–50, 2001.
S. Chakrabarti. Mining the Web: Discovering knowledge from hypertext data. Morgan Kaufmann, 2003.
S. Chakrabarti, M. Van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 31(11), pp. 1623–1640, 1999.
J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks, 30(1–7), pp. 161–172, 1998.
W. Cohen, R. Schapire, and Y. Singer. Learning to Order Things. Journal of Artificial Intelligence Research, 10, pp. 243–270, 1999.
W. B. Croft and D. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4), pp. 285–295, 1979.
W. B. Croft, D. Metzler, and T. Strohman. Search engines: Information retrieval in practice. Addison-Wesley Publishing Company, 2009.
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), pp. 107–113, 2008.
P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2), pp. 194–203, 1975.
C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and its analytical performance evaluation. ACM Transactions on Information Systems, 2(4), pp. 267–288, 1984.
W. Greiff. A theory of term weighting based on exploratory data analysis. ACM SIGIR Conference, pp. 11–19, 1998.
D. Grossman and O. Frieder. Information retrieval: Algorithms and heuristics. Springer Science and Business Media, 2012.
T. H. Haveliwala. Topic-sensitive PageRank. World Wide Web Conference, pp. 517–526, 2002.
D. Hiemstra. A linguistically motivated probabilistic model of information retrieval. International Conference on Theory and Practice of Digital Libraries, pp. 569–584, 1998.
S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. Journal of the American Society for Information Science and Technology, 54(8), pp. 713–729, 2003.
T. Joachims. Optimizing search engines using clickthrough data. ACM KDD Conference, pp. 133–142, 2002.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5), pp. 604–632, 1999.
R. Lempel and S. Moran. Predictive caching and prefetching of query results in search engines. World Wide Web Conference, pp. 19–28, 2003.
J. Leskovec, A. Rajaraman, and J. Ullman. Mining of massive datasets. Cambridge University Press, 2012.
N. Lester, J. Zobel, and H. Williams. Efficient online index maintenance for contiguous inverted lists. Information Processing and Management, 42(4), pp. 916–933, 2006.
B. Liu. Web data mining: exploring hyperlinks, contents, and usage data. Springer, New York, 2007.
T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), pp. 225–331, 2009.
X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. VLDB Conference, pp. 129–140, 2003.
C. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cambridge University Press, Cambridge, 2008.
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. ACM Transactions on Information Systems, 19(3), pp. 217–241, 2001.
D. Miller, T. Leek, and R. Schwartz. A Hidden Markov Model information retrieval system. ACM SIGIR Conference, pp. 214–221, 1999.
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4), pp. 349–379, 1996.
A. Ntoulas and J. Cho. Pruning policies for two-tiered inverted index with correctness guarantee. ACM SIGIR Conference, pp. 191–198, 2007.
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report, 1999–0120, Computer Science Department, Stanford University, 1998.
J. Ponte and W. Croft. A language modeling approach to information retrieval. ACM SIGIR Conference, pp. 275–281, 1998.
R. Rehurek and P. Sojka. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, 2010. https://radimrehurek.com/gensim/index.html
B. Ribeiro-Neto, E. Moura, M. Neubert, and N. Ziviani. Efficient distributed algorithms to build inverted files. ACM SIGIR Conference, pp. 105–112, 1999.
M. Richardson, A. Prakash, and E. Brill. Beyond PageRank: machine learning for static ranking. World Wide Web Conference, pp. 707–715, 2006.
S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60, pp. 503–520, 2004.
S. Robertson and K. Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), pp. 129–146, 1976.
S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. ACM CIKM Conference, pp. 42–49, 2004.
G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval, Technical Report 87–881, Cornell University, 1987. https://ecommons.cornell.edu/bitstream/handle/1813/6721/87-881.pdf?sequence=1
G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11), pp. 613–620, 1975.
H. Samet. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.
P. Saraiva, E. Silva de Moura, N. Ziviani, W. Meira, R. Fonseca, and B. Ribeiro-Neto. Rank-preserving two-level caching for scalable search engines. ACM SIGIR Conference, pp. 51–58, 2001.
F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. ACM SIGIR Conference, pp. 222–229, 2002.
A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. ACM SIGIR Conference, pp. 21–29, 1996.
K. Spärck Jones. A statistical interpretation of term specificity and its application in information retrieval. Journal of Documentation, 28(1), pp. 11–21, 1972.
K. Spärck Jones, S. Walker, and S. Robertson. A probabilistic model of information retrieval: development and comparative experiments: Part 2. Information Processing and Management, 36(6), pp. 809–840, 2000.
M. Taylor, H. Zaragoza, N. Craswell, S. Robertson, and C. Burges. Optimisation methods for ranking functions with multiple parameters. ACM CIKM Conference, pp. 585–593, 2006.
C. J. van Rijsbergen. Information retrieval. Butterworths, London, 1979.
H. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Transactions on Information Systems, 22(4), pp. 573–594, 2004.
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and indexing documents and images. Morgan Kaufmann, 1999.
Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. ACM SIGIR Conference, pp. 271–278, 2007.
C. Zhai. Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies, 1(1), pp. 1–141, 2008.
C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2), pp. 179–214, 2004.
J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. World Wide Web Conference, pp. 387–396, 2008.
J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software: Practice and Experience, 25(3), pp. 331–345, 1995.
J. Zobel and P. Dart. Phonetic string matching: Lessons from information retrieval. ACM SIGIR Conference, pp. 166–172, 1996.
J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4), pp. 453–490, 1998.
J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys (CSUR), 38(2), 6, 2006.
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
http://www.lemurproject.org
https://nutch.apache.org/
https://scrapy.org/
https://webarchive.jira.com/wiki/display/Heritrix
http://www.dataparksearch.org/
http://lucene.apache.org/core/
http://lucene.apache.org/solr/
http://sphinxsearch.com/
https://snap.stanford.edu/snap/description.html

Author information

Authors and Affiliations

IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Charu C. Aggarwal

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Information Retrieval and Search Engines. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_9

DOI: https://doi.org/10.1007/978-3-319-73531-3_9
Published: 20 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73530-6
Online ISBN: 978-3-319-73531-3
eBook Packages: Computer Science, Computer Science (R0)
