Abstract
Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.
Notes
- 1.
A lexicographically sorted order refers to the order in which terms occur in a dictionary.
- 2.
If all query terms must be included in the result, then the intersection of the inverted lists can be performed up front and accumulators are assigned only to document identifiers that lie in this intersection. There are many such index elimination tricks that one can use to speed up the process.
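The up-front intersection described in this note can be sketched as follows. This is a minimal illustration rather than the chapter's implementation: the index layout (a dictionary from term to postings of `(doc_id, weight)` sorted by document identifier), the function names, and the toy weights are all assumptions made for the example.

```python
def intersect_sorted(a, b):
    """Linear-time merge intersection of two sorted lists of document ids."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def conjunctive_scores(index, query_terms):
    """Score only the documents that contain every query term.

    `index` (hypothetical layout) maps a term to a list of
    (doc_id, weight) postings sorted by doc_id.
    """
    postings = [index.get(term, []) for term in query_terms]
    if not postings or any(not plist for plist in postings):
        return {}
    # Start from the shortest list so the candidate set shrinks quickly.
    postings.sort(key=len)
    candidates = [doc_id for doc_id, _ in postings[0]]
    for plist in postings[1:]:
        candidates = intersect_sorted(candidates, [doc_id for doc_id, _ in plist])
    # Accumulators are created only for documents in the intersection.
    accumulators = dict.fromkeys(candidates, 0.0)
    for plist in postings:
        for doc_id, weight in plist:
            if doc_id in accumulators:
                accumulators[doc_id] += weight
    return accumulators


# Toy index with made-up weights.
toy_index = {
    "web": [(1, 0.5), (2, 0.25), (4, 1.0)],
    "search": [(2, 0.75), (3, 0.25), (4, 0.5)],
}
scores = conjunctive_scores(toy_index, ["web", "search"])
```

Documents 1 and 3 never receive an accumulator here because each lacks one of the two query terms.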
- 3.
One can set \(Q(\overline{X}) = 0\) and select \(G(\overline{X})\) to be the length of document \(\overline{X}\). Normalization with the query length is not necessary because it is constant across all documents and does not change the relative ranking.
- 4.
In all the previous discussions on machine learned information retrieval, the training data is not specific to a particular query. However, each set of values of the extracted features is query-specific and multiple queries are represented in the same training data. The importance of the query-specific values of the meta-features (e.g., zones, authorship, location) of the document is learned with feedback data.
- 5.
This was one of the earliest ideas proposed by Croft and Harper [119]. However, other alternatives are possible. Sometimes, a few relevant documents may be available, which can be used to estimate \(p_{j}^{(1)}\). The other idea is to allow \(p_{j}^{(1)}\) to rise with the number of documents \(n_{j}\) containing term \(t_{j}\). For example, one can use \(p_{j}^{(1)} = \frac{1} {3} + \frac{2\cdot n_{j}} {3\cdot n}\) [184].
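As a small worked illustration of the estimate \(p_{j}^{(1)} = \frac{1}{3} + \frac{2\cdot n_{j}}{3\cdot n}\), the sketch below evaluates it for a few document frequencies. The function name and the input validation are assumptions added for the example.

```python
def relevant_term_probability(n_j, n):
    """Evaluate p_j^(1) = 1/3 + 2*n_j / (3*n).

    n_j: number of documents containing term t_j
    n:   total number of documents in the collection
    """
    if n <= 0 or not 0 <= n_j <= n:
        raise ValueError("require n > 0 and 0 <= n_j <= n")
    return 1.0 / 3.0 + (2.0 * n_j) / (3.0 * n)


# The estimate rises linearly from 1/3 (term absent everywhere)
# towards 1 (term present in every document).
estimates = [relevant_term_probability(n_j, 100) for n_j in (0, 50, 100)]
```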
- 6.
As discussed earlier, the square root or logarithm is frequently applied to term frequencies to reduce the impact of repeated words.
- 7.
Such values of \(k_{1}\) are recommended in TREC experiments.
- 8.
Browsers also use POST requests when additional information is needed by the Web server. For example, an item is usually bought with a POST request. However, crawlers do not issue such requests because they might inadvertently cause actions (such as buying) that were not intended by the crawler.
- 9.
- 10.
A formal mathematical treatment characterizes this in terms of the ergodicity of the underlying Markov chains. In ergodic Markov chains, a necessary requirement is that it is possible to reach any state from any other state using a sequence of one or more transitions. This condition is referred to as strong connectivity. An informal description is provided here to facilitate understanding.
- 11.
In some applications such as bibliographic networks, the edge \((i, j)\) may have a weight denoted by \(w_{ij}\). The transition probability \(p_{ij}\) is defined in such cases by \(\frac{w_{ij}} {\sum _{k\in Out(i)}w_{ik}}\).
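The weighted normalization in this note amounts to dividing each outgoing edge weight by the total outgoing weight of its source node. A minimal sketch, assuming a hypothetical adjacency layout in which each node maps to a dictionary of out-neighbor weights, and leaving nodes without outgoing edges with an empty distribution:

```python
def transition_probabilities(weights):
    """Turn outgoing edge weights into transition probabilities.

    `weights` (hypothetical layout) maps node i to {j: w_ij} over the
    out-neighbors j of i. Nodes without outgoing weight get an empty
    dictionary; handling such dangling nodes is left to the caller.
    """
    probs = {}
    for i, out_edges in weights.items():
        total = sum(out_edges.values())
        if total == 0:
            probs[i] = {}
            continue
        probs[i] = {j: w / total for j, w in out_edges.items()}
    return probs


# Toy weighted graph with made-up weights.
toy_weights = {"a": {"b": 3.0, "c": 1.0}, "b": {"c": 2.0}, "c": {}}
p = transition_probabilities(toy_weights)
```

Setting every weight to 1 recovers the unweighted case, in which each out-neighbor is equally likely.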
- 12.
An alternative way to achieve this goal is to modify G by multiplying existing edge transition probabilities by the factor \((1-\alpha)\) and then adding \(\alpha/n\) to the transition probability between each pair of nodes in G. As a result, G becomes a directed clique with bidirectional edges between each pair of nodes. Such strongly connected Markov chains have unique steady-state probabilities. The resulting graph can then be treated as a Markov chain without having to separately account for the teleportation component. This model is equivalent to that discussed in the chapter.
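The modification in this note can be sketched with a dense transition matrix and a plain power iteration. The sketch assumes that the input matrix is already row-stochastic (every node has at least one outgoing edge); the function names, the toy graph, and the value of the teleportation probability are illustrative choices, not values taken from the chapter.

```python
def add_teleportation(P, alpha):
    """Scale every entry by (1 - alpha) and add alpha / n to all pairs."""
    n = len(P)
    return [[(1.0 - alpha) * P[i][j] + alpha / n for j in range(n)]
            for i in range(n)]


def steady_state(P, iterations=200):
    """Approximate the steady-state row vector pi with pi = pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Toy row-stochastic matrix (assumed: no dangling nodes).
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
]
P_smooth = add_teleportation(P, alpha=0.15)
pi = steady_state(P_smooth)
```

Every entry of `P_smooth` is strictly positive, so each node can reach every other node in a single step, and the iteration converges to the same vector regardless of the starting distribution.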
- 13.
The left eigenvector \(\overline{X}\) of P is a row vector satisfying \(\overline{X}P =\lambda \overline{X}\). The right eigenvector \(\overline{Y }\) is a column vector satisfying \(P\overline{Y } =\lambda \overline{Y }\). For asymmetric matrices, the left and right eigenvectors are not the same. However, the eigenvalues are always the same. The unqualified term “eigenvector” refers to the right eigenvector by default.
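The distinction can be checked numerically on a small asymmetric matrix. The sketch below uses a hand-picked 2×2 row-stochastic matrix (an assumption made for illustration) whose eigenvalue 1 has the left eigenvector (5/6, 1/6) and the right eigenvector (1, 1).

```python
def row_times_matrix(x, P):
    """Compute the row vector x P."""
    return [sum(x[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P[0]))]


def matrix_times_column(P, y):
    """Compute the column vector P y (returned as a flat list)."""
    return [sum(P[i][j] * y[j] for j in range(len(y)))
            for i in range(len(P))]


# Hand-picked asymmetric matrix; both eigenvectors below belong to lambda = 1.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
left = [5.0 / 6.0, 1.0 / 6.0]   # satisfies left P = left
right = [1.0, 1.0]              # satisfies P right = right

left_image = row_times_matrix(left, P)
right_image = matrix_times_column(P, right)
# Using the left eigenvector on the right-hand side does not reproduce it.
mismatch = matrix_times_column(P, left)
```

For a transition matrix, the left eigenvector with eigenvalue 1 (normalized to sum to 1) gives the steady-state probabilities, whereas the corresponding right eigenvector is simply the vector of ones.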
- 14.
Bibliography
C. Aggarwal. Recommender systems: The textbook. Springer, 2016.
V. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. ACM SIGIR Conference, pp. 35–42, 2001.
V. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1), pp. 151–166, 2005.
V. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. ACM SIGIR Conference, pp. 372–379, 2006.
V. Anh and A. Moffat. Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering, 18(6), pp. 857–861, 2006.
R. Baeza-Yates and B. Ribeiro-Neto. Modern information retrieval. ACM Press, 2011.
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7), pp. 107–117, 1998.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. ICML Conference, pp. 86–96, 2005.
S. Buttcher, C. Clarke, and G. V. Cormack. Information retrieval: Implementing and evaluating search engines. The MIT Press, 2010.
J. Callan. Distributed information retrieval. Advances in Information Retrieval, Springer, pp. 127–150, 2000.
Y. Cao, J. Xu, T. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting ranking SVM to document retrieval. ACM SIGIR Conference, pp. 186–193, 2006.
Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. ICML Conference, pp. 129–136, 2007.
D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. ACM SIGIR Conference, pp. 43–50, 2001.
S. Chakrabarti. Mining the Web: Discovering knowledge from hypertext data. Morgan Kaufmann, 2003.
S. Chakrabarti, M. Van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 31(11), pp. 1623–1640, 1999.
J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks, 30(1–7), pp. 161–172, 1998.
W. Cohen, R. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10, pp. 243–270, 1999.
W. B. Croft and D. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4), pp. 285–295, 1979.
W. B. Croft, D. Metzler, and T. Strohman. Search engines: Information retrieval in practice. Addison-Wesley Publishing Company, 2009.
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), pp. 107–113, 2008.
P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2), pp. 194–203, 1975.
C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and its analytical performance evaluation. ACM Transactions on Information Systems, 2(4), pp. 267–288, 1984.
W. Greiff. A theory of term weighting based on exploratory data analysis. ACM SIGIR Conference, pp. 11–19, 1998.
D. Grossman and O. Frieder. Information retrieval: Algorithms and heuristics. Springer Science and Business Media, 2012.
T. H. Haveliwala. Topic-sensitive pagerank. World Wide Web Conference, pp. 517–526, 2002.
D. Hiemstra. A linguistically motivated probabilistic model of information retrieval. International Conference on Theory and Practice of Digital Libraries, pp. 569–584, 1998.
S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. Journal of the American Society for Information Science and Technology, 54(8), pp. 713–729, 2003.
T. Joachims. Optimizing search engines using clickthrough data. ACM KDD Conference, pp. 133–142, 2002.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5), pp. 604–632, 1999.
R. Lempel and S. Moran. Predictive caching and prefetching of query results in search engines. World Wide Web Conference, pp. 19–28, 2003.
J. Leskovec, A. Rajaraman, and J. Ullman. Mining of massive datasets. Cambridge University Press, 2012.
N. Lester, J. Zobel, and H. Williams. Efficient online index maintenance for contiguous inverted lists. Information Processing and Management, 42(4), pp. 916–933, 2006.
B. Liu. Web data mining: exploring hyperlinks, contents, and usage data. Springer, New York, 2007.
T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), pp. 225–231, 2009.
X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. VLDB Conference, pp. 129–140, 2003.
C. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cambridge University Press, Cambridge, 2008.
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. ACM Transactions on Information Systems, 19(3), pp. 217–241, 2001.
D. Miller, T. Leek, and R. Schwartz. A Hidden Markov Model information retrieval system. ACM SIGIR Conference, pp. 214–221, 1999.
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4), 1996.
A. Ntoulas and J. Cho. Pruning policies for two-tiered inverted index with correctness guarantee. ACM SIGIR Conference, pp. 191–198, 2007.
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report, 1999–0120, Computer Science Department, Stanford University, 1998.
J. Ponte and W. Croft. A language modeling approach to information retrieval. ACM SIGIR Conference, pp. 275–281, 1998.
R. Rehurek and P. Sojka. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, 2010. https://radimrehurek.com/gensim/index.html
B. Ribeiro-Neto, E. Moura, M. Neubert, and N. Ziviani. Efficient distributed algorithms to build inverted files. ACM SIGIR Conference, pp. 105–112, 1999.
M. Richardson, A. Prakash, and E. Brill. Beyond PageRank: machine learning for static ranking. World Wide Web Conference, pp. 707–715, 2006.
S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60, pp. 503–520, 2004.
S. Robertson and K. Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), pp. 129–146, 1976.
S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. ACM CIKM Conference, pp. 42–49, 2004.
G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Technical Report 87–881, Cornell University, 1987. https://ecommons.cornell.edu/bitstream/handle/1813/6721/87-881.pdf?sequence=1
G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11), pp. 613–620, 1975.
H. Samet. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.
P. Saraiva, E. Silva de Moura, N. Ziviani, W. Meira, R. Fonseca, and B. Ribeiro-Neto. Rank-preserving two-level caching for scalable search engines. ACM SIGIR Conference, pp. 51–58, 2001.
F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. ACM SIGIR Conference, pp. 222–229, 2002.
A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. ACM SIGIR Conference, pp. 21–29, 1996.
K. Spärck Jones. A statistical interpretation of term specificity and its application in information retrieval. Journal of Documentation, 28(1), pp. 11–21, 1972.
K. Spärck Jones, S. Walker, and S. Robertson. A probabilistic model of information retrieval: development and comparative experiments: Part 2. Information Processing and Management, 36(6), pp. 809–840, 2000.
M. Taylor, H. Zaragoza, N. Craswell, S. Robertson, and C. Burges. Optimisation methods for ranking functions with multiple parameters. ACM CIKM Conference, pp. 585–593, 2006.
C. J. van Rijsbergen. Information retrieval. Butterworths, London, 1979.
H. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Transactions on Information Systems, 22(4), pp. 573–594, 2004.
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and indexing documents and images. Morgan Kaufmann, 1999.
Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. ACM SIGIR Conference, pp. 271–278, 2007.
C. Zhai. Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies, 1(1), pp. 1–141, 2008.
C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2), pp. 179–214, 2004.
J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. World Wide Web Conference, pp. 387–396, 2008.
J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software: Practice and Experience, 25(3), pp. 331–345, 1995.
J. Zobel and P. Dart. Phonetic string matching: Lessons from information retrieval. ACM SIGIR Conference, pp. 166–172, 1996.
J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4), pp. 453–490, 1998.
J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys (CSUR), 38(2), 6, 2006.
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Aggarwal, C.C. (2018). Information Retrieval and Search Engines. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_9
DOI: https://doi.org/10.1007/978-3-319-73531-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73530-6
Online ISBN: 978-3-319-73531-3
eBook Packages: Computer Science, Computer Science (R0)