
Answering keyword queries through cached subqueries in best match retrieval models


Abstract

Caching is one of the techniques that Information Retrieval Systems (IRSs) and Web Search Engines (WSEs) use to reduce processing costs and attain faster response times. In this paper we introduce Top-K SCRC (Set Cover Results Cache), a novel technique for results caching which aims at maximizing the utilization of the cache. Identical queries are treated as in plain results caching (i.e. their evaluation does not require accessing the index), while combinations of cached subqueries are exploited as in posting lists caching; however, the exploited subqueries are not necessarily single-word queries. The problem of finding the right set of cached subqueries to answer an incoming query is actually the Exact Set Cover problem. This technique can be applied to any best-match retrieval model that is based on a decomposable scoring function, and we show that several best-match retrieval models (i.e. VSM, Okapi BM25 and hybrid retrieval models) rely on such scoring functions. To increase the capacity (in queries) of the cache, only the top-K results of each cached query are stored, and we introduce metrics for measuring the accuracy of the composed top-K answer. By analyzing queries submitted to real-world WSEs, we verified that a significant proportion of queries have terms that are the union of the terms of other queries. The comparative evaluation over traces of real query sets showed that the Top-K SCRC is on average two times faster than a plain Top-K RC for the same cache size.



Notes

  1. Regarding semi-decomposable scoring functions, the SCRC structure does not need to change at all. The score of each document in a cached answer is its decomposable query-dependent score.

  2. In query-independent ranking models we do not have this issue.

  3. www.excite.com

  4. www.altavista.com

  5. www.alltheweb.com

References

  • Baeza-Yates, R., Junqueira, F., Plachouras, V., Witschel, H. (2007). Admission policies for caches of search engine results. String Processing and Information Retrieval (SPIRE), Lecture Notes in Computer Science, 4726, (pp. 74–85). Springer.

  • Baeza-Yates, R., Gionis, A., Junqueira, F.P., Murdock, V., Plachouras, V., Silvestri, F. (2008). Design trade-offs for search engine caching. ACM Transactions on the Web, 2(4), 1–28. doi:10.1145/1409220.1409223.


  • Baeza-Yates, R., & Saint-Jean, F. (2003). A three level search engine index based in query log distribution. String Processing and Information Retrieval, (pp. 56–65). Springer.

  • Baeza-Yates, R.A., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., Silvestri, F. (2007). The impact of caching on search engines. SIGIR, (pp. 183–190).


  • Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J. (2003). Efficient query evaluation using a two-level retrieval process. Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM '03), (pp. 426–434). New York: ACM.

  • Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C. (2001). Introduction to Algorithms, 2nd edn. The MIT Press and McGraw-Hill.

  • Fafalios, P., Kitsos, I., Marketakis, Y., Baldassarre, C., Salampasis, M., Tzitzikas, Y. (2012). Web searching with entity mining at query time. Multidisciplinary Information Retrieval, (pp. 73–88).

  • Fagni, T., Perego, R., Silvestri, F., Orlando, S. (2006). Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems, 24(1), 51–78.


  • Jansen, B., & Pooch, U. (2000). A review of web searching studies and a framework for future research. Journal of the American Society for Information Science and Technology, 52(3), 235–246.


  • Jansen, B., & Spink, A. (2005). An analysis of web searching by European AllTheWeb.com users. Information Processing & Management, 41(2), 361–381.



  • Jansen, B.J., Spink, A., Saracevic, T. (2000). Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing and Management, 36(2), 207–227. doi:10.1016/S0306-4573(99)00056-4.


  • Karp, R. (1972). Reducibility among combinatorial problems. In Complexity of Computer Computations (pp. 85–103). Plenum Press.


  • Lempel, R., & Moran, S. (2003). Predictive caching and prefetching of query results in search engines. Proceedings of the 12th International Conference on World Wide Web (WWW '03), (pp. 19–28). New York: ACM.

  • Long, X., & Suel, T. (2006). Three-Level Caching for Efficient Query Processing in Large Web Search Engines. World Wide Web, 9(4), 369–395.


  • Ma, H., & Wang, B. (2012). User-aware caching and prefetching query results in web search engines. Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, (pp. 1163–1164). ACM.

  • Markatos, E. (2001). On caching search engine query results. Computer Communications, 24(2), 137–143.


  • Papadakos, P., Armenatzoglou, N., Kopidaki, S., Tzitzikas, Y. (2012). On exploiting static and dynamically mined metadata for exploratory web searching. Knowledge and Information Systems, 30(3), 493–525.


  • Papadakos, P., Theoharis, Y., Marketakis, Y., Armenatzoglou, N., Tzitzikas, Y. (2008). Mitos: Design and evaluation of a DBMS-based web search engine. Panhellenic Conference on Informatics (PCI '08), (pp. 49–53). IEEE.

  • Saraiva, P.C., de Moura, E.S., Fonseca, R.C., Wagner, M., Ribeiro-Neto, B.A., Ziviani, N. (2001). Rank-preserving two-level caching for scalable search engines. SIGIR, (pp. 51–58).

  • Silverstein, C., Marais, H., Henzinger, M., Moricz, M. (1999). Analysis of a very large web search engine query log. SIGIR Forum, 33(1), 6–12. doi:10.1145/331403.331405.


  • Skobeltsyn, G., Junqueira, F., Plachouras, V., Baeza-Yates, R.A. (2008). Resin: a combination of results caching and index pruning for high-performance web search engines. SIGIR, (pp. 131–138).

  • Soffer, A., Carmel, D., Cohen, D., Fagin, R., Farchi, E., Herscovici, M., Maarek, Y.S. (2001). Static index pruning for information retrieval systems. SIGIR, (pp. 43–50).

  • Tzitzikas, Y., Spyratos, N., Constantopoulos, P. (2005). Mediators over taxonomy-based information sources. The VLDB Journal, 14(1), 112–136.


  • Vazirani, V. (2001). Approximation Algorithms. Springer.

  • Xie, Y., & O’Hallaron, D.R. (2002). Locality in search engine queries and its implications for caching. Proceedings of IEEE INFOCOM 2002, vol. 3, (pp. 1238–1247). IEEE.


  • Yao, Y., & Yao, B. (2012). Covering based rough set approximations. Information Sciences, 200, 91–107.


  • Zhang, J., Long, X., Suel, T. (2008). Performance of compressed inverted list caching in search engines. WWW, (pp. 387–396).

  • Zobel, J., & Moffat, A. (2006). Inverted files for text search engines. ACM Computing Surveys, 38(2), 6. doi:10.1145/1132956.1132959.



Acknowledgments

We would like to thank Jim Jansen for providing us with the query logs of the Excite, Altavista and AllTheWeb.com WSEs. Many thanks also to V. Christophides and E. Markatos for their fruitful comments and suggestions on earlier stages of this work, to Panagiotis Papadakos and Christina Lantzaki for proofreading the manuscript, and to the anonymous reviewers for their constructive comments.

Author information


Correspondence to Yannis Tzitzikas.

Appendices

Appendix A: Top-K SCRC: proofs

Proposition 1

The computation of the values \(K_{ex}\) and \(K_{ro}\) is correct.

Proof

(\(K_{ex}\)) Let \(\langle e_1, e_2, e_3, e_4 \rangle\) be the ordering of the documents w.r.t. their certain score. We start from the top of the list and proceed as long as \(score_{cert}(e_i) \geq Missing^{up}\). Recall that \(Missing^{up}\) is the maximum score that an unknown document, say \(e\), can have. It follows that \(score_{cert}(e_i) \geq score(e)\). Since \(score(e_i) \geq score_{cert}(e_i)\), it follows that \(score(e_i) \geq score(e)\). Therefore \(Set(top_{K_{ex}}(Ans_{cache}(q)))\) certainly contains the \(K_{ex}\) most highly scored elements of \(Ans(q)\). (\(K_{ro}\)) We start from the top of the list and proceed as long as \(score_{cert}(e_i) \geq Found^{up}(B \setminus V)\), where \(B\) is the cached list and \(V\) is the set of its elements visited so far. This means that we proceed as long as the following inequalities hold:

$$\begin{array}{@{}rcl@{}} score_{cert}(e_{1}) &\geq& Found^{up}(\{e_{2}, e_{3}, e_{4}\}) \\ score_{cert}(e_{2}) &\geq& Found^{up}(\{e_{3}, e_{4}\}) \\ score_{cert}(e_{3}) &\geq& Found^{up}(\{e_{4}\}) \\ score_{cert}(e_{4}) &\geq& Found^{up}(\emptyset) \end{array} $$

Recall that \(Found^{up}(X)\) is the maximum upper bound of the scores of the elements in \(X\). So if the first inequality holds, then it is impossible that one of \(\{e_2, e_3, e_4\}\) has a score greater than \(score_{cert}(e_1)\). Let us assume that the first two (of the four) inequalities hold. They imply that \(score(e_1) \geq score(e_2) \geq score(e_3)\). So the relative order of \(\{e_1, e_2, e_3\}\) is correct, i.e. as in \(Ans(q)_{|\{e_1, e_2, e_3\}}\). □
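To make the two stopping rules concrete, the following is a minimal Python sketch of how \(K_{ex}\) and \(K_{ro}\) could be computed from a cached top-K list. It is only an illustration under our own naming: the class CachedDoc and the fields score_cert and score_up are ours, \(Missing^{up}\) is assumed to be supplied by the caller, and \(Found^{up}(X)\) is taken to be the maximum of the per-element upper bounds in \(X\), as in the proof.

```python
from dataclasses import dataclass

@dataclass
class CachedDoc:          # our own representation of a cached top-K entry
    doc_id: str
    score_cert: float     # certain score accumulated from the cached subqueries
    score_up: float       # upper bound on the document's full score

def k_exact(docs, missing_up):
    """K_ex: length of the longest prefix whose certain scores dominate
    Missing^up, the best score any unseen document could attain.
    `docs` must be sorted by score_cert in descending order."""
    k = 0
    for d in docs:
        if d.score_cert < missing_up:
            break
        k += 1
    return k

def k_relative_order(docs):
    """K_ro: proceed while score_cert(e_i) >= Found^up of the not-yet-visited
    elements, i.e. the largest upper bound among the remaining entries."""
    k = 0
    for i, d in enumerate(docs):
        found_up = max((r.score_up for r in docs[i + 1:]), default=0.0)
        if d.score_cert < found_up:
            break
        k += 1
    return k
```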

Appendix B: Quality of the approximation

Here we discuss the quality of the approximation returned by the greedy algorithm. Let \(U\) be the set of query terms, and let \(Rem(U)\) be the set of uncovered elements of \(U\) if we run an exhaustive exact set cover algorithm. If an exact set cover exists then obviously \(|Rem(U)| = 0\); recall that this decision problem is NP-Complete. Let \(Rem_G(U)\) be the set of uncovered elements of \(U\) if we run the greedy algorithm. Since we would like to cover all elements of \(U\) (so that the incoming query is answered only by cached queries), we can evaluate the quality of the outcome of the greedy algorithm using the following metric:

$$Z = \frac{|Rem_{G}(U)|-|Rem(U)|}{|U|} $$

Clearly, \(Z = 0\) if either both algorithms return an exact cover, or neither of them returns an exact cover but each leaves the same number of uncovered elements. If the greedy leaves more uncovered elements than the exhaustive algorithm, then \(Z > 0\). The worst value for \(Z\) is 1, corresponding to the case where no elements of \(U\) are covered; we refine this upper bound below.
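Before the case analysis, the following minimal Python sketch makes the greedy routine and the \(Z\) metric concrete. It is an illustration under our own naming (greedy_cover, z_metric) and representation (subqueries as sets of terms), not the paper's implementation; the driver at the bottom reproduces the \(|U| = 3\) failure example analyzed below.

```python
def greedy_cover(universe, family):
    """Greedy heuristic for Exact Set Cover: repeatedly pick the largest
    cached subquery that fits entirely inside the still-uncovered terms
    (so no term is covered twice), stopping when nothing fits.
    Returns the chosen subqueries and Rem_G(U), the uncovered terms."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        candidates = [s for s in family if s <= uncovered]
        if not candidates:
            break
        best = max(candidates, key=len)
        chosen.append(best)
        uncovered -= best
    return chosen, uncovered

def z_metric(universe, rem_greedy, rem_exact):
    """Z = (|Rem_G(U)| - |Rem(U)|) / |U|."""
    return (len(rem_greedy) - len(rem_exact)) / len(universe)

# The |U| = 3 failure example analyzed below: an ESC exists ({a}, {b,c}),
# but the greedy picks {a,b} first and leaves {c} uncovered.
U = {"a", "b", "c"}
F = [frozenset({"a", "b"}), frozenset({"a"}), frozenset({"b", "c"})]
_, rem_g = greedy_cover(U, F)
print(z_metric(U, rem_g, set()))   # 1/3
```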

Let \(F = \{S_1, \ldots, S_k\}\) be a family of subsets of \(U\) (i.e. the lower queries in our problem). Now suppose that there is an exact cover (so the exhaustive algorithm returns YES together with the particular exact cover). Below we discuss the quality of the approximation (using \(Z\)) of the greedy algorithm for various cases:

  • Singleton subqueries. If for each \(u \in U\) there is an \(S_i\) in \(F\) such that \(S_i = \{u\}\), then \(Z = 0\). The proof is trivial.

  • Case \(|U| = 2\). Here \(Z = 0\). The proof is trivial (either \(F\) contains an identical query, or two singleton subqueries).

  • Case \(|U| = 3\). Here \(Z\) can be greater than zero. For example, consider \(U = \{a,b,c\}\) and suppose \(F = \{\{a,b\}, \{a\}, \{b,c\}\}\). An ESC exists (namely \(\{\{a\}, \{b,c\}\}\)), but the greedy could fail (if it selects \(\{a,b\}\) at its first iteration). Here we have \(Z = \frac{1-0}{3} = \frac{1}{3}\). In general, an ESC of a set of terms \(U = \{a,b,c\}\) can be one of the following:

    1. \(\{\{a,b,c\}\}\)

    2. \(\{\{a,b\}, \{c\}\}\)

    3. \(\{\{a\}, \{b,c\}\}\)

    4. \(\{\{a,c\}, \{b\}\}\)

    5. \(\{\{a\}, \{b\}, \{c\}\}\)

    In cases 1 and 5 the greedy algorithm cannot fail. In cases 2, 3 and 4 it can: in case 2 if \(F\) contains the set \(\{a,c\}\), in case 3 if \(F\) contains the set \(\{a,b\}\), and in case 4 if \(F\) contains the set \(\{a,b\}\). It follows that the greedy can fail in at most 3 of the 5 cases, and in each such case \(|Rem_G(U)| - |Rem(U)|\) would be 1, i.e. \(Z = \frac{1}{3}\). Assuming the five covers are equally likely, the expected value of \(Z\) (given that an ESC exists) is therefore at most \(\frac{2}{5} \cdot 0 + \frac{3}{5} \cdot \frac{1}{3} = \frac{1}{5} = 0.2\).

  • Case \(|U| \leq 3\). This is the most frequent case, as previous works on query length analysis have shown (Table 19 summarizes such results). It was also confirmed by our measurements: the analysis of the logs presented in Section 7 shows that \(|U| \leq 3\) holds for 87 % of the queries. Specifically, 27.7 % are single-word queries, 38 % are two-word queries, and 21 % are three-word queries. According to the aforementioned cases, and based on the frequencies of the queries in our query logs, the expected value of \(Z\) for queries with up to 3 words is \(E(Z) = P(|U|=1) \cdot 0 + P(|U|=2) \cdot 0 + P(|U|=3) \cdot 0.2 = \frac{0.277}{0.87} \cdot 0 + \frac{0.38}{0.87} \cdot 0 + \frac{0.21}{0.87} \cdot 0.2 \approx 0.048\).

  • The general case (including \(|U| > 3\)). If an ESC exists then \(F\) contains at least one subset of \(U\), and the greedy algorithm will select the biggest one. Let \(m = \max_i |S_i|\) (note that \(1 \leq m \leq |U|\)). If the greedy then fails to find any other set, we get \(Z = \frac{(|U|-m) - 0}{|U|} = 1 - \frac{m}{|U|}\). It follows that this is an upper bound of \(Z\) for the general case.

Table 19 Query log analysis of previous works

Appendix C: An indicative experimental comparison with PLC

Here we report additional experimental results which demonstrate that the SCRC is faster than both the RC and the PLC. We used the same experimental setup as the one described in Section 7.6 and tested the performance of the SCRC over the Mitos WSE using the same datasets (Excite, Altavista). The same training sets were used for filling the PLC and the same test sets were used for submitting queries to the Mitos WSE and evaluating its performance.

Each uncompressed posting list \(I(t)\) of a term \(t\) in the posting lists cache (PLC) consists of pairs of the form \((d_i, tf_{d_i,t})\). We selected the terms to be cached using the \(QTFDF\) scheme proposed in Baeza-Yates et al. (2007), which suggests caching the terms with the highest \(\frac{pop(t)}{df(t)}\) ratio, where \(pop(t)\) is the popularity of the term \(t\) in the evaluation queries and \(df(t)\) is its document frequency. It has been shown that this scheme outperforms other schemes (e.g. caching the terms with the highest query-term frequencies (Baeza-Yates and Saint-Jean 2003), referred to as \(QTF\) in Baeza-Yates et al. (2007)).
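As an illustration of this admission policy, here is a small Python sketch of a greedy PLC fill under a memory budget. The function name select_plc_terms, the dictionaries pop, df and size, and the numbers in the driver are all our own assumptions, not the paper's implementation.

```python
def select_plc_terms(pop, df, size, budget_bytes):
    """Greedy PLC admission under the QTF/DF scheme: rank terms by
    pop(t)/df(t) and admit them until the memory budget is exhausted.
    pop, df, size map a term to its query-log popularity, document
    frequency, and posting list size in bytes (our own names)."""
    ranked = sorted(pop, key=lambda t: pop[t] / df[t], reverse=True)
    cached, used = [], 0
    for t in ranked:
        if used + size[t] <= budget_bytes:
            cached.append(t)
            used += size[t]
    return cached

# toy illustration (made-up numbers)
pop  = {"obama": 90, "the": 100, "nobel": 30}
df   = {"obama": 1000, "the": 100000, "nobel": 200}
size = {"obama": 4000, "the": 400000, "nobel": 800}
print(select_plc_terms(pop, df, size, budget_bytes=10000))  # ['nobel', 'obama']
```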

Figures 14 and 15 illustrate the average query response time of the Mitos WSE when employing the RC, the PLC and the SCRC, over the Altavista set and the Excite set respectively. The first column depicts the average query evaluation time when answering from the index alone (no cache); the second, third and fourth columns show the average response time of the Mitos WSE when using the PLC, the RC and the SCRC, respectively.

Fig. 14: Average query response time over the Altavista query set (in ms)

Fig. 15: Average query response time over the Excite query set (in ms)

We observe that the RC is faster than the PLC, and that the SCRC outperforms both the RC and the PLC in all cases. As the cache size increases, the speedup obtained by the SCRC is significantly higher than that obtained by the RC and the PLC.

In Fig. 14, we observe that when the cache size is medium (i.e. \(M = 1\) MB), the SCRC is at least 2 times faster than the RC and 3 times faster than the PLC. When the cache size is large (i.e. \(M = 10\) MB), the SCRC is 4 times faster than the RC and 5 times faster than the PLC.

In Fig. 15, we observe that when the cache size is medium (i.e. \(M = 5\) MB), the SCRC is 2 times faster than both the RC and the PLC. When the cache size is larger (i.e. \(M = 15\) MB), the SCRC is 3 times faster than both the RC and the PLC.

Appendix D: Decomposability and exact set cover

In the following example, we show why a plain (non-exact) set cover would lead to a wrong computation of the scores of the documents in the final answer \(Ans(q)\) of a query \(q\) in a best-match retrieval model.

Example 1

Consider a document collection that consists of only one document \(d\), where \(d\) = “barack obama nobel prize”, and assume that the WSE uses the variant of the decomposable scoring function of the Vector Space Model described in Section 3 (the one that ignores \(W_q\)) for assigning scores to the matching documents. The score of a document \(d\) w.r.t. the query \(q\) is given by:

$$\begin{array}{@{}rcl@{}} Sim'_{cos}(d,q) &=& \frac{\sum\limits_{t\in t(d) \cap t(q)}{w_{d,t} \cdot w_{q,t}}} {W_{d}} \end{array} $$

Assume queries \(q_{c_{1}}\)=“barack obama” and \(q_{c_{2}}\)=“obama nobel prize”.

Let us compute the score of the document \(d\) for each of these queries. Assuming the alphabetical term ordering \(\langle\)barack, nobel, obama, prize\(\rangle\), the vector of the document \(d\) is \(\overrightarrow{d} = \{1,1,1,1\}\) and the vectors of the queries \(q_{c_{1}}\) and \(q_{c_{2}}\) are \(\overrightarrow{q_{c_{1}}} = \{1,0,1,0\}\) and \(\overrightarrow{q_{c_{2}}} = \{0,1,1,1\}\) respectively. The score of the document \(d\) w.r.t. the queries \(q_{c_{1}}\) and \(q_{c_{2}}\) is:

$$\begin{array}{@{}rcl@{}} Sim^{\prime}_{cos}(d,q_{c_{1}}) &=& \frac{1 \cdot 1 + 1 \cdot 0 + 1 \cdot 1 + 1 \cdot 0}{\sqrt{2}} = \frac{2}{\sqrt{2}} = \sqrt{2} \approx 1.41 \\ Sim^{\prime}_{cos}(d,q_{c_{2}}) &=& \frac{1 \cdot 0 + 1 \cdot 1 + 1 \cdot 1 + 1 \cdot 1}{\sqrt{2}} = \frac{3}{\sqrt{2}} = \frac{3\sqrt{2}}{2} \approx 2.12 \end{array} $$

Hence, \(Ans(q_{c_{1}})=\{(d,1.41)\}\) and \(Ans(q_{c_{2}})=\{(d,2.12)\}\).

Assume now that the query \(q\) = “barack obama nobel prize” is submitted to the WSE. Then \(C = \{q_{c_{1}}, q_{c_{2}}\}\) is a set cover of \(t(q)\), since \(t(q_{c_{1}}) \cup t(q_{c_{2}}) = t(q)\). However, \(C\) is not an exact set cover, since \(t(q_{c_{1}}) \cap t(q_{c_{2}}) \neq \emptyset\). We now show why the score of the document \(d\) derived through the set cover \(C\) is not equal to \(Sim^{\prime}_{cos}(d,q)\). The vector of the query \(q\) is \(\overrightarrow{q} = \{1,1,1,1\}\) and the score of the document \(d\) w.r.t. \(q\) is computed as:

$$Sim^{\prime}_{cos}(d,q) = \frac{1 \cdot 1 + 1 \cdot 1 + 1 \cdot 1 + 1 \cdot 1}{\sqrt{2}} = \frac{4}{\sqrt{2}} = 2\sqrt{2} \approx 2.83 $$

Now we compute the score of the documents in \(Ans_C(q)\), which is the union of the documents in the answers of the queries in \(C\). Hence, \(Ans_C(q) = \{d\}\).

The score \(Score(d,q)\) of a document \(d \in Ans_C(q)\) is the sum of the scores that \(d\) received in the answers of \(C\). Hence, we have:

$$Score(d,q) = \sum\limits_{q_{c} \in C}{Score(d,q_{c})} = 1.41 + 2.12 = 3.53 \neq 2.83 = Sim^{\prime}_{cos}(d,q) $$
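To check these numbers mechanically, here is a small Python sketch of the example. The helper sim_cos is our own, the normalisation value \(W_d = \sqrt{2}\) is taken from the example, and the disjoint subquery {"nobel", "prize"} in the last line is a hypothetical cached query added to show that an exact cover is score-preserving.

```python
import math

def sim_cos(doc_terms, query_terms, W_d):
    """Variant cosine score of Example 1: binary term weights,
    normalised by the document-side factor W_d only."""
    return sum(1 for t in query_terms if t in doc_terms) / W_d

d   = {"barack", "obama", "nobel", "prize"}
q   = {"barack", "obama", "nobel", "prize"}
qc1 = {"barack", "obama"}
qc2 = {"obama", "nobel", "prize"}
W_d = math.sqrt(2)                 # the normalisation value used in the example

direct  = sim_cos(d, q, W_d)                           # 4/sqrt(2), about 2.83
overlap = sim_cos(d, qc1, W_d) + sim_cos(d, qc2, W_d)  # 1.41 + 2.12 = 3.53
exact   = sim_cos(d, qc1, W_d) + sim_cos(d, {"nobel", "prize"}, W_d)

print(direct, overlap, exact)  # 2.83 3.53 2.83: only the exact cover preserves the score
```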


Cite this article

Papadakis, M., Tzitzikas, Y. Answering keyword queries through cached subqueries in best match retrieval models. J Intell Inf Syst 44, 67–106 (2015). https://doi.org/10.1007/s10844-014-0330-7

