
Journal of Intelligent Information Systems, Volume 44, Issue 1, pp 67–106

Answering keyword queries through cached subqueries in best match retrieval models

  • Myron Papadakis
  • Yannis Tzitzikas
Article

Abstract

Caching is one of the techniques that Information Retrieval Systems (IRSs) and Web Search Engines (WSEs) use to reduce processing costs and attain faster response times. In this paper we introduce Top-K SCRC (Set Cover Results Cache), a novel results caching technique that aims at maximizing cache utilization. Identical queries are treated as in plain results caching (i.e., their evaluation does not require accessing the index). Combinations of cached subqueries are exploited as in posting-list caching, except that the exploited subqueries are not necessarily single-word queries. The problem of finding the right set of cached subqueries to answer an incoming query is an instance of the Exact Set Cover problem. The technique can be applied in any best-match retrieval model that is based on a decomposable scoring function, and we show that several best-match retrieval models (namely VSM, Okapi BM25, and hybrid retrieval models) rely on such scoring functions. To increase the capacity of the cache (measured in queries), only the top-K results of each cached query are stored, and we introduce metrics for measuring the accuracy of the composed top-K answer. By analyzing queries submitted to real-world WSEs, we verified that for a significant proportion of queries the set of terms is the union of the term sets of other queries. The comparative evaluation over traces of real query sets showed that Top-K SCRC is on average twice as fast as a plain Top-K RC for the same cache size.
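The idea of composing an answer from cached subqueries can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an additive (decomposable) scoring function, represents each cached query as a set of terms, and uses a simple backtracking search for an exact cover. The names `find_exact_cover`, `answer_from_cache`, and the `Cache` type are hypothetical. Because only top-K lists are stored, the composed ranking may differ from the one computed against the index.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

# Hypothetical cache layout: set of query terms -> {doc id: partial score}.
Cache = Dict[FrozenSet[str], Dict[str, float]]


def find_exact_cover(
    query: FrozenSet[str], cached: List[FrozenSet[str]]
) -> Optional[List[FrozenSet[str]]]:
    """Backtracking search for pairwise-disjoint cached term sets
    whose union is exactly `query`. Returns None if no cover exists."""
    if not query:
        return []
    pivot = min(query)  # deterministic choice of the next term to cover
    for key in cached:
        # A usable subquery must cover the pivot and stay inside the query.
        if pivot in key and key <= query:
            rest = find_exact_cover(query - key, cached)
            if rest is not None:
                return [key] + rest
    return None


def answer_from_cache(
    query_terms: Iterable[str], cache: Cache, k: int
) -> Optional[List[Tuple[str, float]]]:
    """Return a composed top-k list, or None when the cache cannot answer."""
    query = frozenset(query_terms)
    if query in cache:
        # Identical query: behaves like a plain results-cache hit.
        scores = dict(cache[query])
    else:
        cover = find_exact_cover(query, list(cache))
        if cover is None:
            return None  # a real system would fall back to the index here
        scores = {}
        for key in cover:
            # Assumption: the score of a document is the sum of its
            # scores for disjoint subqueries (additive decomposition).
            for doc, score in cache[key].items():
                scores[doc] = scores.get(doc, 0.0) + score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

In this sketch the subqueries in a cover are disjoint, so no term contributes to a document's score twice. The backtracking search is exponential in the worst case, which is consistent with Exact Set Cover being NP-complete.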

Keywords

Information retrieval, Query processing, Retrieval models, Ranking, Web search engines, Query log analysis

Notes

Acknowledgments

We would like to thank Jim Jansen for providing us with the query logs of the Excite, AltaVista and AllTheWeb.com WSEs. Many thanks also to V. Christophides and E. Markatos for their fruitful comments and suggestions on earlier stages of this work, to Panagiotis Papadakos and Christina Lantzaki for proofreading the manuscript, and to the anonymous reviewers for their constructive comments.

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Institute of Computer Science (ICS), Foundation for Research and Technology - Hellas (FORTH), Science and Technology Park of Crete, Heraklion, Crete, Greece
  2. Computer Science Department, University of Crete, Heraklion, Crete, Greece
