Abstract
The Selective Search approach processes large document collections efficiently by partitioning the collection into topically homogeneous groups (shards) and searching only the few shards that are estimated to contain documents relevant to the query. The ability to identify the relevant shards for a query directly impacts Selective Search performance. We therefore investigate three new approaches to the shard ranking problem, and three techniques for estimating how many of the top shards should be searched for a query (shard rank cutoff estimation). One approach learns a highly effective shard ranking model within the popular learning-to-rank framework. Another leverages the topical organization of the collection together with pseudo-relevance feedback (PRF) to improve search performance further. Empirical evaluation on a large collection demonstrates statistically significant improvements over strong baselines. The experiments also show that shard rank cutoff estimation is essential for balancing search precision and recall.
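The two ideas in the abstract, ranking shards by estimated relevance and cutting the ranking at a query-specific depth, can be illustrated with a minimal sketch. This is not the paper's learned model; the per-shard scores and the cumulative-mass cutoff rule below are hypothetical stand-ins used only to show the control flow of shard selection.

```python
# Hypothetical sketch of selective search's shard selection step.
# `shard_scores` stands in for whatever relevance estimate a shard
# ranker produces for a query; the cumulative-mass rule is one simple
# illustrative cutoff policy, not the estimators evaluated in the paper.

def rank_shards(shard_scores):
    """Return shard ids sorted by descending estimated relevance."""
    return sorted(shard_scores, key=shard_scores.get, reverse=True)

def cutoff_by_mass(shard_scores, ranked, mass=0.8):
    """Keep the fewest top shards whose scores cover `mass` of the total.

    A low `mass` favors efficiency (fewer shards searched); a high
    `mass` favors recall - the precision/recall balance the paper
    attributes to cutoff estimation.
    """
    total = sum(shard_scores.values())
    covered, k = 0.0, 0
    for sid in ranked:
        covered += shard_scores[sid]
        k += 1
        if covered >= mass * total:
            break
    return ranked[:k]

# Toy query: four shards with made-up relevance estimates.
scores = {"s1": 0.50, "s2": 0.25, "s3": 0.15, "s4": 0.10}
ranked = rank_shards(scores)
selected = cutoff_by_mass(scores, ranked, mass=0.8)
```

With these toy scores the cutoff keeps three of the four shards, so only those shards' indexes would be searched for the query.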
© 2017 Springer International Publishing AG
Cite this paper
Chuang, M.S., Kulkarni, A. (2017). Improving Shard Selection for Selective Search. In: Sung, W.-K., et al. (eds.) Information Retrieval Technology. AIRS 2017. Lecture Notes in Computer Science, vol. 10648. Springer, Cham. https://doi.org/10.1007/978-3-319-70145-5_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70144-8
Online ISBN: 978-3-319-70145-5
eBook Packages: Computer Science (R0)