Abstract
Retrievability is a measure of access that quantifies how easily documents can be found using a retrieval system. Such a measure is of particular interest within the patent domain: if a retrieval system makes some patents hard to find, a patent searcher could miss important and relevant patents because of the retrieval system. In this chapter, we describe measures of retrievability and how they can be applied to measure the overall access to a collection given a retrieval system. We then identify three features of best-match retrieval models that are hypothesised to improve access to all documents in the collection: sensitivity to term frequency, length normalization and convexity. Since patent searchers tend to favour Boolean models over best-match models, we propose hybrid retrieval models that incorporate these features while preserving the desirable aspects of the traditional Boolean model. An empirical study conducted on four large patent corpora demonstrates that these hybrid models provide better access to the corpus of patents than the traditional Boolean model.
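The abstract does not state the formulas behind these measures, so the following is only a minimal sketch under stated assumptions. It assumes a cumulative, cutoff-based retrievability score (a document scores one point for each query that returns it within the top `cutoff` ranks) and summarises overall access with a Gini coefficient over those scores, where 0 means every document is equally retrievable and values near 1 mean access is concentrated on a few documents. The function names, the unweighted queries and the toy rankings are hypothetical and are not taken from the chapter.

```python
from typing import Dict, Iterable, Sequence


def cumulative_retrievability(
    ranked_lists: Iterable[Sequence[str]],
    doc_ids: Iterable[str],
    cutoff: int,
) -> Dict[str, int]:
    """Count, per document, how many queries return it within the top `cutoff` ranks.

    Assumption: every query carries equal weight.
    """
    scores = {doc_id: 0 for doc_id in doc_ids}
    for ranking in ranked_lists:
        for doc_id in ranking[:cutoff]:
            if doc_id in scores:
                scores[doc_id] += 1
    return scores


def gini(values: Iterable[float]) -> float:
    """Gini coefficient of non-negative values (standard form, normalised by n)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # no documents or no retrievals: treat as perfectly equal
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


if __name__ == "__main__":
    # Hypothetical ranked results for three queries over a four-document corpus.
    rankings = [["d1", "d2", "d3"], ["d1", "d2"], ["d1", "d4", "d2"]]
    r = cumulative_retrievability(rankings, ["d1", "d2", "d3", "d4"], cutoff=2)
    print(r)
    print(round(gini(r.values()), 3))
```

Under these assumptions, two retrieval models can be compared on the same query set by computing the Gini coefficient for each: the model with the lower coefficient spreads access more evenly across the collection.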
Notes
1. Paradoxically, the output of such a model is either 1 or 0, and this contains less information than the real number yielded by best-match models.
Acknowledgements
The work described in this chapter was supported and partly funded by Matrixware. I would like to thank the Information Retrieval Facility for their computation services. I would also like to thank Leif Azzopardi, Tamara Polajnar, Richard Glassey and Desmond Elliott for their helpful comments and suggestions on how to improve this work.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Bache, R. (2011). Measuring and Improving Access to the Corpus. In: Lupu, M., Mayer, K., Tait, J., Trippe, A. (eds) Current Challenges in Patent Information Retrieval. The Information Retrieval Series, vol 29. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19231-9_7
DOI: https://doi.org/10.1007/978-3-642-19231-9_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19230-2
Online ISBN: 978-3-642-19231-9
eBook Packages: Computer Science, Computer Science (R0)