Advanced Information Retrieval Measures
Advanced information retrieval measures are effectiveness measures for various types of information access tasks that go beyond traditional document retrieval. Traditional document retrieval measures are suitable for set retrieval (measured by precision, recall, F-measure, etc.) or ad hoc ranked retrieval, the task of ranking documents by relevance (measured by average precision, etc.). In contrast, advanced information retrieval measures may be applied to diversified search (the task of retrieving relevant and diverse documents), aggregated search (the task of retrieving from multiple sources/media and merging the results), one-click access (the task of returning a textual multidocument summary instead of a list of URLs in response to a query), and multiquery sessions (information-seeking activities that involve query reformulations), among other tasks. Some advanced measures are based on user models that arguably better reflect real user behaviors than standard measures do.
More Graded Relevance Measures
Historically, information retrieval evaluation mainly dealt with binary relevance: relevant or nonrelevant. But after Järvelin and Kekäläinen proposed normalized discounted cumulative gain (nDCG) in 2002, it quickly became the most widely used evaluation measure that utilizes graded relevance. For example, it is known that search engine companies routinely use (variants of) nDCG for internal evaluations. Interestingly, nDCG happens to be a variant of the normalized sliding ratio measure proposed by Pollock in 1968. The advent of web search, where some web pages can be much more important than others (e.g., in home page finding), is probably one factor that made nDCG so popular.
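As a minimal sketch of how nDCG rewards graded relevance, the following Python fragment computes one common variant: each document's gain is divided by the base-2 logarithm of (rank + 1), and the total is normalized by the same quantity for an ideal ranking. The discount function, the gain values, and the function names are assumptions made for illustration; other variants use different discounts and gain mappings.

```python
import math

def dcg(gains, cutoff):
    """Discounted cumulative gain at the cutoff, using a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:cutoff], start=1))

def ndcg(ranked_gains, all_gains, cutoff=10):
    """DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_gains: gain values of the retrieved documents, in rank order.
    all_gains: gain values of every judged relevant document for the topic.
    """
    ideal = sorted(all_gains, reverse=True)
    ideal_dcg = dcg(ideal, cutoff)
    return dcg(ranked_gains, cutoff) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical topic: gain 3 = highly relevant, 1 = partially relevant, 0 = nonrelevant.
system_run = [1, 3, 0, 0, 1]
judged = [3, 1, 1]
score = ndcg(system_run, judged, cutoff=5)
print(f"nDCG@5 = {score:.4f}")
```

Because the highly relevant document is at rank 2 rather than rank 1, the score falls below 1 even though every relevant document is retrieved.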
nDCG, however, is not the only graded-relevance measure available. Q-measure, a simple variant of average precision, was proposed in 2004. (It is called Q-measure because it was originally designed for factoid question answering with equivalence classes of correct answers.) Q-measure has been used in several tasks of NTCIR (NII Testbeds and Community for Information access Research, http://research.nii.ac.jp/ntcir/): cross-lingual information retrieval, information retrieval for question answering, geotemporal information retrieval, and community question answering. Expected Reciprocal Rank (ERR), proposed in 2009, has an intuitive user behavior model particularly suitable for navigational search where only a few relevant documents are required. ERR is now a popular evaluation measure.
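The two measures can be sketched as follows. Q-measure is written here as average precision with the precision at each relevant rank replaced by a blended ratio that also takes cumulative gain into account; ERR is written with the cascade model in which a user scanning the list stops at a document with a probability that grows with its relevance grade. The parameter beta, the grade-to-probability mapping, and the function names are assumptions for illustration, not a reference implementation.

```python
def q_measure(ranked_gains, all_gains, beta=1.0):
    """Q-measure: average over relevant documents of a blended ratio.

    With beta = 0 the blended ratio is plain precision, so the result
    reduces to average precision.
    """
    ideal = sorted((g for g in all_gains if g > 0), reverse=True)
    num_relevant = len(ideal)
    if num_relevant == 0:
        return 0.0
    rel_count = 0          # number of relevant documents seen so far
    cum_gain = 0.0         # cumulative gain of the system ranking
    ideal_cum_gain = 0.0   # cumulative gain of the ideal ranking at the same rank
    total = 0.0
    for rank, gain in enumerate(ranked_gains, start=1):
        if rank <= num_relevant:
            ideal_cum_gain += ideal[rank - 1]
        if gain > 0:
            rel_count += 1
            cum_gain += gain
            total += (rel_count + beta * cum_gain) / (rank + beta * ideal_cum_gain)
    return total / num_relevant

def err(ranked_grades, max_grade):
    """Expected reciprocal rank under a cascade user model."""
    p_reach = 1.0  # probability that the user reaches the current rank
    score = 0.0
    for rank, grade in enumerate(ranked_grades, start=1):
        p_satisfied = (2 ** grade - 1) / (2 ** max_grade)
        score += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return score

# Hypothetical run with grades 0..3 used both as gains and as ERR grades.
run = [1, 3, 0, 0, 1]
print(q_measure(run, [3, 1, 1]), err(run, max_grade=3))
```

In the ERR sketch, a highly relevant document near the top absorbs most of the probability mass, so documents below it contribute little; this is the diminishing return behavior that suits navigational search.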
Diversified Search Measures
If the user’s query is ambiguous or underspecified, it is difficult for a search engine to guess the intent of the user. To make the same first search result page an acceptable entry point for different users with different intents but sharing the same query, search result diversification has recently become a popular research topic. In 2003, Zhai, Cohen, and Lafferty defined subtopic retrieval as well as an evaluation measure called subtopic recall: given a query, how many of its possible intents does the set of retrieved documents cover? Subtopic recall is also known as intent recall. Diversified search effectiveness measures, α-nDCG and ERR-IA, were proposed originally in 2008 and in 2009, respectively, and TREC launched the web track diversity task in 2009, using these measures for evaluating participating systems. As a third alternative for evaluating diversified search, D-measure (an instance of which is D-nDCG) was proposed in 2011: unlike α-nDCG and ERR-IA, this new family of diversity measures is free from the NP-complete problem of computing the ideal ranked list. D-nDCG and its variant D♯-nDCG (a linear combination of D-nDCG and intent recall) have been used in the NTCIR INTENT task since 2011.
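The following sketch illustrates intent recall, D-nDCG, and D♯-nDCG under stated assumptions. Each document receives a global gain, taken here as the sum of its per-intent gains weighted by the intent probabilities, and the ideal list is obtained simply by sorting documents by global gain, which is why no NP-complete optimization is involved. The log2 discount, the equal weighting gamma = 0.5, and all names and data are illustrative assumptions.

```python
import math

def intent_recall(ranking, per_intent_gains, cutoff=10):
    """Fraction of intents covered by at least one relevant document in the top ranks."""
    top = set(ranking[:cutoff])
    covered = sum(1 for gains in per_intent_gains.values()
                  if any(gains.get(doc, 0) > 0 for doc in top))
    return covered / len(per_intent_gains)

def global_gain(doc, intent_probs, per_intent_gains):
    """Per-intent gains of a document, weighted by intent probability."""
    return sum(p * per_intent_gains[intent].get(doc, 0)
               for intent, p in intent_probs.items())

def d_ndcg(ranking, intent_probs, per_intent_gains, cutoff=10):
    """nDCG computed over global gains; the ideal list is a plain sort."""
    def dcg(gains):
        return sum(g / math.log2(rank + 1)
                   for rank, g in enumerate(gains[:cutoff], start=1))
    judged = {doc for gains in per_intent_gains.values() for doc in gains}
    ideal = sorted((global_gain(d, intent_probs, per_intent_gains) for d in judged),
                   reverse=True)
    system = [global_gain(d, intent_probs, per_intent_gains) for d in ranking]
    ideal_dcg = dcg(ideal)
    return dcg(system) / ideal_dcg if ideal_dcg > 0 else 0.0

def d_sharp_ndcg(ranking, intent_probs, per_intent_gains, cutoff=10, gamma=0.5):
    """Linear combination of intent recall and D-nDCG."""
    return (gamma * intent_recall(ranking, per_intent_gains, cutoff)
            + (1.0 - gamma) * d_ndcg(ranking, intent_probs, per_intent_gains, cutoff))

# Hypothetical query with two intents and their known probabilities.
probs = {"a": 0.7, "b": 0.3}
gains = {"a": {"d1": 3, "d2": 1}, "b": {"d3": 2, "d2": 1}}
print(d_sharp_ndcg(["d1", "d4", "d2"], probs, gains, cutoff=3))
```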
Beyond Document Retrieval: Beyond Single Queries
For some types of queries and search environments, returning a concise, direct answer may be more effective than returning the so-called ten blue links. Hence information retrieval evaluation should probably go beyond document retrieval and beyond evaluation based on document-level relevance. The task of returning a direct answer in response to a query is similar to question answering and query-focussed text summarization; for evaluating these related tasks, the evaluation unit is often nuggets or semantic content units, which are more fine grained than documents or passages. S-measure, proposed in 2011, is an evaluation measure that in a way bridges the gap between textual output evaluation and information retrieval evaluation: while S-measure is based on nugget-level relevance, it also has a discounting mechanism similar to that of nDCG. While nDCG and other information retrieval measures discount the value of a relevant document based on its rank, S-measure discounts the value of a relevant piece of text based on its position within the textual output.
U-measure [12, 15], proposed in 2013, generalizes the idea of S-measure: it can evaluate question answering, query-focussed summarization, ad hoc and diversified document retrieval, and even multiquery sessions. The key concept behind U-measure is the trail text, which represents all pieces of texts read by the user during an information-seeking activity. U-measure discounts the value of each relevant text fragment based on how much text the user has read so far. Unlike common evaluation measures, U-measure does not rely on the linear traversal assumption, which says that the user always scans the ranked list from top to bottom. For example, given some click data, U-measure can quantify the difference between a linear traversal and a nonlinear traversal with the same set of clicks. Measures related to U-measure include Time-Biased Gain (TBG) by Smucker and Clarke and session-based measures described in Kanoulas et al., but the measures as described in these papers assume linear traversal. Another advantage of U-measure over rank-based effectiveness measures is that it can (just like TBG) consider realistic user behaviors such as reading snippets and reading documents of various lengths. Just like ERR, U-measure possesses the diminishing return property.
Evaluating Evaluation Measures
So which evaluation measures should researchers choose? The most important selection criterion is whether the measures are measuring what we want to measure: this may be tested to some extent through user studies. Another important aspect is the statistical stability of the measures: do the measures give us reliable conclusions? One moderately popular method for comparing measures from this viewpoint is the discriminative power proposed in 2006: a significance test is conducted for every pair of systems, and the sorted p-values (i.e., the probability of obtaining the observed difference or something more extreme under the null hypothesis) are plotted against the system pairs. Another way to compare the statistical stability of measures is to utilize the sample size design technique: to achieve a given level of statistical power, how many topics does each evaluation measure require? The latter method enables us to compare measures in terms of practical significance, as the number of topics is basically proportional to the total relevance assessment cost. From the statistical viewpoint, it is known that measures like nDCG and Q-measure are much more reliable than others such as ERR; as for diversity measures, D-nDCG is much more reliable than α-nDCG and ERR-IA [11, 15]. While the diminishing return property of ERR is intuitive and important, this very property makes the measure rely on a small number of data points (i.e., retrieved relevant documents) and thereby hurts statistical stability.
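The discriminative power procedure can be sketched as follows: for one evaluation measure, run a significance test on every pair of systems, sort the resulting p-values, and optionally report the fraction of pairs found significant at a given level. This sketch swaps in an exact paired sign-flip randomization test, which is not necessarily the test used in the cited work, and it enumerates all sign patterns, so it is only practical for a small number of topics. All names and scores are hypothetical.

```python
from itertools import combinations, product

def paired_randomization_p(scores_a, scores_b):
    """Two-sided exact sign-flip test on per-topic score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        # Small tolerance guards against floating-point noise in the comparison.
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return extreme / total

def sorted_p_values(per_topic_scores):
    """p-values for all system pairs under one measure, in ascending order."""
    return sorted(
        paired_randomization_p(per_topic_scores[x], per_topic_scores[y])
        for x, y in combinations(sorted(per_topic_scores), 2)
    )

def discriminative_power(per_topic_scores, alpha=0.05):
    """Fraction of system pairs with a p-value below alpha."""
    p_values = sorted_p_values(per_topic_scores)
    return sum(1 for p in p_values if p < alpha) / len(p_values)

# Hypothetical per-topic scores of three systems on eight topics.
scores = {
    "A": [0.9, 0.8, 0.85, 0.9, 0.7, 0.95, 0.8, 0.9],
    "B": [0.5, 0.4, 0.55, 0.6, 0.3, 0.45, 0.5, 0.4],
    "C": [0.85, 0.85, 0.8, 0.95, 0.75, 0.9, 0.85, 0.85],
}
print(sorted_p_values(scores), discriminative_power(scores))
```

Repeating this for each candidate measure on the same runs and comparing the sorted p-value curves indicates which measure separates more system pairs with the same topic set.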
In an ad hoc information retrieval task (i.e., ranking documents by relevance to a given query), it is assumed that we have a test collection consisting of a set of topics (i.e., search requests), a set of (graded) relevance assessments for each topic, and a target document corpus. The topic set is assumed to be a sample from the population, and systems are often compared in terms of mean Q-measure, mean ERR, and so on. In the case of diversified search, it is assumed that each topic has a known set of intents and that a set of (graded) relevance assessments is available for each intent. In addition, the intent probability given the query is assumed to be known for each intent. Systems are compared in terms of mean α-nDCG, mean ERR-IA, mean D-nDCG, etc. Confidence intervals, p-values, and effect sizes should be reported with these evaluation results.
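A small sketch of such a comparison over a topic set is given below. It reports the mean score of each system, the mean per-topic difference, a paired t statistic, and an effect size taken here as the mean difference divided by the sample standard deviation of the differences, which is only one of several effect size definitions. Turning the t statistic into a p-value or a confidence interval requires the t distribution from a statistics library and is left out. All names and scores are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_summary(scores_a, scores_b):
    """Summary statistics for two systems evaluated on the same topics."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    d_mean = mean(diffs)
    d_sd = stdev(diffs)  # sample standard deviation of the per-topic differences
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "mean_diff": d_mean,
        "effect_size": d_mean / d_sd,
        "t_statistic": d_mean / (d_sd / math.sqrt(n)),
        "degrees_of_freedom": n - 1,
    }

# Hypothetical per-topic scores of two systems under one measure.
summary = paired_summary([0.5, 0.6, 0.7], [0.4, 0.4, 0.4])
print(summary)
```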
In addition to the information available from a test collection, U-measure (just like TBG) requires the document lengths of retrieved documents or their estimates. U-measure may also be used with time-stamped clicks instead of relevance assessments, in which case, unlike common effectiveness measures, it can quantify nonlinear traversals.
Advanced information retrieval measures are very important for designing effective search engines. Effective search engines should accommodate not only desktop PC users but also mobile and wearable device users.
The future of information retrieval evaluation has been discussed recently in a SIGIR Forum paper; it is possible that information retrieval evaluation will move more toward information retrieval rather than document retrieval or “ten blue links” and toward mobile and ubiquitous information access rather than desktop.
Test collections for evaluating various information retrieval and access tasks are available from evaluation forums such as TREC (http://trec.nist.gov), NTCIR (http://research.nii.ac.jp/ntcir/), and CLEF (http://www.clef-initiative.eu/).
Tools for evaluating advanced information retrieval measures include trec_eval (http://trec.nist.gov/trec_eval/), ndeval (https://github.com/trec-web/trec-web-2013/tree/master/src/eval), and NTCIREVAL (http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html). A few tools for evaluating evaluation measures in terms of discriminative power are available from http://www.f.waseda.jp/tetsuya/tools.html.
- 2. Chapelle O, Metzler D, Zhang Y, Grinspan P. Expected reciprocal rank for graded relevance. In: ACM CIKM 2009, Hong Kong. 2009. p. 621–30.
- 4. Clarke CLA, Craswell N, Soboroff I, Ashkan A. A comparative analysis of cascade measures for novelty and diversity. In: ACM WSDM 2011, Hong Kong. 2011. p. 75–84.
- 6. Kanoulas E, Carterette B, Clough PD, Sanderson M. Evaluating multi-query sessions. In: ACM SIGIR 2011, Beijing. 2011. p. 1026–53.
- 7. Moffat A, Zobel J. Rank-biased precision for measurement of retrieval effectiveness. ACM TOIS. 2008;27(1):2:1–2:27.
- 9. Robertson SE, Kanoulas E, Yilmaz E. Extending average precision to graded relevance judgments. In: ACM SIGIR 2010, Geneva. 2010. p. 603–10.
- 12. Sakai T, Dou Z. Summaries, ranked retrieval and sessions: a unified framework for information access evaluation. In: ACM SIGIR 2013, Dublin. 2013. p. 473–82.
- 13. Sakai T, Song R. Evaluating diversified search results using per-intent graded relevance. In: ACM SIGIR 2011, Beijing. 2011. p. 1043–52.
- 14. Sakai T, Kato MP, Song YI. Click the search button and be happy: evaluating direct and immediate information access. In: ACM CIKM 2011, Glasgow. 2011. p. 621–30.
- 15. Sakai T. Metrics, statistics, tests. In: PROMISE winter school 2013: bridging between information retrieval and databases, Bressanone. LNCS, vol 8173. 2014.
- 16. Smucker MD, Clarke CLA. Time-based calibration of effectiveness measures. In: ACM SIGIR 2012, Portland. 2012. p. 95–104.
- 17. Zhai C, Cohen WW, Lafferty J. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In: ACM SIGIR 2003, Toronto. 2003. p. 10–7.