Every year more than 25 test collections are built across the main Information Retrieval (IR) evaluation campaigns. They are extremely important in IR because they become the standard evaluation practice for the years that follow. Test collections are built mostly using the pooling method. The main advantage of this method is that it drastically reduces the number of documents to be judged. It does so at the cost of introducing biases, which are sometimes aggravated by a non-optimal configuration. In this paper we develop a novel visualization technique for the pooling method and integrate it into a demo application named Visual Pool. This demo application enables the user to interact with the pooling method with ease, and provides visual hints for analyzing existing test collections and building better ones.
Recent studies have reconsidered the way we operationalise the pooling method, by considering the practical limitations often encountered by test collection builders. The biggest constraint is often the budget available for relevance assessments and the question is how best – in terms of the lowest pool bias – to select the documents to be assessed given a fixed budget. Here, we explore a series of 3 new pooling strategies introduced in this paper against 3 existing ones and a baseline. We show that there are significant differences depending on the evaluation measure ultimately used to assess the runs. We conclude that adaptive strategies are always best, but in their absence, for top-heavy evaluation measures we can continue to use the baseline, while for P@100 we should use any of the other non-adaptive strategies.
The empirical nature of Information Retrieval (IR) mandates strong experimental practices. The Cranfield/TREC evaluation paradigm represents a keystone of such experimental practices. Within this paradigm, the generation of relevance judgments has been the subject of intense scientific investigation. This is because, on the one hand, consistent, precise and numerous judgments are key to reducing evaluation uncertainty and test collection bias; on the other hand, relevance judgments are costly to collect. The selection of which documents to judge for relevance (known as pooling) therefore has a great impact on IR evaluation. In this paper, we contribute a set of 8 novel pooling strategies based on retrieval fusion methods. We show that the choice of the pooling strategy has significant effects on the cost needed to obtain an unbiased test collection; we also identify the best performing pooling strategy according to three evaluation measures.
Pool bias is a well-understood problem of test-collection-based benchmarking in information retrieval. The pooling method itself is designed to identify all relevant documents. In practice, `all' translates to `as many as possible given some budgetary constraints' and the problem persists, albeit mitigated. Recently, methods to address this pool bias for previously created test collections have been proposed for the evaluation measure precision at cut-off (P@n). Analyzing previous methods, we make the empirical observation that the distribution of the probability of providing new relevant documents to the pool, over the runs, is log-normal (when the pooling strategy is fixed depth at cut-off). We use this observation to calculate a prior probability of providing new relevant documents, which we then use in a pool bias estimator that improves upon previous estimates of precision at cut-off. Through extensive experimental results, covering 15 test collections, we show that the proposed bias correction method is the new state of the art, providing the closest estimates yet when compared to the original pool.
In Information Retrieval, test collections are usually built using the pooling method, for which many pooling strategies have been developed. Herein, we address the question of identifying the best pooling strategy when evaluating systems using precision-oriented measures in the presence of budget constraints on the number of documents to be evaluated. As a quality measurement we use the bias introduced by the pooling strategy, measured both in terms of Mean Absolute Error of the scores and in terms of ranking errors. Based on experiments on 15 test collections, we conclude that, for precision-oriented measures, the best strategies are based on Rank-Biased Precision (RBP). These results can inform collection builders because they suggest that, under fixed assessment budget constraints, RBP-based sampling produces less biased pools than other alternatives.
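The two bias measurements named above can be sketched as follows. This is a minimal illustration, not the experimental code behind the abstract: the function names, the system names and the scores are hypothetical, and counting discordant system pairs is only one assumed reading of "ranking errors".

```python
from itertools import combinations
from typing import Dict


def mean_absolute_error(reference: Dict[str, float],
                        estimated: Dict[str, float]) -> float:
    """Mean absolute difference between reference and estimated scores."""
    return sum(abs(reference[s] - estimated[s]) for s in reference) / len(reference)


def discordant_pairs(reference: Dict[str, float],
                     estimated: Dict[str, float]) -> int:
    """Count system pairs whose relative order differs between the two scorings.

    Assumption: a "ranking error" is a pair of systems ordered one way by the
    reference scores and the other way by the estimated scores.
    """
    count = 0
    for a, b in combinations(sorted(reference), 2):
        if (reference[a] - reference[b]) * (estimated[a] - estimated[b]) < 0:
            count += 1
    return count


# Hypothetical scores: reference pool versus a budget-constrained pool.
reference_scores = {"s1": 0.50, "s2": 0.40, "s3": 0.30}
reduced_pool_scores = {"s1": 0.45, "s2": 0.40, "s3": 0.42}

mae = mean_absolute_error(reference_scores, reduced_pool_scores)
swaps = discordant_pairs(reference_scores, reduced_pool_scores)
print(f"MAE = {mae:.4f}, discordant pairs = {swaps}")
```

In this toy example only the pair (s2, s3) changes order, so a strategy can show a small score error and still produce a ranking error.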
Recently, it has been discovered that it is possible to mitigate the Pool Bias of Precision at cutoff (P@n) when used with the fixed-depth pooling strategy, by measuring the effect of the tested run against the pooled runs. In this paper we extend this analysis and test the existing methods on different pooling strategies, simulated on a selection of 12 TREC test collections. We observe how the different methodologies to correct the pool bias behave, and provide guidelines about which pooling strategy should be chosen.
The offline evaluation of Information Retrieval (IR) systems is performed through the use of test collections. A test collection, in its essence, is composed of a collection of documents, a set of topics, and a set of relevance assessments for each topic, derived from the collection of documents. Ideally, for each topic, all the documents of the test collection should be judged, but due to the size of document collections and their exponential growth over the years, this practice soon became impractical. Therefore, early in IR history, this problem was addressed through the use of the pooling method. The pooling method optimizes the relevance assessment process by pooling the documents retrieved by different search engines following a particular pooling strategy. The most common strategy consists of pooling the top d documents of each run. The pool is constructed from the systems taking part in the challenge for which the collection was made, at a specific point in time, after which the collection is generally frozen in terms of relevance judgments. This method leads to a bias called pool bias: documents that were not selected in the pool created from the original runs will never be considered relevant. This bias thereby affects the evaluation of any system that was not part of the pool, under any IR evaluation measure, making the comparison with pooled systems unfair.
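The depth-d strategy described above can be sketched in a few lines. This is a minimal illustration for a single topic; the function name, run names and document identifiers are hypothetical.

```python
from typing import Dict, List, Set


def depth_d_pool(runs: Dict[str, List[str]], d: int) -> Set[str]:
    """Return the union of the top-d documents of every run for one topic."""
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:d])
    return pool


# Hypothetical pooled runs: each list is a ranking, best document first.
pooled_runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d2", "d5", "d1", "d6"],
}
pool = depth_d_pool(pooled_runs, d=2)

# A run that did not contribute to the pool may retrieve unjudged documents.
# Under the frozen judgments these are treated as not relevant, which is the
# source of the pool bias.
new_run = ["d7", "d1", "d8"]
unjudged = [doc for doc in new_run[:2] if doc not in pool]
print(sorted(pool), unjudged)
```

Here the pool contains d1, d2 and d5, so d7 is never assessed even though the new run ranks it first.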
IR measures have evolved over the years and become more and more complex and difficult to interpret. Witnessing a need in industry for measures that `make sense', I focus on the problems of the two fundamental IR evaluation measures, Precision at cut-off P@n and Recall at cut-off R@n. There are two reasons to consider such `simple' metrics: first, they are cornerstones for many other developed metrics and, second, they are easy for all users to understand. This ``understandability'' of IR metrics has drawn moderate attention from our community recently. In the eyes of a practitioner, these two evaluation measures are interesting because they lead to more intuitive interpretations, such as how much time people spend reading useless documents (low precision), or how many relevant documents they are missing (low recall). But this last interpretation, because recall is inversely proportional to the number of relevant documents per topic, is very difficult to address when only a portion of the document collection is judged, as is the case with the pooling method. To tackle this problem, another kind of evaluation has been developed, based on measuring how accessible an IR system makes documents. Accessibility measures can be seen as a complementary evaluation to recall because they provide information on whether some relevant documents are not retrieved due to an unfairness in accessibility.
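The two measures, and the reason recall is hard to estimate from a partially judged collection, can be illustrated with a short sketch. The function names and the data are hypothetical, and returning 0.0 for a topic with no relevant documents is a convention assumed here.

```python
from typing import List, Set


def precision_at(ranking: List[str], relevant: Set[str], n: int) -> float:
    """P@n: fraction of the top-n retrieved documents that are relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for doc in ranking[:n] if doc in relevant)
    return hits / n


def recall_at(ranking: List[str], relevant: Set[str], n: int) -> float:
    """R@n: fraction of all relevant documents found in the top n."""
    if not relevant:
        return 0.0  # assumed convention for topics without relevant documents
    hits = sum(1 for doc in ranking[:n] if doc in relevant)
    return hits / len(relevant)


ranking = ["d1", "d7", "d2", "d9"]
all_relevant = {"d1", "d2", "d3"}   # hypothetical complete judgments
pooled_relevant = {"d1", "d2"}      # d3 was never pooled, so never judged

p_at_4 = precision_at(ranking, all_relevant, 4)
true_recall = recall_at(ranking, all_relevant, 4)
pooled_recall = recall_at(ranking, pooled_relevant, 4)
print(p_at_4, true_recall, pooled_recall)
```

Because the denominator of recall is the number of relevant documents, missing one of them in the judgments raises the estimate in this example from 2/3 to 1.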
The main goal of this Ph.D. is to increase the stability and reusability of existing test collections when systems are evaluated in terms of precision, recall, and accessibility. The outcomes will be: a novel estimator to tackle the pool bias issue for P@n and R@n; a comprehensive analysis of the effect of the estimator under varying pooling strategies; and finally, to support the evaluation of recall, an analytic approach to the evaluation of accessibility measures.
BM25 is probably the best-known term weighting model in Information Retrieval. Depending on the formula variant at hand, it has 2 or 3 parameters (k1, b, and k3). This paper addresses b, the document length normalization parameter. Based on the observation that the two cases previously discussed for length normalization (multi-topicality and verboseness) are actually three (multi-topicality, verboseness with word repetition (repetitiveness), and verboseness with synonyms), we propose and test a new length normalization method that removes the need for a b parameter in BM25. Testing the new method on a set of purposefully varied test collections, we observe that we can obtain results statistically indistinguishable from the optimal results, therefore removing the need for ground-truth-based optimization.
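To make the role of b concrete, the sketch below computes the term-frequency component of the commonly cited BM25 formulation. It is not the new normalization method proposed in the abstract, which is not specified here. The function name is hypothetical, the defaults k1=1.2 and b=0.75 are commonly used values rather than values taken from the paper, and the IDF factor and the k3 query-term component are omitted.

```python
def bm25_tf_component(tf: float, doc_len: float, avg_doc_len: float,
                      k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of the common BM25 formulation.

    b = 0 disables document length normalization; b = 1 applies it fully.
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Same term frequency in a short and a long document (hypothetical lengths).
short_doc = bm25_tf_component(tf=3, doc_len=100, avg_doc_len=200)
long_doc = bm25_tf_component(tf=3, doc_len=400, avg_doc_len=200)
no_norm_short = bm25_tf_component(tf=3, doc_len=100, avg_doc_len=200, b=0.0)
no_norm_long = bm25_tf_component(tf=3, doc_len=400, avg_doc_len=200, b=0.0)
print(short_doc, long_doc, no_norm_short, no_norm_long)
```

With b greater than 0 the longer document receives the lower weight for the same term frequency, whereas with b equal to 0 the two weights coincide. This is why b normally has to be tuned against relevance judgments.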
For many tasks in evaluation campaigns, especially those modeling narrow domain-specific challenges, lack of participation leads to a potential pooling bias due to the small number of pooled runs. It is well known that the reliability of a test collection is proportional to the number of topics and relevance assessments provided for each topic, but also to some extent to the diversity of participation in the challenges. Hence, in this paper we present a new perspective on reducing the pool bias by studying the effect of merging an unpooled run with the pooled runs. We also introduce an indicator used by the bias correction method to decide whether the correction needs to be applied or not. This indicator gives strong clues about the potential of a “good” run tested on an “unfriendly” test collection (i.e. a collection where the pool was contributed to by runs very different from the one at hand). We demonstrate the correctness of our method on a set of fifteen test collections from the Text REtrieval Conference (TREC). We observe a reduction in system ranking error and absolute score difference error.
Creating systematic reviews is a painstaking task undertaken especially in domains where experimental results are the primary method of knowledge creation. For the review authors, analysing documents to extract relevant data is a demanding activity. To support the creation of systematic reviews, we have created DASyR, a semi-automatic document analysis system. DASyR is our solution to annotating published papers for the purpose of ontology population. For domains where dictionaries do not exist or are inadequate, DASyR relies on a semi-automatic annotation bootstrapping method based on positional Random Indexing, followed by traditional Machine Learning algorithms to extend the annotation set. We provide an example of the method's application to a subdomain of Computer Science, the Information Retrieval evaluation domain. The reliance of this domain on large-scale experimental studies makes it a perfect domain to test on. We show the utility of DASyR through experimental results for different parameter values of the bootstrap procedure, evaluated in terms of annotator agreement, error rate, precision and recall.