Recent studies have reconsidered the way we operationalise the pooling method, by considering the practical limitations often encountered by test collection builders. The biggest constraint is often the budget available for relevance assessments and the question is how best – in terms of the lowest pool bias – to select the documents to be assessed given a fixed budget. Here, we explore a series of 3 new pooling strategies introduced in this paper against 3 existing ones and a baseline. We show that there are significant differences depending on the evaluation measure ultimately used to assess the runs. We conclude that adaptive strategies are always best, but in their absence, for top-heavy evaluation measures we can continue to use the baseline, while for P@100 we should use any of the other non-adaptive strategies.
The empirical nature of Information Retrieval (IR) mandates strong experimental practices. The Cranfield/TREC evaluation paradigm represents a keystone of such experimental practices. Within this paradigm, the generation of relevance judgments has been the subject of intense scientific investigation. This is because, on one hand, consistent, precise and numerous judgements are key to reduce evaluation uncertainty and test collection bias; on the other hand, however , relevance judgements are costly to collect. The selection of which documents to judge for relevance (known as pooling) has therefore great impact in IR evaluation. In this paper, we contribute a set of 8 novel pooling strategies based on retrieval fusion methods. We show that the choice of the pooling strategy has significant effects on the cost needed to obtain an unbiased test collection; we also identify the best performing pooling strategy according to three evaluation measure.
Pool bias is a well understood problem of test-collection based benchmarking in information retrieval. The pooling method itself is designed to identify all relevant documents. In practice, `all' translates to `as many as possible given some budgetary constraints' and the problem persists, albeit mitigated. Recently, methods to address this pool bias for previously created test collections have been proposed, for the evaluation measure precision at cut-off (P@n). Analyzing previous methods, we make the empirical observation that the distribution of the probability of providing new relevant documents to the pool, over the runs, is log-normal (when the pooling strategy is fixed depth at cut-off). We use this observation to calculate a prior probability of providing new relevant documents, which we then use in a pool bias estimator that improves upon previous estimates of precision at cut-off. Through extensive experimental results, covering 15 test collections, we show that the proposed bias correction method is the new state of the art, providing the closest estimates yet when compared to the original pool.
In Information Retrieval, test collections are usually built using the pooling method. Many pooling strategies have been developed for the pooling method. Herein, we address the question of identifying the best pooling strategy when evaluating systems using precision-oriented measures in presence of budget constraints on the number of documents to be evaluated. As a quality measurement we use the bias introduced by the pooling strategy, measured both in terms of Mean Absolute Error of the scores and in terms of ranking errors. Based on experiments on 15 test collections, we conclude that, for precision-oriented measures, the best strategies are based on Rank-Biased Precision (RBP). These results can inform collection builders because they suggest that, under fixed assessment budget constraints, RBP-based sampling produces less biased pools than other alternatives.
Recently, it has been discovered that it is possible to mitigate the Pool Bias of Precision at cutoff (P@n) when used with the fixed-depth pooling strategy, by measuring the effect of the tested run against the pooled runs. In this paper we extend this analysis and test the existing methods on different pooling strategies, simulated on a selection of 12 TREC test collections. We observe how the different methodologies to correct the pool bias behave, and provide guidelines about which pooling strategy should be chosen.
The offline evaluation of Information Retrieval (IR) systems is performed through the use of test collections. A test collection, in its essence, is composed of: a collection of documents, a set of topics and, a set of relevance assessments for each topic, derived from the collection of documents. Ideally, for each topic, all the documents of the test collection should be judged, but due to the dimensions of the collections of documents, and their exponential growth over the years, this practice soon became impractical. Therefore, early in IR history, this problem has been addressed through the use of the pooling method. The pooling method consists of optimizing the relevance assessment process by pooling the documents retrieved by different search engines following a particular pooling strategy. The most common one consists on pooling the top d documents of each run. The pool is constructed from systems taking part in a challenge for which the collection was made, at a specific point in time, after which the collection is generally frozen in terms of relevance judgments. This method leads to a bias called pool bias, which is the effect that documents that were not selected in the pool created from the original runs will never be considered relevant. Thereby, this bias affects the evaluation of a system that has not been part of the pool, with any IR evaluation measures, making the comparison with pooled systems unfair.
IR measures have evolved over the years and become more and more complex and difficult to interpret. Witnessing a need in industry for measures that `make sense', I focus on the problematics of the two fundamental IR evaluation measures, Precision at cut-off P@n and Recall at cut-off R@n. There are two reasons to consider such `simple' metrics: first, they are cornerstones for many other developed metrics and, second, they are easy to understand by all users. This ``understandability'' of the IR metrics has drawn moderate attention from our community recently. To the eyes of a practitioner, these two evaluation measures are interesting because they lead to more intuitive interpretations like, how much time people are reading useless documents (low precision), or how many relevant documents they are missing (low recall). But this last interpretation, due to the fact that recall is inversely proportional to the number of relevant documents per topic, is very difficult to be addressed if to be judged is just a portion of the collection of documents, as it is done when using the pooling method. To tackle this problem, another kind of evaluation has been developed, based on measuring how much an IR system makes documents accessible. Accessibility measures can be seen as a complementary evaluation to recall because they provide information on whether some relevant documents are not retrieved due to an unfairness in accessibility.
The main goal of this Ph.D. is to increase the stability and reusability of existing test collections, when to be evaluated are systems in terms of precision, recall, and accessibility. The outcome will be: the development of a novel estimator to tackle the pool bias issue for P@n, and R@n, a comprehensive analysis of the effect of the estimator on varying pooling strategies, and finally, to support the evaluation of recall, an analytic approach to the evaluation of accessibility measures.
BM25 is probably the most well known term weighting model in Information Retrieval. It has, depending on the formula variant at hand, 2 or 3 parameters (k1, b, and k3). This paper addresses b-the document length normalization parameter. Based on the observation that the two cases previously discussed for length normalization (multi-topicality and verboseness) are actually three: multi-topicality, verboseness with word repetition (repetitiveness) and verboseness with synonyms, we propose and test a new length normalization method that removes the need for a b parameter in BM25. Testing the new method on a set of purposefully varied test collections, we observe that we can obtain results statistically indistinguishable from the optimal results, therefore removing the need for ground-truth based optimization.
For many tasks in evaluation campaigns, especially those modeling narrow domain-specific challenges, lack of participation leads to a potential pooling bias due to the scarce number of pooled runs. It is well known that the reliability of a test collection is proportional to the number of topics and relevance assessments provided for each topic, but also to same extent to the diversity in participation in the challenges. Hence, in this paper we present a new perspective in reducing the pool bias by studying the effect of merging an unpooled run with the pooled runs. We also introduce an indicator used by the bias correction method to decide whether the correction needs to be applied or not. This indicator gives strong clues about the potential of a “good” run tested on an “unfriendly” test collection (i.e. a collection where the pool was contributed to by runs very different from the one at hand). We demonstrate the correctness of our method on a set of fifteen test collections from the Text REtrieval Conference (TREC). We observe a reduction in system ranking error and absolute score difference error.
Creating systematic reviews is a painstaking task undertaken especially in domains where experimental results are the primary method to knowledge creation. For the review authors, analysing documents to extract relevant data is a demanding activity. To support the creation of systematic reviews, we have created DASyR—a semi-automatic document analysis system. DASyR is our solution to annotating published papers for the purpose of ontology population. For domains where dictionaries are not existing or inadequate, DASyR relies on a semi-automatic annotation bootstrapping method based on positional Random Indexing, followed by traditional Machine Learning algorithms to extend the annotation set. We provide an example of the method application to a subdomain of Computer Science, the Information Retrieval evaluation domain. The reliance of this domain on large scale experimental studies makes it a perfect domain to test on. We show the utility of DASyR through experimental results for different parameter values for the bootstrap procedure, evaluated in terms of annotator agreement, error rate, precision and recall.
We approach the problem of retrievability from an analytical perspective, starting with modeling conjunctive and disjunctive queries in a boolean model. We show that this represents an upper bound on retrievability for all other best match algorithms. We follow this with an observation of imbalance in the distribution of retrievability, using the Gini coefficient. Simulation-based experiments show the behavior of the Gini coefficient for retrievability under different types and lengths of queries, as well as different assumptions about the document length distribution in a collection.
The published scientific results should be reproducible, otherwise the scientific findings reported in the publications are less valued by the community. Several undertakings, like myExperiment, RunMyCode, or DIRECT, contribute to the availability of data, experiments, and algorithms. Some of these experiments and algorithms are even referenced or mentioned in later publications. Generally, research articles that present experimental results only summarize the used algorithms and data. In the better cases, the articles do refer to a web link where the code can be found. We give here an account of our experience with extracting the necessary data to possibly reproduce IR experiments. We also make considerations on automating this information extraction and storing the data as IR nanopublications which can later be queried and aggregated by automated processes, as the need arises.
Retrieval experiments produce plenty of data, like various experiment settings and experimental results, that are usually not all included in the published articles. Even if they are mentioned, they are not easily machine-readable. We propose the use of IR nanopublications to describe in a formal language such information. Furthermore, to support the unambiguous description of IR domain aspects, we present a preliminary IR ontology. The use of the IR nanopublications will facilitate the assessment and comparison of IR systems and enhance the degree of reproducibility and reliability of IR research progress.