On Biases in Information Retrieval Models and Evaluation

PhD Thesis
Aldo Lipani
Doctoral Dissertation
Publication year: 2018

The advent of modern information technology has benefited society as the digitisation of content has increased over the last half-century. While the processing capability of our species has remained unchanged, the information available to us has been growing notably. In this overload of information, Information Retrieval (IR) has played a prominent role by developing systems capable of separating relevant information from the rest. This separation, however, is a difficult task, rooted in the complexity of understanding what is and what is not relevant. To manage this complexity, IR has developed a strong empirical nature, which has led to empirically grounded retrieval models and, in turn, to retrieval systems designed to be biased towards relevant information. However, other biases have been observed that counteract retrieval performance. In this thesis, the reduction of retrieval systems to filters of information, or sampling processes, allows us to systematically investigate these biases.

We study biases manifesting in two aspects of IR research: retrieval models and retrieval evaluation. We start by identifying retrieval biases in probabilistic IR models and then develop new document priors to improve retrieval performance. Next, we discuss the accessibility bias of retrieval models, and for Boolean retrieval models we develop a mathematical framework of retrievability. For retrieval evaluation biases, we study how test collections are built using the pooling method and how this method introduces bias. Then, to improve the reliability of the evaluation, we first develop new pooling strategies to mitigate this bias at test collection build time and then, for two IR evaluation measures, Precision and Recall at cut-off (P@n and R@n), we develop new pool bias estimators to mitigate it at evaluation time.

Through large-scale experimentation involving up to 15 test collections, four IR evaluation measures and three bias measures, we demonstrate that including document priors based on verboseness improves the performance of probabilistic retrieval models; that the accessibility bias of Boolean retrieval models worsens quickly for conjunctive queries as query length increases (while improving slightly for disjunctive queries); that the test collection bias can be lowered at test collection build time by pooling strategies inspired by a well-known problem in reinforcement learning, the multi-armed bandit problem; and that this bias can also be mitigated at evaluation time by analysing the runs participating in the pool. For this last point in particular, we show that for P@n, bias reduction is achieved by quantifying the potential of the new system against the pooled runs, while for R@n it is achieved by simulating the absence of a pooled run from the set of pooled runs.
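
As an illustration of the bandit-inspired strategies, the sketch below shows a minimal epsilon-greedy pool builder in Python: each run is treated as an arm, a pull judges that run's next unjudged document, and the reward is whether the document turns out to be relevant. The judge oracle, the epsilon value and the overall simplification are assumptions for illustration only; this is not the exact set of strategies evaluated in the thesis.

    import random
    from collections import defaultdict

    def epsilon_greedy_pooling(runs, judge, budget, epsilon=0.1, seed=0):
        """Build a pool by treating each run as a bandit arm.

        runs   : dict run_id -> ranked list of doc_ids
        judge  : callable doc_id -> bool (relevance assessor, assumed available)
        budget : total number of documents to assess
        """
        rng = random.Random(seed)
        cursor = {r: 0 for r in runs}      # next unexamined rank per run
        reward = defaultdict(float)        # relevant docs contributed per run
        pulls = defaultdict(int)
        judged = {}                        # doc_id -> relevance
        while len(judged) < budget:
            candidates = [r for r in runs if cursor[r] < len(runs[r])]
            if not candidates:
                break
            if rng.random() < epsilon:     # explore a random run
                run = rng.choice(candidates)
            else:                          # exploit the best average reward so far
                run = max(candidates,
                          key=lambda r: reward[r] / pulls[r] if pulls[r] else float('inf'))
            # skip documents already judged via other runs
            while cursor[run] < len(runs[run]) and runs[run][cursor[run]] in judged:
                cursor[run] += 1
            if cursor[run] >= len(runs[run]):
                continue
            doc = runs[run][cursor[run]]
            cursor[run] += 1
            rel = judge(doc)
            judged[doc] = rel
            pulls[run] += 1
            reward[run] += 1.0 if rel else 0.0
        return judged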

This thesis contributes to the IR field by providing a better understanding of relevance through the lens of biases in retrieval models and retrieval evaluation. The identification of these biases, and their exploitation or mitigation, leads to better performing IR models and to improvements in current IR evaluation practice.

A Systematic Approach to Normalization in Probabilistic Models

Journal article
Aldo Lipani, Thomas Roelleke, Mihai Lupu, Allan Hanbury
Publication year: 2018

Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the chosen TF quantification function and by its TF normalization. The first defines how independent the occurrences of multiple terms are, while the second mitigates the a priori probability of observing a high term frequency in a document (an estimation usually based on the document length). New test collections, coming from different domains (e.g. medical, legal), give evidence that not only document length but also document verboseness should be explicitly considered. Therefore, we propose and investigate a systematic combination of document verboseness and length. To theoretically justify the combination, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections, across a well-defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. Our experiments demonstrate that the new models never underperform existing models, while sometimes yielding statistically significantly better results, at no additional computational cost.
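
As a rough illustration of the idea, the sketch below combines document length and verboseness (here taken as document length divided by the number of unique terms in the document) in a pivoted TF normalisation, followed by a BM25-style saturation. The mixing formula and the parameters b, lam and k1 are illustrative assumptions, not the normalisations derived in the paper.

    def normalised_tf(tf, dl, avgdl, verboseness, avg_verboseness, b=0.75, lam=0.5):
        """Pivoted TF normalisation mixing document length and verboseness.

        verboseness is assumed here to be dl / (number of unique terms in the doc);
        the linear mixing below is an illustrative choice.
        """
        pivot = (1.0 - lam) * dl / avgdl + lam * verboseness / avg_verboseness
        return tf / (1.0 - b + b * pivot)

    def bm25_quantification(tf_n, k1=1.2):
        """A standard BM25-style saturation applied on top of the normalised TF."""
        return (tf_n * (k1 + 1.0)) / (tf_n + k1)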

Visual Pool: A Tool to Visualize and Interact with the Pooling Method

Conference paper
Aldo Lipani, Mihai Lupu, Allan Hanbury
SIGIR '17
Publication year: 2017

Every year more than 25 test collections are built among the main Information Retrieval (IR) evaluation campaigns. They are extremely important in IR because they become the evaluation praxis for the forthcoming years. Test collections are built mostly using the pooling method. The main advantage of this method is that it drastically reduces the number of documents to be judged. It does so at the cost of introducing biases, which are sometimes aggravated by non optimal configuration. In this paper we develop a novel visualization technique for the pooling method, and integrate it in a demo application named Visual Pool. This demo application enables the user to interact with the pooling method with ease, and develops visual hints in order to analyze existing test collections, and build better ones.
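
For reference, the classic fixed-depth pooling strategy that the tool visualises can be sketched in a few lines of Python (a minimal sketch, independent of the tool itself):

    def depth_d_pool(runs, d=100):
        """Fixed-depth pooling: the union of the top-d documents of every run.

        runs : dict run_id -> ranked list of doc_ids
        """
        pool = set()
        for ranking in runs.values():
            pool.update(ranking[:d])
        return pool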

Spatio-temporal topsoil organic carbon mapping of a semi-arid Mediterranean region: The role of land use, soil texture, topographic indices and the influence of remote sensing data to modelling

Journal article
Calogero Schillaci, Marco Acutis, Luigi Lombardo, Aldo Lipani, Maria Fantappiè, Michael Märker, Sergio Saia
Science of The Total Environment, Volumes 601–602, Pages 821–832
Publication year: 2017

Soil organic carbon (SOC) is the most important indicator of soil fertility, and monitoring its space-time changes is a prerequisite to establish strategies to reduce soil loss and preserve its quality. Here we modelled the topsoil (0–0.3 m) SOC concentration of the cultivated area of Sicily in 1993 and 2008. Sicily is an extremely variable region with a high number of ecosystems, soils, and microclimates. We studied the role of time and land use in the modelling of SOC, and assessed the role of remote sensing (RS) covariates in boosted regression tree modelling. The models obtained showed a high pseudo-R2 (0.63–0.69) and low uncertainty (s.d. < 0.76 g C kg−1 with RS, and < 1.25 g C kg−1 without RS). These outputs allowed us to depict the time variation of SOC at 1 arcsec resolution. SOC estimation strongly depended on soil texture, land use, rainfall and topographic indices related to erosion and deposition. RS indices captured one fifth of the total variance explained, slightly changed the ranking of the variance explained by the non-RS predictors, and reduced the variability of the model replicates. During the study period, SOC decreased in the areas with relatively high initial SOC, and increased in the areas with high temperature and low rainfall, dominated by arable land. This was likely due to the compulsory application of some Good Agricultural and Environmental practices. These results confirm that the importance of texture and land use in short-term SOC variation is comparable to that of climate. The present results call for agronomic and policy intervention at the district level to maintain fertility and yield potential. In addition, they suggest that the application of RS covariates enhanced the modelling performance.
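
A hedged sketch of the modelling step, using scikit-learn's gradient boosting as a stand-in for the boosted regression trees used in the paper; the file and column names are hypothetical and only illustrate the kind of covariates involved.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    # Hypothetical input table of soil samples; the paper's predictors include
    # land use, soil texture, rainfall, topographic indices and RS indices.
    df = pd.read_csv("soc_samples.csv")
    X = pd.get_dummies(df[["land_use", "clay", "sand", "rainfall", "slope", "twi", "ndvi"]],
                       columns=["land_use"])   # one-hot encode the categorical covariate
    y = df["soc_g_kg"]

    brt = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01,
                                    max_depth=3, subsample=0.5, random_state=0)
    print("10-fold CV R2:", cross_val_score(brt, X, y, cv=10, scoring="r2").mean())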

Fixed-Cost Pooling Strategies Based on IR Evaluation Measures

Conference paper
Aldo Lipani, Mihai Lupu, Joao Palotti, Florina Piroi, Guido Zuccon, Allan Hanbury
ECIR '17
Publication year: 2017

Recent studies have reconsidered the way we operationalise the pooling method, taking into account the practical limitations often encountered by test collection builders. The biggest constraint is often the budget available for relevance assessments, and the question is how best – in terms of the lowest pool bias – to select the documents to be assessed given a fixed budget. Here, we compare 3 new pooling strategies introduced in this paper against 3 existing ones and a baseline. We show that there are significant differences depending on the evaluation measure ultimately used to assess the runs. We conclude that adaptive strategies are always best but, in their absence, for top-heavy evaluation measures we can continue to use the baseline, while for P@100 we should use any of the other non-adaptive strategies.
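
As an example of what a non-adaptive fixed-budget strategy looks like, the sketch below pools documents round-robin over the runs, rank by rank, until the assessment budget is exhausted. This is offered as an illustration only; it is not necessarily one of the strategies evaluated in the paper.

    def round_robin_pool(runs, budget):
        """Fixed-budget, non-adaptive pooling: take documents rank by rank,
        cycling over the runs, until `budget` distinct documents are collected.

        runs   : dict run_id -> ranked list of doc_ids
        budget : maximum number of distinct documents to judge
        """
        pool, rank = set(), 0
        max_len = max(len(r) for r in runs.values())
        while len(pool) < budget and rank < max_len:
            for ranking in runs.values():
                if rank < len(ranking):
                    pool.add(ranking[rank])
                    if len(pool) >= budget:
                        break
            rank += 1
        return pool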

Fixed Budget Pooling Strategies based on Fusion Methods

Conference paper
Aldo Lipani, Mihai Lupu, Joao Palotti, Guido Zuccon, Allan Hanbury
Publication year: 2017

The empirical nature of Information Retrieval (IR) mandates strong experimental practices. The Cranfield/TREC evaluation paradigm represents a keystone of such experimental practices. Within this paradigm, the generation of relevance judgements has been the subject of intense scientific investigation. This is because, on one hand, consistent, precise and numerous judgements are key to reducing evaluation uncertainty and test collection bias; on the other hand, however, relevance judgements are costly to collect. The selection of which documents to judge for relevance (known as pooling) therefore has a great impact on IR evaluation. In this paper, we contribute a set of 8 novel pooling strategies based on retrieval fusion methods. We show that the choice of the pooling strategy has significant effects on the cost needed to obtain an unbiased test collection; we also identify the best performing pooling strategy according to three evaluation measures.
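
The following sketch shows one possible fusion-based pooling strategy: the runs are fused with a CombSUM over reciprocal-rank scores, and the top fused documents are judged up to the budget. It is an illustrative assumption of how such strategies can be built, not one of the 8 strategies contributed in the paper.

    from collections import defaultdict

    def combsum_pool(runs, budget):
        """Fusion-based pooling sketch: fuse the runs by summing a rank-based
        score per document, then judge the top `budget` fused documents.

        runs : dict run_id -> ranked list of doc_ids
        """
        fused = defaultdict(float)
        for ranking in runs.values():
            for rank, doc in enumerate(ranking, start=1):
                fused[doc] += 1.0 / rank      # reciprocal-rank score as a stand-in
        return sorted(fused, key=fused.get, reverse=True)[:budget]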

The Solitude of Relevant Documents in The Pool

Conference paper
Aldo Lipani, Mihai Lupu, Evangelos Kanoulas, Allan Hanbury
CIKM '16 Proceedings of the 25th ACM International Conference on Information and Knowledge Management, October 2016, Pages 1989-1992
Publication year: 2016

Pool bias is a well-understood problem of test-collection-based benchmarking in information retrieval. The pooling method itself is designed to identify all relevant documents. In practice, 'all' translates to 'as many as possible given some budgetary constraints' and the problem persists, albeit mitigated. Recently, methods to address this pool bias for previously created test collections have been proposed for the evaluation measure precision at cut-off (P@n). Analyzing previous methods, we make the empirical observation that the distribution, over the runs, of the probability of providing new relevant documents to the pool is log-normal (when the pooling strategy is fixed-depth at cut-off). We use this observation to calculate a prior probability of providing new relevant documents, which we then use in a pool bias estimator that improves upon previous estimates of precision at cut-off. Through extensive experimental results covering 15 test collections, we show that the proposed bias correction method is the new state of the art, providing the closest estimates yet when compared to the original pool.
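
The fitting step behind such a prior can be sketched as follows; the estimator itself is more involved, and the input statistics (per-run counts of newly contributed relevant documents and of pooled documents) are assumed to be available.

    import numpy as np
    from scipy import stats

    def lognormal_prior(new_relevant_counts, pool_contributions):
        """Fit a log-normal to the per-run probability of contributing new
        relevant documents to the pool, and return its mean as a prior.

        new_relevant_counts : per run, relevant documents only that run contributed
        pool_contributions  : per run, documents that run contributed to the pool
        """
        found = np.array(new_relevant_counts, dtype=float)
        contrib = np.array(pool_contributions, dtype=float)
        p = found[contrib > 0] / contrib[contrib > 0]
        p = p[p > 0]                               # log-normal support is (0, inf)
        shape, loc, scale = stats.lognorm.fit(p, floc=0)
        return stats.lognorm.mean(shape, loc=loc, scale=scale)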

The Impact of Fixed-Cost Pooling Strategies on Test Collection Bias

Conference paper
Aldo Lipani, Guido Zuccon, Mihai Lupu, Bevan Koopman, Allan Hanbury
ICTIR '16 Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, September 2016, Pages 105-108
Publication year: 2016

In Information Retrieval, test collections are usually built using the pooling method, for which many pooling strategies have been developed. Herein, we address the question of identifying the best pooling strategy when evaluating systems using precision-oriented measures in the presence of budget constraints on the number of documents to be evaluated. As a quality measure we use the bias introduced by the pooling strategy, measured both in terms of the Mean Absolute Error of the scores and in terms of ranking errors. Based on experiments on 15 test collections, we conclude that, for precision-oriented measures, the best strategies are based on Rank-Biased Precision (RBP). These results can inform collection builders because they suggest that, under fixed assessment budget constraints, RBP-based sampling produces less biased pools than other alternatives.
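
A minimal sketch of an RBP-inspired fixed-budget pooling strategy: each document is weighted by the sum, over the runs, of the RBP weight (1 - p) * p**(rank - 1) at which it is retrieved, and the highest-weighted documents are judged first. The deterministic selection and the persistence parameter p = 0.8 are illustrative assumptions, not necessarily the exact strategies evaluated in the paper.

    from collections import defaultdict

    def rbp_based_pool(runs, budget, p=0.8):
        """Weight each document by its accumulated RBP weight across runs and
        return the `budget` documents with the largest weight for judging."""
        weight = defaultdict(float)
        for ranking in runs.values():
            for rank, doc in enumerate(ranking, start=1):
                weight[doc] += (1.0 - p) * p ** (rank - 1)
        return sorted(weight, key=weight.get, reverse=True)[:budget]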

The Curious Incidence of Bias Corrections in the Pool

Conference paper
Aldo Lipani, Mihai Lupu, Allan Hanbury
Advances in Information Retrieval: 38th European Conference on IR Research, ECIR 2016, Padua, Italy, LNCS Volume 9626
Publication year: 2016

Recently, it has been discovered that it is possible to mitigate the pool bias of precision at cut-off (P@n), when used with the fixed-depth pooling strategy, by measuring the effect of the tested run against the pooled runs. In this paper we extend this analysis and test the existing methods on different pooling strategies, simulated on a selection of 12 TREC test collections. We observe how the different methodologies to correct the pool bias behave, and provide guidelines about which pooling strategy should be chosen.
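
The kind of simulation used in such analyses can be sketched as a leave-one-run-out experiment: rebuild the pool without the tested run, discard the judgements of documents outside the reduced pool, and compare the scores. This is a generic sketch under assumed inputs, not the exact experimental protocol of the paper.

    def leave_one_out_bias(runs, qrels, pool_fn, measure):
        """Estimate the pool bias of a pooling strategy for each run.

        runs    : dict run_id -> ranked list of doc_ids
        qrels   : dict doc_id -> 1/0 relevance (full-pool judgements)
        pool_fn : callable(runs) -> set of pooled doc_ids
        measure : callable(ranking, qrels) -> float (e.g. P@10)
        """
        biases = {}
        for left_out in runs:
            reduced_pool = pool_fn({r: d for r, d in runs.items() if r != left_out})
            reduced_qrels = {doc: rel for doc, rel in qrels.items() if doc in reduced_pool}
            full_score = measure(runs[left_out], qrels)
            reduced_score = measure(runs[left_out], reduced_qrels)
            biases[left_out] = full_score - reduced_score
        return biases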

Fairness in Information Retrieval

Conference paper
Aldo Lipani
ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16), ACM, New York, NY, USA, 2016
Publication year: 2016

The offline evaluation of Information Retrieval (IR) systems is performed through the use of test collections. A test collection, in its essence, is composed of: a collection of documents, a set of topics, and a set of relevance assessments for each topic, derived from the collection of documents. Ideally, for each topic, all the documents of the test collection should be judged, but due to the size of document collections, and their exponential growth over the years, this practice soon became impractical. Therefore, early in IR history, this problem was addressed through the use of the pooling method. The pooling method optimizes the relevance assessment process by pooling the documents retrieved by different search engines following a particular pooling strategy. The most common strategy consists of pooling the top d documents of each run. The pool is constructed from systems taking part in a challenge for which the collection was made, at a specific point in time, after which the collection is generally frozen in terms of relevance judgments. This method leads to a bias called pool bias: documents that were not selected in the pool created from the original runs will never be considered relevant. This bias affects the evaluation, with any IR evaluation measure, of a system that has not been part of the pool, making the comparison with pooled systems unfair.
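
The effect can be made concrete with the standard frozen-judgement convention, under which an unjudged (never pooled) document is scored as non-relevant; a minimal sketch for P@n:

    def precision_at_n(ranking, qrels, n=10):
        """P@n under frozen judgements: qrels maps judged doc_ids to 0/1, and a
        document missing from qrels (i.e. never pooled) counts as non-relevant."""
        return sum(qrels.get(doc, 0) for doc in ranking[:n]) / n

A new system retrieving relevant but unjudged documents is thus penalised exactly as if those documents were non-relevant.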

IR measures have evolved over the years and have become more and more complex and difficult to interpret. Witnessing a need in industry for measures that 'make sense', I focus on the problems of the two fundamental IR evaluation measures, Precision at cut-off (P@n) and Recall at cut-off (R@n). There are two reasons to consider such 'simple' metrics: first, they are cornerstones for many other metrics and, second, they are easy for all users to understand. This 'understandability' of IR metrics has drawn moderate attention from our community recently. To the eyes of a practitioner, these two evaluation measures are interesting because they lead to more intuitive interpretations, such as how much time people spend reading useless documents (low precision), or how many relevant documents they are missing (low recall). But this last interpretation is very difficult to address, because recall is inversely proportional to the number of relevant documents per topic, when only a portion of the collection of documents is judged, as is the case with the pooling method. To tackle this problem, another kind of evaluation has been developed, based on measuring how accessible an IR system makes documents. Accessibility measures can be seen as complementary to recall because they provide information on whether some relevant documents are not retrieved due to an unfairness in accessibility.
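
A minimal sketch of R@n computed from pooled judgements, which makes the denominator problem explicit: under pooling, the number of known relevant documents is only a lower bound on the true number of relevant documents per topic.

    def recall_at_n(ranking, qrels, n=100):
        """R@n from pooled judgements: the denominator is the number of *known*
        relevant documents, a lower bound on the true number per topic."""
        known_relevant = sum(qrels.values())
        if known_relevant == 0:
            return 0.0
        return sum(qrels.get(doc, 0) for doc in ranking[:n]) / known_relevant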

The main goal of this Ph.D. is to increase the stability and reusability of existing test collections when systems are evaluated in terms of precision, recall, and accessibility. The outcome will be: the development of a novel estimator to tackle the pool bias issue for P@n and R@n; a comprehensive analysis of the effect of the estimator under varying pooling strategies; and finally, to support the evaluation of recall, an analytic approach to the evaluation of accessibility measures.