A Systematic Approach to Normalization in Probabilistic Models

Journal article
Aldo Lipani, Thomas Roelleke, Mihai Lupu, Allan Hanbury
Publication year: 2018

Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the chosen TF quantification function and by its TF normalization. The first defines how independent the occurrences of multiple terms are, while the second mitigates the a priori probability of having a high term frequency in a document (an estimate usually based on document length). New test collections from different domains (e.g. medical, legal) give evidence that not only document length but also document verboseness should be explicitly considered. Therefore, we propose and investigate a systematic combination of document verboseness and length. To theoretically justify the combination, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections, across a well-defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. Our experiments demonstrate that the new models never underperform existing models and sometimes yield statistically significantly better results, at no additional computational cost.
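
As a rough illustration of the idea described in the abstract, the sketch below normalizes a raw term frequency by both document length and document verboseness. It is not the formulation from the article: the abstract gives no formulas, so everything here is an assumption. In particular, verboseness is assumed to mean the average number of occurrences per distinct term, each factor is assumed to use a pivoted form, and the two factors are assumed to combine by multiplication. The names `pivoted`, `normalized_tf`, `b_len` and `b_verb` are hypothetical.

```python
from collections import Counter


def pivoted(value, mean, b):
    """Pivoted normalization factor; equals 1 when value equals the mean.

    Assumed form, chosen for illustration only.
    """
    return (1.0 - b) + b * (value / mean)


def normalized_tf(term, doc_tokens, avg_length, avg_verboseness,
                  b_len=0.5, b_verb=0.5):
    """Raw term frequency divided by a length factor and a verboseness factor.

    Assumptions (not taken from the article):
      - verboseness = document length / number of distinct terms
      - the two pivoted factors are combined by multiplication
    """
    length = len(doc_tokens)
    if length == 0:
        return 0.0
    counts = Counter(doc_tokens)
    verboseness = length / len(counts)
    norm = (pivoted(length, avg_length, b_len)
            * pivoted(verboseness, avg_verboseness, b_verb))
    return counts[term] / norm


# A document at the collection averages keeps its raw term frequency.
doc = ["a", "a", "b", "c"]
tf_average_doc = normalized_tf("a", doc, avg_length=4, avg_verboseness=4 / 3)

# A document that repeats the same text is both longer and more verbose,
# so its term frequency is discounted by both factors.
tf_repeated_doc = normalized_tf("a", doc * 2, avg_length=4, avg_verboseness=4 / 3)
```

Under these assumptions, a document that is longer only because it covers more distinct terms is discounted by the length factor alone, whereas a document that is longer because it repeats itself is discounted by both factors.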

Spatio-temporal topsoil organic carbon mapping of a semi-arid Mediterranean region: The role of land use, soil texture, topographic indices and the influence of remote sensing data to modelling

Journal article
Calogero Schillaci, Marco Acutis, Luigi Lombardo, Aldo Lipani, Maria Fantappiè, Michael Märker, Sergio Saia
Science of The Total Environment, Volumes 601–602, Pages 821-832
Publication year: 2017

Soil organic carbon (SOC) is the most important indicator of soil fertility, and monitoring its changes in space and time is a prerequisite for establishing strategies to reduce soil loss and preserve soil quality. Here we modelled the topsoil (0–0.3 m) SOC concentration of the cultivated area of Sicily in 1993 and 2008. Sicily is an extremely variable region with a high number of ecosystems, soils, and microclimates. We studied the role of time and land use in the modelling of SOC, and assessed the role of remote sensing (RS) covariates in the boosted regression trees modelling. The models obtained showed a high pseudo-R² (0.63–0.69) and low uncertainty (s.d. < 0.76 g C kg⁻¹ with RS, and < 1.25 g C kg⁻¹ without RS). These outputs allowed us to depict the time variation of SOC at a resolution of 1 arcsec. SOC estimation strongly depended on soil texture, land use, rainfall and topographic indices related to erosion and deposition. RS indices captured one fifth of the total variance explained, slightly changed the ranking of variance explained by the non-RS predictors, and reduced the variability of the model replicates. During the study period, SOC decreased in the areas with relatively high initial SOC, and increased in the areas with high temperature and low rainfall, dominated by arable land. This was likely due to the compulsory application of some Good Agricultural and Environmental practices. These results confirm that the importance of texture and land use in short-term SOC variation is comparable to that of climate. The present results call for agronomic and policy intervention at the district level to maintain fertility and yield potential. In addition, the present results suggest that the application of RS covariates enhanced the modelling performance.