
IIR

Towards an Information Retrieval Evaluation Library

Discussion Paper

Elias Bassani

Consorzio per il Trasferimento Tecnologico - C2T, Milan, Italy
University of Milano-Bicocca, Milan, Italy

2022


This manuscript discusses our ongoing work on ranx, a Python evaluation library for Information Retrieval. First, we introduce our work, summarize the already available functionalities, show the user-friendly nature of our tool through code snippets, and briefly discuss the technologies we relied on for the implementation and their advantages. Then, we present the upcoming features, such as several Metasearch algorithms, and introduce the long-term goals of our project.

Keywords: Information Retrieval, Evaluation, Comparison, Metasearch, Fusion

1. Introduction

2. Overview

In this section, we present the main functionalities ranx provides, show its user-friendly nature through some code snippets, and discuss its implementation and the advantages brought by the employed technologies. More details and examples are available in the official repository.

2.1. Qrels and Run

First, ranx provides a convenient way of managing the data needed for evaluating and comparing different retrieval models: the query relevance judgments (qrels) and the ranked lists of documents retrieved for those queries by the systems (runs). ranx implements two custom classes for these kinds of data: Qrels and Run. In particular, data can be loaded from Python dictionaries and Pandas DataFrames [19] or read from TREC-style files and JSON files. Moreover, ranx integrates seamlessly with ir-datasets [20], allowing users to load qrels for several Information Retrieval datasets, such as those from TREC’s challenges², BEIR [21], and MS MARCO [22]. Figure 1 shows the standard way of creating Qrels and Run instances. ranx takes care of sorting the result lists so that the user does not have to think about it. To learn more about Qrels and Run, we invite the reader to follow our online Jupyter Notebook³.
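As a minimal, library-independent sketch of the data described above, qrels and runs can be pictured as nested dictionaries keyed by query ID. The dictionaries and the `sort_run` helper below are hypothetical illustrations of the data shape and of the sorting step; they are not the Qrels and Run classes themselves nor their implementation.

```python
# Hypothetical illustration with plain dictionaries (not the ranx classes).

# Query relevance judgments: query ID -> {document ID -> relevance grade}
qrels_dict = {
    "q_1": {"d_12": 5, "d_25": 3},
    "q_2": {"d_11": 6, "d_22": 1},
}

# Run: query ID -> {document ID -> retrieval score}, in arbitrary order
run_dict = {
    "q_1": {"d_25": 0.7, "d_12": 0.9, "d_36": 0.6, "d_23": 0.8},
    "q_2": {"d_22": 0.5, "d_12": 0.9, "d_25": 0.7, "d_11": 0.8},
}


def sort_run(run):
    """Return a copy of `run` with each result list ordered by descending score."""
    return {
        q_id: dict(sorted(docs.items(), key=lambda item: item[1], reverse=True))
        for q_id, docs in run.items()
    }


sorted_run = sort_run(run_dict)
for q_id, docs in sorted_run.items():
    print(q_id, list(docs))
```

Keeping both structures keyed by query ID makes it straightforward to pair each ranked list with the judgments for the same query during evaluation.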

2.2. Metrics, Evaluation, and Comparison

ranx provides the most commonly used ranking evaluation metrics⁴, such as Reciprocal Rank, Average Precision, and Normalized Discounted Cumulative Gain [3]. These metrics can be used to evaluate a run in a single line of code, as depicted in Figure 2. As the figure shows, ranx allows the user to provide one or multiple metrics and define cut-offs using a convenient syntax. Additional information can be found online⁵.
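To make the idea of metric names with cut-offs concrete, the following plain-Python sketch scores one ranked list with Reciprocal Rank and Normalized Discounted Cumulative Gain. The `name@k` spelling, the function names, and the linear-gain nDCG formulation are assumptions made for illustration; the sketch does not reproduce the library's code.

```python
import math


def parse_metric(spec):
    """Split a spec such as 'ndcg@10' into ('ndcg', 10); no cut-off gives None."""
    name, _, cutoff = spec.partition("@")
    return name, int(cutoff) if cutoff else None


def reciprocal_rank(ranked_docs, judgments, k=None):
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_docs[:k], start=1):
        if judgments.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_docs, judgments, k=None):
    """DCG of the ranking divided by the DCG of the ideal ranking."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_docs[:k]]
    ideal_dcg = dcg(sorted(judgments.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


METRICS = {"mrr": reciprocal_rank, "ndcg": ndcg}


def evaluate_query(ranked_docs, judgments, specs):
    """Score one ranked list with every metric spec, e.g. ['mrr', 'ndcg@2']."""
    results = {}
    for spec in specs:
        name, k = parse_metric(spec)
        results[spec] = METRICS[name](ranked_docs, judgments, k)
    return results


ranking = ["d_23", "d_12", "d_25"]   # already sorted by descending score
judgments = {"d_12": 2, "d_25": 1}   # graded relevance for this query
scores = evaluate_query(ranking, judgments, ["mrr", "ndcg@2"])
print(scores)
```

In this toy case the first relevant document sits at rank 2, so the Reciprocal Rank is 0.5; averaging such per-query values over all queries yields the usual collection-level scores.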

ranx also offers functionalities to compare runs and perform statistical tests. As shown in Figure 3, by providing the query relevance judgments and a list of runs and defining the desired metrics, the compare function performs a comparison of the runs. It returns a Report instance, which stores the information produced by the compare function and can be printed as in Figure 3 or exported as a LaTeX table, ready for a scientific publication. The code underlying Table 1 was generated by ranx. To learn more about comparing different runs, we invite the reader to follow our online Jupyter Notebook⁶.

² https://trec.nist.gov
³ https://colab.research.google.com/github/AmenRa/ranx/blob/master/notebooks/2_qrels_and_run.ipynb
⁴ A complete list of the implemented metrics can be found here: https://github.com/AmenRa/ranx#metrics
⁵ https://colab.research.google.com/github/AmenRa/ranx/blob/master/notebooks/3_evaluation.ipynb
⁶ https://colab.research.google.com/github/AmenRa/ranx/blob/master/notebooks/4_comparison_and_report.ipynb
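The comparison step can be sketched in plain Python as follows. The text does not state which statistical test is used, so the two-sided paired randomization test below is an assumption chosen for illustration, and `randomization_test`, `compare_runs`, and the `max_p` threshold are hypothetical names rather than the library's API.

```python
import random
from statistics import mean


def randomization_test(scores_a, scores_b, n_permutations=1000, seed=42):
    """Two-sided paired randomization test on per-query scores; returns a p-value."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Flipping the sign of a difference swaps the two systems for that query.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


def compare_runs(per_query_scores, max_p=0.05):
    """Map each run name to its mean score and the runs it significantly beats.

    `per_query_scores` maps a run name to per-query metric values, with the
    same query order in every list.
    """
    report = {
        name: {"mean": mean(values), "better_than": []}
        for name, values in per_query_scores.items()
    }
    for a, scores_a in per_query_scores.items():
        for b, scores_b in per_query_scores.items():
            if a != b and report[a]["mean"] > report[b]["mean"]:
                if randomization_test(scores_a, scores_b) <= max_p:
                    report[a]["better_than"].append(b)
    return report


per_query = {
    "run_a": [0.9, 0.8, 0.7, 0.9, 0.8, 0.6, 0.9, 0.7, 0.8, 0.9, 0.7, 0.8],
    "run_b": [0.5, 0.4, 0.3, 0.5, 0.4, 0.2, 0.5, 0.3, 0.4, 0.5, 0.3, 0.4],
}
report = compare_runs(per_query)
print(report)
```

A report of this shape holds everything needed to render a results table in which each score is annotated with the runs it significantly outperforms.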

2.3. Backend

In addition to its user-friendly interface, ranx is also very efficient thanks to its Numba-based implementation. Numba [14] is a just-in-time [15] compiler for Python and NumPy [16, 17, 18] that translates and compiles for-loop-based code to high-speed vector operations and allows for automatic parallelization, which is very handy on modern multi-core CPUs. Almost every operation performed by ranx relies on Numba-compiled code. The internal data structures used by Qrels and Run and all the evaluation metrics provided by ranx are built on top of Numba. Our implementation allows for conducting evaluations and comparisons much faster than other popular Python evaluation libraries for Information Retrieval. Table 2 reports the execution time of different metrics in ranx and pytrec_eval, a Python wrapper for trec_eval, the standard Information Retrieval evaluation library.
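The kind of loop-based code that Numba can compile and parallelize may look like the sketch below, which computes precision at a cut-off for every query. The function name and the dense relevance-matrix layout are assumptions for illustration only and do not reflect the library's internal data structures; the fallback lets the sketch run as ordinary Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # assumption: fall back to plain Python without Numba
    prange = range

    def njit(*args, **kwargs):
        return lambda fn: fn


@njit(parallel=True)
def precision_at_k(relevance, k):
    """Precision@k per query.

    relevance[q, i] > 0 means the i-th ranked result of query q is relevant.
    """
    n_queries = relevance.shape[0]
    scores = np.zeros(n_queries)
    # Queries are independent, so the outer loop can be run in parallel.
    for q in prange(n_queries):
        hits = 0
        for i in range(k):
            if relevance[q, i] > 0:
                hits += 1
        scores[q] = hits / k
    return scores


relevance = np.array([[1, 0, 1, 0],
                      [0, 0, 0, 1]])
per_query_precision = precision_at_k(relevance, 2)
print(per_query_precision)
```

The first call includes compilation time when Numba is present; later calls with the same argument types reuse the compiled machine code.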

3. Upcoming Features

We are currently implementing several Metasearch [23] algorithms, such as comb_min [24], comb_max [24], comb_med [24], comb_anz [24], comb_mnz [24], comb_sum [24], comb_gmnz [25], RRF [26], MAPFuse [27], ISR [28], Log_ISR [28], LogN_ISR [28], and many more. Our goal is to offer a Python implementation of all those methods with a standardized interface. We also want these working, easy-to-use implementations to serve as baselines for researchers working on Metasearch algorithms. Moreover, we argue that young researchers in the Deep Learning-based Information Retrieval era have little knowledge of Metasearch methods, as they often rely on the weighted sum to fuse lexical matching scores, such as those computed by BM25 [29], and semantic matching scores computed by Transformer-based [30] rankers [31]. We hope that our work can stimulate researchers to explore different fusion approaches. As many Metasearch algorithms require tuning, we are also working on an auto-tune functionality that takes care of trying different hyper-parameter configurations and finding the best-performing one with no user effort.
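To give a flavor of the fusion methods named above, the following plain-Python sketch implements comb_sum, comb_mnz, RRF, and a weighted sum with a small grid search over its weight. It is an illustrative sketch under stated assumptions: min-max score normalization, the RRF constant k=60, the grid-search procedure, and all function names are choices made here and do not describe the upcoming implementation.

```python
from collections import defaultdict


def min_max_norm(results):
    """Rescale the scores of one result list to [0, 1]."""
    lo, hi = min(results.values()), max(results.values())
    if hi == lo:
        return {doc: 1.0 for doc in results}
    return {doc: (s - lo) / (hi - lo) for doc, s in results.items()}


def weighted_sum(result_lists, weights):
    """Sum of normalized scores, each list scaled by its weight."""
    fused = defaultdict(float)
    for results, w in zip(result_lists, weights):
        for doc, score in min_max_norm(results).items():
            fused[doc] += w * score
    return dict(fused)


def comb_sum(result_lists):
    """comb_sum: sum of the normalized scores a document received."""
    return weighted_sum(result_lists, [1.0] * len(result_lists))


def comb_mnz(result_lists):
    """comb_mnz: comb_sum times the number of lists containing the document."""
    return {
        doc: s * sum(doc in results for results in result_lists)
        for doc, s in comb_sum(result_lists).items()
    }


def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion: sum of 1 / (k + rank) over the input lists."""
    fused = defaultdict(float)
    for results in result_lists:
        ranked = sorted(results, key=results.get, reverse=True)
        for rank, doc in enumerate(ranked, start=1):
            fused[doc] += 1.0 / (k + rank)
    return dict(fused)


def reciprocal_rank(fused, relevant):
    """1 / rank of the first relevant document in the fused ranking, else 0."""
    ranked = sorted(fused, key=fused.get, reverse=True)
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def tune_weight(lexical, semantic, relevant, steps=11):
    """Try evenly spaced weights w in [0, 1] for w*lexical + (1-w)*semantic."""
    best_w, best_score = 0.0, -1.0
    for i in range(steps):
        w = i / (steps - 1)
        fused = weighted_sum([lexical, semantic], [w, 1 - w])
        score = reciprocal_rank(fused, relevant)
        if score > best_score:  # keep the first weight reaching the best score
            best_w, best_score = w, score
    return best_w, best_score


bm25 = {"d1": 12.0, "d2": 9.0, "d3": 3.0}      # toy lexical scores
dense = {"d2": 0.82, "d3": 0.80, "d4": 0.40}   # toy semantic scores
print(comb_sum([bm25, dense]))
print(comb_mnz([bm25, dense]))
print(rrf([bm25, dense]))
print(tune_weight(bm25, dense, relevant={"d1"}))
```

In practice the tuning step would optimize a metric averaged over a set of validation queries rather than a single toy query, but the search loop keeps the same shape.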

4. Conclusion and Long-term Goals

To conclude our discussion, we introduce the long-term goals of our library. Besides adding more metrics and other Metasearch methods, we plan to build a companion repository for storing runs of state-of-the-art models accompanied by rich metadata for searching and indexing. By integrating this online repository with ranx, we aim to allow researchers to download pre-computed runs and compare the results of their models with those of state-of-the-art approaches in just a few seconds. We think such functionality could help accelerate research in Information Retrieval, allowing researchers to rapidly find appropriate baselines and avoiding time-consuming and error-prone tasks entirely, such as re-implementing or re-training complex retrieval models from scratch. Moreover, sharing runs of state-of-the-art models could promote virtuous behaviors and transparency and reduce electricity consumption and pollution.

References

[1] D. Harman, Information Retrieval Evaluation, Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers, 2011.
[2] M. Sanderson, Test collection based evaluation of information retrieval systems, Found. Trends Inf. Retr. 4 (2010) 247–375.
[3] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst. 20 (2002) 422–446.
[4] E. Voorhees, D. Harman, Experiment and evaluation in information retrieval, 2005.
[5] C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using pyterrier, in: ICTIR, ACM, 2020, pp. 161–168.
[6] C. Macdonald, N. Tonellotto, S. MacAvaney, I. Ounis, Pyterrier: Declarative experimentation in python from BM25 to dense retrieval, in: CIKM, ACM, 2021, pp. 4526–4533.
[7] C. V. Gysel, M. de Rijke, Pytrec_eval: An extremely fast python interface to trec_eval, in: SIGIR, ACM, 2018, pp. 873–876.
[8] J. R. M. Palotti, H. Scells, G. Zuccon, Trectools: an open-source python library for information retrieval practitioners involved in trec-like campaigns, in: SIGIR, ACM, 2019, pp. 1325–1328.
[9] T. Breuer, N. Ferro, M. Maistro, P. Schaer, repro_eval: A python interface to reproducibility measures of system-oriented IR experiments, in: ECIR (2), volume 12657 of Lecture Notes in Computer Science, Springer, 2021, pp. 481–486.
[10] C. Lucchese, C. I. Muntean, F. M. Nardini, R. Perego, S. Trani, Rankeval: Evaluation and investigation of ranking models, SoftwareX 12 (2020) 100614.
[11] C. Lucchese, C. I. Muntean, F. M. Nardini, R. Perego, S. Trani, Rankeval: An evaluation and analysis framework for learning-to-rank solutions, in: SIGIR, ACM, 2017, pp. 1281–1284.
[12] E. Bassani, ranx: A blazing-fast python library for ranking evaluation and comparison, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part II, volume 13186 of Lecture Notes in Computer Science, Springer, 2022, pp. 259–264. URL: https://doi.org/10.1007/978-3-030-99739-7_30. doi:10.1007/978-3-030-99739-7_30.
[13] C. Abras, D. Maloney-Krichmar, J. Preece, et al., User-centered design, Bainbridge, W. Encyclopedia of Human-Computer Interaction. Thousand Oaks: Sage Publications 37 (2004) 445–456.
[14] S. K. Lam, A. Pitrou, S. Seibert, Numba: a llvm-based python JIT compiler, in: LLVM@SC, ACM, 2015, pp. 7:1–7:6.
[15] J. Aycock, A brief history of just-in-time, ACM Comput. Surv. 35 (2003) 97–113.
[16] T. E. Oliphant, A guide to NumPy, volume 1, Trelgol Publishing USA, 2006.
[17] S. van der Walt, S. C. Colbert, G. Varoquaux, The numpy array: A structure for efficient numerical computation, Comput. Sci. Eng. 13 (2011) 22–30.
[18] C. R. Harris, K. J. Millman, S. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T. E. Oliphant, Array programming with numpy, Nat. 585 (2020) 357–362.
[19] W. McKinney, et al., pandas: a foundational python library for data analysis and statistics, Python for high performance and scientific computing 14 (2011) 1–9.
[20] S. MacAvaney, A. Yates, S. Feldman, D. Downey, A. Cohan, N. Goharian, Simplified data wrangling with ir_datasets, in: SIGIR, ACM, 2021, pp. 2429–2436.
[21] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models, in: J. Vanschoren, S. Yeung (Eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/65b9eea6e1cc6bb9f0cd2a47751a186f-Abstract-round2.html.
[22] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human generated machine reading comprehension dataset, in: T. R. Besold, A. Bordes, A. S. d'Avila Garcez, G. Wayne (Eds.), Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, volume 1773 of CEUR Workshop Proceedings, CEUR-WS.org, 2016. URL: http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
[23] J. A. Aslam, M. H. Montague, Models for metasearch, in: W. B. Croft, D. J. Harper, D. H. Kraft, J. Zobel (Eds.), SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, September 9-13, 2001, New Orleans, Louisiana, USA, ACM, 2001, pp. 275–284. URL: https://doi.org/10.1145/383952.384007. doi:10.1145/383952.384007.
[24] E. A. Fox, J. A. Shaw, Combination of multiple searches, in: TREC, volume 500-215 of NIST Special Publication, National Institute of Standards and Technology (NIST), 1993, pp. 243–252.
[25] J. H. Lee, Analyses of multiple evidence combination, in: SIGIR, ACM, 1997, pp. 267–276.
[26] G. V. Cormack, C. L. A. Clarke, S. Büttcher, Reciprocal rank fusion outperforms condorcet and individual rank learning methods, in: SIGIR, ACM, 2009, pp. 758–759.
[27] D. Lillis, L. Zhang, F. Toolan, R. W. Collier, D. Leonard, J. Dunnion, Estimating probabilities for effective data fusion, in: F. Crestani, S. Marchand-Maillet, H. Chen, E. N. Efthimiadis, J. Savoy (Eds.), Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, Geneva, Switzerland, July 19-23, 2010, ACM, 2010, pp. 347–354. URL: https://doi.org/10.1145/1835449.1835508. doi:10.1145/1835449.1835508.
[28] A. Mourão, F. Martins, J. Magalhães, Multimodal medical information retrieval with unsupervised rank fusion, Comput. Medical Imaging Graph. 39 (2015) 35–45. URL: https://doi.org/10.1016/j.compmedimag.2014.05.006. doi:10.1016/j.compmedimag.2014.05.006.
[29] S. E. Robertson, S. Walker, Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval, in: Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum), ACM/Springer, 1994. doi:10.1007/978-1-4471-2099-5_24.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017.
[31] J. Lin, R. Nogueira, A. Yates, Pretrained Transformers for Text Ranking: BERT and Beyond, Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers, 2021. URL: https://doi.org/10.2200/S01123ED1V01Y202108HLT053. doi:10.2200/S01123ED1V01Y202108HLT053.