Towards Result Delta Prediction Based on Knowledge Deltas for Continuous IR Evaluation

Gabriela Gonzalez-Saez

Alaa El-Ebshihy

Tobias Fink

Petra Galuščáková

Florina Piroi

David Iommi

Lorraine Goeuriot

Philippe Mulhem

LIG, Grenoble, France; Research Studios Austria, Data Science Studio, Vienna, AT

The continuous evaluation of Information Retrieval (IR) systems requires comparing IR systems not only to one another but also across collections, in other words across different evaluation environments (test collection and evaluation metrics). These evaluation environments may also be evolving versions of a given evaluation environment. In this work, we propose a methodology to measure and understand the impact that the differences between test collection representations (i.e. knowledge delta, Δ) have on system performance, and we look at the differences in their outputs (i.e. result delta, ℛΔ). We present initial experiments with various text representations on the TREC 2004 Robust Collection, and look at the relation between the Δ and the ℛΔ.

Keywords: Continuous Evaluation, Evolving Test Collections, Knowledge Delta, Result Delta

1. Introduction

means of defining Knowledge Delta (Δ) and observing its impact on the ℛΔ. In our view, Δ for IR is a combination of a document representation delta and a query representation delta, both defined as difference functions between pairs of text sequence representations.

This paper proposes a study of how various simple text representations can be used to quantify Δ, and of their impact on ℛΔ. Initial experiments are performed on the TREC 2004 Robust Collection [3]. As this collection stores the publishing time of each document, we consider it to be an evolving collection. That is, we can simulate the conditions of an IR system that has to answer queries from a set of documents that changes over time.

2. Methodology

Mothe [4] analysed different approaches to understanding the effectiveness of IR systems, focusing on effectiveness with respect to the query and the IR system parameters. In our work, we are interested in how the performance of an IR system changes when the document collection changes, in addition to the query, so that we can predict the change in performance of the IR system on an evolving test collection. Inspired by [4], we aim to use document representations as features of the document collection and to find the correlation between these features and the change in IR system performance.

Test collection difference, Δ: We define Δ as a quantifiable value of the differences between document representations, which may be more or less complex: bag of words, TF-IDF [5], topic detection methods (e.g. Latent Dirichlet Allocation [6] and conceptual embeddings [7]) and neural network language models (e.g. Word2Vec [8] and BERT [9]). Any of these representations, or a combination of them, may contribute to the document collection representation, which can then be used to quantify Δ and predict the ℛΔ.

Performance impact, ℛΔ: We define ℛΔ as the absolute difference in IR system performance between two evaluation environments (EEs): let M(S, EE) be the performance of a system S evaluated in evaluation environment EE with metric M; we compute ℛΔ as M(S1, EE1) − M(S1, EE2).

Prediction model, (Δ ∼ ℛΔ): We propose to understand the impact of Δ on ℛΔ by building a model that predicts ℛΔ from Δ. First, we will observe the correlation between Δ and ℛΔ using different text representation methods as Δ. Then, we will build a prediction model based on these observations. Finally, we will analyse the impact of the Δ elements on the prediction of the ℛΔ using feature selection techniques [10].

Dataset: We measure Δ and ℛΔ on an evolving test collection as an example of documents changing in a real corpus. The evolving test collection is built by creating shards of a classical test collection [11] that contains timestamped documents. We use these timestamps to assign documents to shards in temporal order and to fix the percentage of corpus overlap between shards, which controls the evolution.
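The shard construction and the ℛΔ computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed shard size, the sliding-window construction, and the toy corpus are all hypothetical.

```python
from typing import List, Sequence, Tuple


def make_shards(docs: Sequence[Tuple[str, float]],
                shard_size: int,
                overlap: float) -> List[List[str]]:
    """Split timestamped documents into temporally ordered, overlapping shards.

    docs holds (doc_id, timestamp) pairs. shard_size and overlap are
    assumed parameters; the paper only states that overlap is fixed.
    """
    if shard_size <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("shard_size must be positive and overlap in [0, 1)")
    # Order document ids by publication time.
    ordered = [doc_id for doc_id, _ in sorted(docs, key=lambda d: d[1])]
    # Each new shard drops the oldest documents and adds newer ones.
    step = max(1, round(shard_size * (1.0 - overlap)))
    return [ordered[start:start + shard_size]
            for start in range(0, len(ordered) - shard_size + 1, step)]


def overlap_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of the documents in shard a that also appear in shard b."""
    return len(set(a) & set(b)) / len(a)


def result_delta(score_ee1: float, score_ee2: float) -> float:
    """Result delta: one system's metric in EE1 minus its metric in EE2."""
    return score_ee1 - score_ee2


# Toy corpus: 100 documents with increasing timestamps (made-up data).
corpus = [(f"d{i}", float(i)) for i in range(100)]
shards = make_shards(corpus, shard_size=10, overlap=0.9)
```

Under this sliding-window assumption, shards that are five steps apart share half of their documents when successive shards overlap by 90%.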

Initial Experiment: We evaluate the PyTerrier BM25 system [12] on an evolving test collection created from Robust [3], using the MAP metric. We create 41 shards with 90% document overlap between successive shards, each used with the full set of topics. As text representations, we test two features used in query performance prediction [13]: Averaged Term Weight Variability (avVAR) [14] and Averaged Collection Query Similarity (avSCQ) [14]. We compare EEs with 50% overlap (e.g. EE 1 vs. EE 6, EE 2 vs. EE 7, etc.). Figure 1 presents the changes in the MAP score (ℛΔ) compared with the Δ calculated as the changes in the selected feature values: avVAR in (a) and avSCQ in (b). The Pearson correlation between the change in MAP and the change in the feature is 0.5 for avVAR and 0.12 for avSCQ. These results confirm that changes in Δ have a considerable effect on ℛΔ values. Moreover, they show that the effect might differ substantially between features and over time.
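The comparison of feature changes with MAP changes can be illustrated with a short sketch. It assumes one MAP score and one feature value per EE, pairs each EE with the EE five positions later, and computes the Pearson correlation of the two difference series. The helper names and all numbers below are made up for illustration and are not the paper's data.

```python
import math
from typing import List, Sequence


def paired_deltas(values: Sequence[float], lag: int) -> List[float]:
    """Difference between each EE's value and that of the EE `lag` positions later."""
    return [values[i] - values[i + lag] for i in range(len(values) - lag)]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-EE values, one entry per EE in temporal order.
map_scores = [0.250, 0.248, 0.252, 0.245, 0.240,
              0.243, 0.238, 0.241, 0.235, 0.230]
feature_values = [1.10, 1.12, 1.09, 1.15, 1.18,
                  1.16, 1.20, 1.17, 1.22, 1.25]

# Knowledge delta (feature change) and result delta (MAP change) per EE pair.
knowledge_deltas = paired_deltas(feature_values, lag=5)
result_deltas = paired_deltas(map_scores, lag=5)
r = pearson(knowledge_deltas, result_deltas)
```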

3. Discussion and Future Work

We propose the definition of Knowledge Delta (Δ) for the elements of the EEs. As a first attempt to quantify the Δ and its impact on the Result Delta (ℛΔ), we use two simple text representation metrics, avVAR and avSCQ. We experiment on an evolving test collection built using the timestamps of the Robust test collection. The initial results show a correlation between Δ and the ℛΔ and thus provide justification for our approach. These results motivate us to build a prediction model (Δ ∼ ℛΔ) that can predict the change in performance of an IR system using the Δ, and also to quantify Δ using different text representations (see Section 2). We plan either to construct a machine learning model that takes Δ as an input feature to predict ℛΔ, or to use time series techniques [15] to predict significant changes in Δ, which lead to changes in the performance of the IR system. Moreover, we plan to define other types of Δ and ℛΔ, such as quantifying the differences in query representations, and to apply them to the LongEval collection [16]. This will help us understand the impact of the Δ on other types of ℛΔ.
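As a rough illustration of the machine learning variant of this plan, the sketch below fits a one-feature ordinary least squares line that predicts ℛΔ from Δ. The choice of a linear model, the function names, and the toy numbers are assumptions made only for this example. The text does not specify which model will be used.

```python
from typing import Sequence, Tuple


def fit_line(deltas: Sequence[float],
             result_deltas: Sequence[float]) -> Tuple[float, float]:
    """Least squares fit of result_delta = slope * delta + intercept."""
    n = len(deltas)
    if n != len(result_deltas) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(deltas) / n
    mean_y = sum(result_deltas) / n
    sxx = sum((x - mean_x) ** 2 for x in deltas)
    if sxx == 0:
        raise ValueError("the input feature must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(deltas, result_deltas))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(delta: float, slope: float, intercept: float) -> float:
    """Predicted result delta for a new knowledge delta value."""
    return slope * delta + intercept


# Made-up training pairs that lie exactly on a line, for illustration only.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
predicted = predict(4.0, slope, intercept)
```

A model with several Δ features would need a multivariate fit, and the feature selection step cited in Section 2 [10] would then indicate which features carry the prediction.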

Acknowledgments

This work is supported by ANR Kodicare bi-lateral project, grant ANR-19-CE23-0029 of the French Agence Nationale de la Recherche, and by the Austrian Science Fund FWF grant I4471-N.

References

[1] Sanderson, Test collection based evaluation of information retrieval systems, Now Publishers Inc, 2010.

[2] G. N. González-Sáez, Mulhem, Goeuriot, Towards the evaluation of information retrieval systems on evolving datasets with pivot systems, in: K. S. Candan, Ionescu, Goeuriot, Larsen, Müller, Joly, Maistro, Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2021, pp. 91–102.

[3] E. M. Voorhees, The trec 2005 robust track, in: ACM SIGIR Forum, volume 40, ACM New York, NY, USA, 2006, pp. 41–48.

[4] Mothe, Analytics methods to understand information retrieval effectiveness-a survey, Mathematics 10 (2022).

[5] Salton, Wong, C. S. Yang, A vector space model for automatic indexing, Commun. ACM 18 (1975) 613–620.

[6] Jelodar, Wang, Yuan, Feng, Jiang, Li, Zhao, Latent dirichlet allocation (LDA) and topic modeling: models, applications, a survey, Multim. Tools Appl. 78 (2019) 15169–15211.

[7] Abdulahhad, Concept embedding for information retrieval, in: G. Pasi, Piwowarski, Azzopardi, A. Hanbury (Eds.), Advances in Information Retrieval - 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings, volume 10772 of Lecture Notes in Computer Science, Springer, 2018, pp. 563–569.

[8] Mikolov, I. Sutskever, Chen, G. S. Corrado, Dean, Distributed representations of words and phrases and their compositionality, in: C. J. C. Burges, L. Bottou, Z. Ghahramani, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 3111–3119.

[9] Devlin, Chang, Lee, Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186.

[10] Déjean, R. T. Ionescu, Mothe, M. Z. Ullah, Forward and backward feature selection for query performance prediction, in: Proceedings of the 35th annual ACM symposium on applied computing, 2020, pp. 690–697.

[11] Ferro, Kim, Sanderson, Using collection shards to study retrieval performance effect sizes, ACM Transactions on Information Systems (TOIS) 37 (2019) 1–40.

[12] Macdonald, Tonellotto, Declarative experimentation in information retrieval using pyterrier, in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, 2020, pp. 161–168.

[13] Hauff, Predicting the effectiveness of queries and retrieval systems, in: SIGIR Forum, volume 44, 2010, p. 88.

[14] Zhao, Scholer, Tsegay, Effective pre-retrieval query performance prediction using similarity and variability evidence, in: Advances in Information Retrieval: 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008. Proceedings 30, Springer, 2008, pp. 52–64.

[15] C. Chatfield, The analysis of time series: an introduction, Chapman and hall/CRC, 2003.

[16] P. Galuščáková, R. Deveaud, G. Gonzalez-Saez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel, Longeval-retrieval: French-english dynamic test collection for continuous web search evaluation, arXiv preprint arXiv:2303.03229 (2023).