Overview of the PIR Track at FIRE 2024: Evaluation of Personalised Information Retrieval

Pranav Kasela

Marco Braga

Efrosyni Sokli

Gian Carlo Milanese

Georgios Peikos

Sandip Modha

Alessandro Raganato

Marco Viviani

Gabriella Pasi

Information and Knowledge Representation, Retrieval, and Reasoning (IKR3) Lab, Department of Informatics, Systems, and Communication (DISCo), University of Milano-Bicocca, Milan, Italy

This abstract provides a short overview of the first edition of the shared task on Personalised Information Retrieval (PIR), organized at the 16th Forum for Information Retrieval Evaluation (FIRE 2024). A more detailed discussion of the approaches used by the participating teams is available in the track overview paper. PIR 2024 consisted of two sub-tasks. The first sub-task aims to explore personalisation in community Question Answering (cQA) based on user profiles, following the standard IR pipeline. The second one, instead, aims to investigate personalisation in cQA based on user profiles using recent LLMs and prompt engineering. Although the tasks saw an enthusiastic response in registrations, with 10 teams requesting the dataset, only 1 team ultimately submitted runs, and 2 teams submitted working notes.

Keywords: Information Retrieval, Question Answering, Personalization, Large Language Model

1. Introduction

2. Task Definition

The first edition of PIR consisted of the following two sub-tasks:

2.1. Task 1: Standard IR

The cQA task will be tackled as a standard ad-hoc IR task, where the questions are considered as the queries, and the collection from which the answers will be retrieved is composed of all the answers available in the dataset. In this case, personalization can be tackled using any standard or novel technique to create a user profile and inject it into the retrieval model. We plan to provide multiple baselines that utilize, as first-stage retrievers, both classical approaches such as BM25 and neural approaches based on BERT-like models [6]. As a second stage, we plan to provide re-rankers: for non-personalized baselines, cross-encoders like Mono-T5 [7]; for personalized baselines, a mix of tags and historical documents related to the users, weighted according to their importance for the current question.
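The two-stage pipeline described above can be illustrated with a minimal sketch. It is not the track's baseline implementation: the first stage is a small in-memory Okapi BM25, and the second stage replaces the cross-encoder with a simple linear interpolation between the normalised BM25 score and a tag-overlap score. The names `BM25`, `personalised_rerank`, the interpolation weight `lam`, the tag-weight profile, and the toy answers are all assumptions introduced for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over an in-memory collection of answers."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.ids = list(docs)
        self.tf = {d: Counter(tokenize(t)) for d, t in docs.items()}
        self.lengths = {d: sum(c.values()) for d, c in self.tf.items()}
        self.avg_len = sum(self.lengths.values()) / len(self.lengths)
        df = Counter(term for c in self.tf.values() for term in c)
        n = len(self.ids)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, doc_id):
        tf = self.tf[doc_id]
        norm = self.lengths[doc_id] / self.avg_len
        total = 0.0
        for term in tokenize(query):
            if term in tf:
                f = tf[term]
                total += self.idf[term] * f * (self.k1 + 1) / (
                    f + self.k1 * (1 - self.b + self.b * norm))
        return total

    def search(self, query, k=10):
        scored = [(d, self.score(query, d)) for d in self.ids]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:k]


def personalised_rerank(candidates, doc_texts, profile, lam=0.3):
    """Interpolate the normalised first-stage score with a tag-overlap score.

    `profile` maps a user tag to an importance weight (hypothetical format).
    """
    top = max((s for _, s in candidates), default=0.0) or 1.0
    total_weight = sum(profile.values()) or 1.0
    reranked = []
    for doc_id, s in candidates:
        terms = set(tokenize(doc_texts[doc_id]))
        personal = sum(w for tag, w in profile.items() if tag in terms) / total_weight
        reranked.append((doc_id, (1 - lam) * s / top + lam * personal))
    return sorted(reranked, key=lambda x: x[1], reverse=True)


# Toy collection: every answer in the dataset is a retrievable document.
answers = {
    "a1": "use olive oil when you cook pasta sauce",
    "a2": "the pasta dish comes from a roman history tradition",
    "a3": "train schedules change in winter",
}
question = "pasta sauce"
user_profile = {"history": 3.0, "roman": 1.0}  # assumed tag weights

first_stage = BM25(answers).search(question, k=2)
personalised = personalised_rerank(first_stage, answers, user_profile, lam=0.6)
```

In this toy setting the first stage ranks `a1` above `a2`, while the personalised second stage, with a sufficiently large `lam`, promotes `a2` because it matches the user's tags. How strongly the profile should influence the final ranking is a design choice that would need tuning on validation data.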

2.2. Task 2: Prompt-based IR

Differently from the second stage of the standard IR task, the proposed prompt-based baselines personalise the results by using models like Phi [8] and GPT [9] with prompts similar to the following one: “To which degree between 0 and 1 does the document [DOCUMENT] answer the question [QUESTION], and is relevant to a user with the following profile [USER PROFILE]”, where the [USER PROFILE] is a series of user interests that are inferred from their activities and ordered according to their timestamp (most recent first) and importance.
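As a rough illustration of how such a prompt could be assembled, the sketch below fills the template quoted above and parses a numeric score from a model reply. The `Interest` record, its fields, the choice to use importance as a tie-breaker after the timestamp, the `max_interests` cut-off, and the `parse_score` helper are assumptions made for this example; the call to the language model itself is deliberately left out.

```python
import re
from dataclasses import dataclass

PROMPT_TEMPLATE = (
    "To which degree between 0 and 1 does the document [DOCUMENT] answer the "
    "question [QUESTION], and is relevant to a user with the following "
    "profile [USER PROFILE]"
)


@dataclass
class Interest:
    label: str
    timestamp: int     # time of the activity the interest was inferred from (assumed)
    importance: float  # importance weight of the interest (assumed)


def build_prompt(document, question, interests, max_interests=5):
    """Fill the template; interests are sorted most recent first, then by importance."""
    ordered = sorted(interests, key=lambda i: (i.timestamp, i.importance), reverse=True)
    profile = ", ".join(i.label for i in ordered[:max_interests])
    fields = {"[DOCUMENT]": document, "[QUESTION]": question, "[USER PROFILE]": profile}
    # Single-pass substitution, so placeholders inside the inputs are not re-expanded.
    return re.sub(r"\[(?:DOCUMENT|QUESTION|USER PROFILE)\]",
                  lambda m: fields[m.group(0)], PROMPT_TEMPLATE)


def parse_score(reply):
    """Return the first number in the reply, clamped to [0, 1], or None if absent."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group(0))))


interests = [
    Interest("medieval history", timestamp=100, importance=0.9),
    Interest("cooking", timestamp=300, importance=0.2),
    Interest("latin", timestamp=300, importance=0.7),
]
prompt = build_prompt("Rome fell in 476.", "When did Rome fall?", interests)
```

The resulting scores can then be used to re-rank the candidates returned by a first-stage retriever. Clamping the parsed value is one possible way to handle replies that fall outside the requested range.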

More details about how this dataset was created can be found in the original resource paper [1].

3. Dataset and Evaluation

In this section, we discuss the datasets for each sub-task and the evaluation metrics used for each of them.

The PIR-FIRE track will use data from StackExchange, a very popular community Question Answering (cQA) platform. The data is publicly available under a CC BY-SA 4.0 license. The dataset is composed of questions and their answers, collected from fifty communities, which can be categorized under the large umbrella of humanistic communities. In Table 1 we report the basic statistics for the dataset. Specifically: document length, measured in number of words; document score, which is the difference between the number of up- and down-votes assigned by the community; answers’ count, the number of answers given to a question; comments’ count, the number of user comments on a given question or answer; favorite count, which indicates the number of users that flagged the question as their favorite, showing their interest in that topic; and tags count, the number of tags associated with the question by the asking user.

The dump is curated and merged so that the cQA task can be tackled as a retrieval task. The PIR-FIRE test collections provide the traditional components used in IR experiments, i.e. a document collection, search topics, and the corresponding relevance judgments. Regarding the judgments, we consider relevant only the single answer that is explicitly labelled as the best answer by the user who submitted the question. In addition, our PIR evaluation test collections [1] are accompanied by user-related information for modelling and introducing profiles in evaluation experiments. The user-related information includes the text and the number of views of the documents the users have generated and, in many cases, the tags associated with these documents, the date on which the users registered on the website, the badges they obtained, their reputation score, and sometimes also their autobiography. This information can be used for personalising and adapting the search process to the current user, e.g. by creating and exploiting personal user profiles.

3.0.1. Evaluation setup

We will provide the participants with several baselines, including keyword-based and dense representations of user profiles (anonymised information gathered about individual users), as part of our data collection. For this shared task, we will use traditional evaluation metrics from the IR literature that can also be applied to personalized search: Precision (P), Recall (R), Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and (normalized) Discounted Cumulative Gain (nDCG).
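To make the evaluation setup concrete, the following sketch computes the listed metrics with binary relevance for a toy run in which, as in the judgments described above, each question has exactly one relevant answer. The query and answer identifiers, the cut-off `k = 3`, and the helper names are hypothetical; official scores would normally be computed with a standard evaluation tool. Rankings are assumed to contain no duplicate answers.

```python
import math


def precision_at_k(ranking, relevant, k):
    return sum(d in relevant for d in ranking[:k]) / k


def recall_at_k(ranking, relevant, k):
    return sum(d in relevant for d in ranking[:k]) / len(relevant)


def reciprocal_rank(ranking, relevant):
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranking, relevant, k):
    """nDCG with binary gains and a log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranking[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0


def mean_over_queries(metric, run, qrels, **kwargs):
    """Average a per-query metric over all judged queries."""
    return sum(metric(run.get(q, []), rel, **kwargs) for q, rel in qrels.items()) / len(qrels)


# Toy judgments: the accepted (best) answer is the only relevant document.
qrels = {"q1": {"a7"}, "q2": {"a3"}}
run = {"q1": ["a2", "a7", "a9"], "q2": ["a3", "a1", "a4"]}

mrr = mean_over_queries(reciprocal_rank, run, qrels)
mean_ap = mean_over_queries(average_precision, run, qrels)
p3 = mean_over_queries(precision_at_k, run, qrels, k=3)
r3 = mean_over_queries(recall_at_k, run, qrels, k=3)
ndcg3 = mean_over_queries(ndcg_at_k, run, qrels, k=3)
```

With a single relevant answer per question, Average Precision reduces to the reciprocal rank of that answer, so MAP and MRR coincide on full rankings, and Precision at cut-off k can never exceed 1/k.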

4. Participation

A total of 10 teams registered across both sub-tasks. Only one team (Word Wizards) submitted runs, for Task 1. Across both tasks, 2 teams submitted working notes.

Table 2 shows the performance of our proposed baselines and of the submitted runs.

5. Conclusion

The PIR track at FIRE 2024 focused on the evaluation of Personalised Information Retrieval (PIR), which remains an important topic both in research and in the development of practical applications.

In future efforts, we plan to address the challenges encountered by making datasets more manageable, enhancing promotion, and broadening the scope of personalization to include diverse tasks in Information Retrieval (IR) and possibly Natural Language Processing (NLP). By providing resources such as pre-trained models, smaller datasets, and novel tasks, we aim to encourage a stronger focus on personalization across varied domains. These steps will help attract a broader range of participants and methodologies, driving greater engagement and impact.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools.

[1] P. Kasela, M. Braga, G. Pasi, R. Perego, Se-pqa: Personalized community question answering, in: Companion Proceedings of the ACM on Web Conference 2024, WWW '24, Association for Computing Machinery, New York, NY, USA, 2024, p. 1095–1098. URL: https://doi.org/10.1145/3589335.3651445. doi:10.1145/3589335.3651445.

[2] M. Braga, Personalized large language models through parameter efficient fine-tuning techniques, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 3076–3076.

[3] M. Braga, P. Kasela, A. Raganato, G. Pasi, Synthetic data generation with large language models for personalized community question answering, arXiv preprint arXiv:2410.22182 (2024).

[4] P. Kasela, G. Pasi, R. Perego, N. Tonellotto, Desire-me: Domain-enhanced supervised information retrieval using mixture-of-experts, in: European Conference on Information Retrieval, Springer, 2024, pp. 111–125.

[5] A. Salemi, S. Mysore, M. Bendersky, H. Zamani, Lamp: When large language models meet personalization, 2023. arXiv:2304.11406.

[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[7] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence model, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 708–718. URL: https://aclanthology.org/2020.findings-emnlp.63. doi:10.18653/v1/2020.findings-emnlp.63.

[8] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, H. Behl, et al., Phi-3 technical report: A highly capable language model locally on your phone, arXiv preprint arXiv:2404.14219 (2024).

[9] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).