<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ECNU at CLEF PIR 2018: Evaluation of Personalized Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qingchun Bai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiayi Chen</string-name>
          <email>jycheng@ica.stc.sh.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qinmin Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liang He</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Ryerson University</institution>
          ,
          <addr-line>Toronto</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Software Engineering, East China Normal University</institution>
          ,
          <addr-line>Shanghai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Personalized Information Retrieval (PIR) is an effective solution when users issue queries with different purposes but receive the same results. The PIR-CLEF 2018 task aims to explore methods and evaluations for PIR. By analyzing the provided data, we generate query-level and session-level baselines. We compare these baselines with the extended models we propose, and the experimental results show that insufficient relevance information has a negative impact on the performance of both the models and the evaluation process. Since personalized ranking based on typical users' interests is not effective in practice, especially when the results of relevance feedback are unsatisfactory, we argue that the PIR task should relate not only to context but also to the users' various search intentions. We offer several suggestions concerning the data and the evaluation process.</p>
      </abstract>
      <kwd-group>
        <kwd>Personalized Information Retrieval</kwd>
        <kwd>Query Expansion</kwd>
        <kwd>Data Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The PIR-CLEF 2018 task aims to explore methods and evaluations for
Personalized Information Retrieval (PIR). PIR has drawn great attention as a way to
understand users' behavior in their interaction with IR systems. Personalized
search is an effective solution when users issue queries with different purposes but
receive the same results.</p>
      <p>
        Existing works [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3,5,4</xref>
        ] have shown that personalized ranking is a good solution for the
PIR task. The foundation of a personalized ranking service is the construction of a
model of the user's interests. Various personalization strategies have been proposed.
For example, in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a method is
discussed to identify a user's interests automatically, based on the
assumption that a user's general preferences may help the search engine disambiguate
the true intention of a query. The approach described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] considered a user's
prior interactions with a wide variety of content to personalize that user's
current web search. More recently, in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a dynamic personalized ranking model was
proposed to recommend the most relevant information by combining different
sources of information.
      </p>
      <p>
        Another line of research focuses on understanding the user's
intent from search session information [
        <xref ref-type="bibr" rid="ref10 ref2 ref6 ref9">2,6,9,10</xref>
        ]. Results show that it is possible to
understand the user's intent, since people have intentions throughout the process of
seeking information, and these information-seeking intentions can be inferred
from their behavior.
      </p>
      <p>These prior works on personalized information retrieval have focused on
independent issues with independent data. Few of them have focused on
the analysis of the required data and the evaluation of personalized ranking. In fact, we
consider that personalized ranking based on typical users' interests is not
effective in practice: when the results of relevance feedback are poor,
the re-ranking model cannot achieve the desired result.</p>
      <p>Therefore, in this study, we aim to explore the potential of the task and to
understand both the data and the evaluation of personalized search, asking the
following research questions:
- Can the provided PIR data satisfy the task?
- How can personalization be achieved, and what kind of data is needed to support
this research?
- How should it be evaluated?
To achieve these aims, this paper is organized as follows. In Section 2, we briefly
review and analyze the current PIR data. In Section 3, we describe our baseline
methods in detail. In Section 4, experiments and results are presented.
Finally, we discuss the task and its evaluation in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Data Review and Analysis</title>
      <p>We present statistics about the dataset in this section: Section 2.1 reviews the
current dataset and explores its potential, and Section 2.2 analyzes the query
sessions given in the data.</p>
      <p>Statistics about Dataset In this section, we briefly review the current dataset of
the PIR task and provide a comprehensive analysis of the task. In PIR-CLEF 2018,
the data are provided as six CSV files containing the information below:
- the search tasks (sessions) of ten users;
- the queries submitted by all users and all documents returned by the ClueWeb API;
- relevance scores labeled by users and the original ranks of documents;
- personal information such as gender and job;
- remarks written by users;
- statistical information about the terms in the queries.</p>
      <p>Statistics about Sessions A user can submit several queries within a query
session. These queries aim at different objectives. To find the overall objective of
the user, we gather all the queries into one query that represents the objective. We
then submit this query to the API and evaluate the performance.</p>
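      <p>For illustration, the sketch below shows one way to merge the queries of a session and submit the merged query to the ClueWeb API (endpoint as given in Section 3). The CSV column names (session_id, query) and the response handling are assumptions, since the exact file layout is not reproduced here.</p>
      <preformat>
# Minimal sketch: merge the queries of each session into one query and
# submit it to the ClueWeb API. Column names and the response format
# are assumptions for illustration.
import csv
from collections import defaultdict
import requests

API = "http://clueweb.adaptcentre.ie/WebSearcher/search"

def merge_session_queries(path):
    """Join each session's queries into one query, deduplicating terms."""
    sessions = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            sessions[row["session_id"]].append(row["query"])
    merged = {}
    for sid, queries in sessions.items():
        seen, terms = set(), []
        for q in queries:
            for t in q.lower().split():
                if t not in seen:
                    seen.add(t)
                    terms.append(t)
        merged[sid] = " ".join(terms)
    return merged

def search(query, page=1):
    """Submit one query string to the API; the response format is assumed."""
    r = requests.get(API, params={"query": query, "page": page}, timeout=30)
    r.raise_for_status()
    return r.text
      </preformat>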
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>The users submit their queries to the ClueWeb API
(http://clueweb.adaptcentre.ie/WebSearcher/search?query=queryString&amp;page=pagenumber)
and annotate whether the returned documents are relevant. The users grade relevance
on four levels: relevant, somewhat relevant, not relevant, and off topic, with scores
ranging from four down to one. Following this description, we define documents as
relevant to the query only when their score is four. Figure 1 shows the
framework of the personalized ranking part.</p>
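      <p>As a small sketch of this definition, the mapping below binarizes the four grades exactly as described, treating only the top grade as relevant (the grade strings are assumed to match the labels above):</p>
      <preformat>
# The four user-assigned grades map to scores 4..1; only score 4
# counts as relevant under the definition above.
GRADE_SCORES = {
    "relevant": 4,
    "somewhat relevant": 3,
    "not relevant": 2,
    "off topic": 1,
}

def is_relevant(grade: str) -> bool:
    return GRADE_SCORES[grade.lower()] == 4
      </preformat>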
      <p>[Figure 1. Framework of the personalized ranking part: a query (Q), its query session (QS), and the user (U) form a testing sample; results from the ClueWeb API are re-ranked via the topic-sensitive user model and query expansion to produce a personalized ranking score(Q, U, D).]</p>
      <p>Baselines We propose two baselines: a query-level baseline and a session-level
baseline. In the query-level baseline, we evaluate each query independently. In the
session-level baseline, we collect the relevant documents of all queries in a session,
treat them as the relevant documents of the search task, and evaluate
the performance of each query on its search task. We also assume that the queries
belonging to one session represent different aspects of the user's need, so we merge
all queries in a session into one and submit it to the ClueWeb API. The performance
of each session-level baseline is reported in Table 1.</p>
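      <p>A minimal sketch of the session-level evaluation follows: the relevant documents of all queries in a session are pooled, and each query's ranking is scored against that pool. The data structures (sets of document IDs per query) are assumptions for illustration, and precision at k stands in for whichever metrics Table 1 reports.</p>
      <preformat>
# Sketch of the session-level baseline: pool the relevant documents of
# all queries in a session and score each query against that pool.
from typing import Dict, List, Set

def session_qrels(per_query_rels: Dict[str, Set[str]]) -> Set[str]:
    """Union of the relevant documents of every query in the session."""
    pooled = set()
    for rels in per_query_rels.values():
        pooled |= rels
    return pooled

def precision_at_k(ranking: List[str], rels: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are in the pooled set."""
    return sum(1 for d in ranking[:k] if d in rels) / k
      </preformat>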
      <p>Query Expansion We first take the user's feedback into account. After the user
labels the documents with relevance scores, we choose the ten most frequent
words in the relevant documents as expansion terms and add them to the original
session-level query. We submit the new queries to the API and evaluate the
performance of this method. We then assume that the top-20 documents returned by
the API are relevant and select the ten most frequent words in those documents as
expansion terms. The first method is listed in Table 2 with the suffix "RF", while
the second one is named with the suffix "PRF".</p>
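      <p>The sketch below illustrates both expansion strategies: for "RF" the user-judged relevant documents are passed in, while for "PRF" the top-20 documents returned by the API are passed in instead. Whitespace tokenization is a simplifying assumption.</p>
      <preformat>
# RF picks expansion terms from the user-judged relevant documents;
# PRF passes the top-20 API results instead. Whitespace tokenization
# is a simplifying assumption.
from collections import Counter
from typing import Iterable, List

def expansion_terms(docs: Iterable[str], n_terms: int = 10) -> List[str]:
    """Ten most frequent terms across the given documents."""
    counts = Counter(t for doc in docs for t in doc.lower().split())
    return [term for term, _ in counts.most_common(n_terms)]

def expand_query(session_query: str, docs: Iterable[str]) -> str:
    """Append the expansion terms to the session-level query."""
    return session_query + " " + " ".join(expansion_terms(docs))
      </preformat>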
      <p>Topic-Sensitive User Model We propose a language-modeling approach to
personalized search based on users' search behavior and preferences. To capture
the user's search interests and implicit purpose, we use an
LDA-based approach for modeling users, which does not merely simulate
the search behaviors but also considers the search sessions of the task.</p>
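      <p>One possible realization of this LDA-based user model, sketched with the gensim library (a choice made here for illustration; the paper does not prescribe an implementation): each user is represented by the topic distribution of the documents they judged relevant, and a query can then be compared with the user's distribution.</p>
      <preformat>
# Sketch of an LDA-based user model with gensim (library choice is an
# assumption). The user is modeled by the topics of their relevant docs.
from gensim import corpora, models

def build_user_model(relevant_doc_tokens, num_topics=20):
    """relevant_doc_tokens: list of token lists, one per relevant document."""
    dictionary = corpora.Dictionary(relevant_doc_tokens)
    corpus = [dictionary.doc2bow(toks) for toks in relevant_doc_tokens]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    return dictionary, lda

def topic_distribution(dictionary, lda, tokens, num_topics=20):
    """Dense topic vector for one document or query."""
    bow = dictionary.doc2bow(tokens)
    dense = [0.0] * num_topics
    for topic_id, weight in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = weight
    return dense
      </preformat>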
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <sec id="sec-4-1">
        <title>Performance of Each Query Session</title>
        <p>We further make a comparison between the baselines and our methods in Table 2.
In the query-level evaluation, our methods are worse than the baselines. In the
session-level evaluation, the relevance feedback method performs better than the
baselines because we can obtain the users' interests from the relevant documents.
However, the pseudo relevance feedback and LDA methods perform worse than the baseline.</p>
        <p>It is within our expectation that all query-level evaluations are worse than
the baseline. We think this phenomenon is caused by the lack of relevant
documents. In the provided data, each user labels only about twenty documents
whose original ranks range from 0 to 100, and few of them are relevant. Assuming
the users get 100 documents returned per query, eighty percent of the relevance
information is lost. In this scenario, any document not occurring in the judged list is
considered irrelevant, which means a relevant document can be counted as
irrelevant from the perspective of the evaluation. Insufficient relevance information
even makes it hard to evaluate certain queries: in sessions 154, 176, and 204, all
judged documents are irrelevant, so even if we find potentially relevant
documents, we cannot know whether the users would consider them relevant.</p>
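        <p>The following sketch makes this evaluation problem concrete: under the standard pooling assumption, any unjudged document counts as irrelevant, so a run that retrieves genuinely relevant but unjudged documents is penalized. (Average precision here stands in for the task's official metrics.)</p>
        <preformat>
# With incomplete judgments, any unjudged document counts as irrelevant,
# so retrieving relevant-but-unjudged documents is penalized.
def average_precision(ranking, judged_relevant):
    """AP against the judged set; unjudged documents score as irrelevant."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in judged_relevant:
            hits += 1
            score += hits / i
    return score / max(len(judged_relevant), 1)
        </preformat>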
        <p>We also think the evaluation process should be upgraded. Unlike existing
search tasks such as the TREC tracks, personalized information retrieval focuses more on
individual differences. In PIR-CLEF 2018, some users receive the same task but
the queries they submit are different. For example, users 8, 11, and 12 receive the
same task of traveling, but their queries are about Dublin, Tokyo, and Barcelona.
The individual differences are expressed by the queries, so we think this task is still
an ad-hoc retrieval task. If we want to focus on individual differences, we
need more users to join the data collection.</p>
        <p>We suggest that the complete logs of the users be provided. By analyzing the
relevant documents annotated by the users, we obtain an improvement at the session
level, as listed in Table 2. However, our method can still be improved. In this task
we are provided with users' actions such as opening a document and submitting a query,
but we think these data are not sufficient: only part of the actions are
provided, so we cannot analyze the users' preferences from their actions.</p>
        <p>In conclusion, we put forward three suggestions. First, more
complete relevance labels should be provided. Second, more participants
should join the data collection to provide more personalized data. Third,
we think detailed user actions can help improve performance.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have proposed a view of the PIR task which implies that personalization should
be with respect not only to context but also to the various information that people
gather during the course of an information search session. We focus on taking the
user's feedback into account and propose two extended models: a query expansion
method and a topic-sensitive user model. We first conduct experiments on each query
session; the results show wide variations in performance across sessions.
Then we compare the baselines with the extended models. Noting that the topic-sensitive
strategy does not work very well, we conclude that insufficient relevance information
has a negative impact on the performance of the models and on the evaluation process.
In the future, we will extract more useful features and focus on learning-to-rank
approaches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Ali</surname>
          </string-name>
          .
          <article-title>Dynamic personalized ranking of facets for exploratory search</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>1379</fpage>
          –
          <lpage>1379</lpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Carevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lusky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>van Hoek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Mayr</surname>
          </string-name>
          .
          <article-title>Investigating exploratory search activities based on the stratagem level in digital libraries</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>21</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Wen</surname>
          </string-name>
          .
          <article-title>A large-scale evaluation and analysis of personalized search strategies</article-title>
          .
          <source>In International Conference on World Wide Web</source>
          , pages
          <fpage>581</fpage>
          –
          <lpage>590</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <source>Multiple Attribute Aware Personalized Ranking</source>
          . Springer International Publishing,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Personalized ranking with pairwise factorization machines</article-title>
          .
          <source>Neurocomputing</source>
          , 214(C):
          <fpage>191</fpage>
          –
          <lpage>200</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitsui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Belkin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>Predicting information seeking intentions from search behaviors</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>1121</fpage>
          –
          <lpage>1124</lpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>F.</given-names>
            <surname>Qiu</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          .
          <article-title>Automatic identi cation of user interest for personalized search</article-title>
          .
          <source>In Proceedings of the 15th international conference on World Wide Web</source>
          , pages
          <fpage>727</fpage>
          –
          <lpage>736</lpage>
          . ACM,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Teevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          .
          <article-title>Personalizing search via automated analysis of interests and activities</article-title>
          .
          <source>In ACM SIGIR Forum</source>
          , volume
          <volume>51</volume>
          , pages
          <fpage>10</fpage>
          –
          <lpage>17</lpage>
          . ACM,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>W.</given-names>
            <surname>van Hoek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Carevic</surname>
          </string-name>
          .
          <article-title>Building user groups based on a structural representation of user search sessions</article-title>
          .
          <source>In International Conference on Theory and Practice of Digital Libraries</source>
          , pages
          <fpage>459</fpage>
          –
          <lpage>470</lpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Session search modeling by partially observable markov decision process</article-title>
          .
          <source>Information Retrieval Journal</source>
          ,
          <volume>21</volume>
          (
          <issue>1</issue>
          ):
          <fpage>56</fpage>
          –
          <lpage>80</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>