ECNU at CLEF PIR 2018: Evaluation of Personalized Information Retrieval

Qingchun Bai1, Jiayi Chen1, Qinmin Hu2 and Liang He1

1 School of Computer Science and Software Engineering, East China Normal University, Shanghai, China
{qchbai, jychen}@ica.stc.sh.cn, lhe@cs.ecnu.edu.cn
2 Department of Computer Science, Ryerson University, Toronto, Canada
vivian@ryerson.ca

Abstract. Personalized Information Retrieval (PIR) is an effective solution when queries are issued with different purposes but all users receive the same results. The PIR-CLEF 2018 task aims to explore methods and evaluation for PIR. By analyzing the provided data we generate query-level and session-level baselines. We compare these baselines with the extended models we propose, and the experimental results show that insufficient relevance information has a negative impact both on the performance of the models and on the evaluation process. Since personalized ranking based on typical user interests is not effective in practice, especially when the results of relevance feedback are unsatisfactory, we consider that the PIR task should relate not only to context but also to the user's various search intentions. We offer several suggestions regarding the data and the evaluation process.

Keywords: Personalized Information Retrieval; Query Expansion; Data Analysis

1 Introduction

The PIR-CLEF 2018 task aims to explore the methods and evaluation of Personalized Information Retrieval (PIR). PIR has drawn great attention as a means of understanding users' behaviour in their interaction with IR systems. Personalized search is an effective solution when queries are issued with different purposes but all users receive the same results.

Existing work [3,5,4] has shown that personalized ranking is a good solution for the PIR task. The foundation of a personalized ranking service is how to build a model of the user's interests, and various personalization strategies have been proposed. In [7], a method is described that identifies a user's interests automatically, based on the assumption that a user's general preferences may help the search engine disambiguate the true intention of a query. The approach described in [8] considers a user's prior interactions with a wide variety of content to personalize that user's current web search. More recently, [1] proposed a dynamic personalized ranking model that combines different sources of information to recommend the most relevant results. Another line of research focuses on understanding the user's intent from search session information [2,6,9,10]; the results show that it is possible to understand this intent, since people have intentions throughout the process of seeking information and these information-seeking intentions can be predicted from search behaviour.

These prior works on personalized information retrieval have focused on independent issues with independent data, and few of them have examined the data required for personalized ranking or its evaluation. In fact, we consider that personalized ranking based on typical user interests is not effective in practice: in particular, when the results of relevance feedback are poor, the re-ranking model cannot achieve the desired result. Therefore, in this study we aim to explore the potential of the task and to understand both the data and the evaluation of personalized search, asking the following research questions:

– Can the provided PIR data satisfy the task?
– How can personalization be achieved, and what kind of data is needed to support this research?
– How should it be evaluated?

To achieve these aims, the paper is organized as follows. In Section 2, we briefly review and analyze the current PIR data. In Section 3, we describe our baseline methods in detail. In Section 4, experiments and results are presented. Finally, we discuss the task and its evaluation in Section 5.

2 Data Review and Analysis

We present statistics about the dataset in this section: Section 2.1 reviews the current dataset and explores its potential, and Section 2.2 analyzes the query sessions given in the data.

Statistics about the Dataset In this section, we briefly review the current dataset of the PIR task and provide a comprehensive analysis. In PIR-CLEF 2018, the data are provided as six CSV files containing the following information:

– the search tasks (sessions) of ten users;
– the queries submitted by all users and all documents returned by the ClueWeb API;
– relevance scores labelled by the users and the original ranks of the documents;
– personal information such as gender and job;
– remarks written by the users;
– statistical information about the terms in the queries.

Statistics about Sessions A user can submit several queries in a query session, and these queries address different aspects of the user's objective. To capture this objective, we gather all the queries of a session into a single query, submit it to the API and evaluate the performance.

3 Methods

The users submit their queries to the ClueWeb API3 and annotate whether the returned documents are relevant. Relevance is divided into four grades: relevant, somewhat relevant, not relevant and off-topic, with scores ranging from four down to one. Following this grading, we define a document as relevant to a query only when its score is four. Figure 1 shows the framework of the personalized ranking part.

Fig. 1: Personalized Ranking Framework (diagram omitted; its components are the user U, query Q, query session QS, user model, ClueWeb API, query expansion, personalized query, personalized ranking, and the score Score(Q,U,D) for a test sample (Q,U)).

Baselines We propose two baselines: a query-level baseline and a session-level baseline. In the query-level baseline, we evaluate each query independently. In the session-level baseline, we collect the relevant documents of all queries in a session and treat them as the relevant documents of the search task, then evaluate the performance of each query on its search task. We further assume that the queries belonging to one session represent different aspects of the user's need, so we merge all queries of a session into one query and submit it to the ClueWeb API. The performance of each session baseline is reported in Table 1.

Query Expansion We first take the user's feedback into account. After the user labels the documents with relevance scores, we choose the ten most frequent words in the relevant documents as expansion terms and add them to the original session-level query; the new queries are submitted to the API and evaluated. In a second variant, we assume that the top 20 documents returned by the API are relevant and select the ten most frequent words in these documents as expansion terms. The first method is listed in Table 2 with the suffix "RF" (relevance feedback) and the second with the suffix "PRF" (pseudo relevance feedback); a sketch of both strategies is given below.
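To make the expansion procedure concrete, the following is a minimal Python sketch of the session-level query construction and of the RF/PRF expansion described above. It is illustrative only: the stop-word list, the tokenization and the function names (session_query, top_terms, expand_query) are our own assumptions, and the interaction with the ClueWeb API (see footnote 3) is omitted.

    from collections import Counter
    import re

    # Small illustrative stop-word list; the paper does not specify one.
    STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "that", "with"}

    def session_query(queries):
        """Merge all queries of a session into one query string,
        keeping each term once (the session-level 'Sum Up' query)."""
        seen, terms = set(), []
        for query in queries:
            for term in query.lower().split():
                if term not in seen:
                    seen.add(term)
                    terms.append(term)
        return " ".join(terms)

    def top_terms(documents, k=10):
        """Return the k most frequent non-stop-word terms over a list of document texts."""
        counts = Counter()
        for text in documents:
            counts.update(t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS)
        return [term for term, _ in counts.most_common(k)]

    def expand_query(base_query, returned_docs, relevance_scores=None, k=10, prf_depth=20):
        """RF: expand with frequent terms from documents the user graded 4 (relevant).
        PRF: without grades, assume the top prf_depth returned documents are relevant."""
        if relevance_scores:                               # relevance feedback ("RF")
            feedback = [d for d, s in zip(returned_docs, relevance_scores) if s == 4]
        else:                                              # pseudo relevance feedback ("PRF")
            feedback = returned_docs[:prf_depth]
        expansion = [t for t in top_terms(feedback, k) if t not in base_query.split()]
        return " ".join([base_query] + expansion)

For the RF variant, returned_docs and relevance_scores would come from the user's annotated result list; for the PRF variant, expand_query is called without scores so that the top 20 returned documents are used as the feedback set.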
3 http://clueweb.adaptcentre.ie/WebSearcher/search?query=queryString&page=pagenumber

Topic-Sensitive User Model We propose a language modelling approach to personalized search based on the users' search behaviour and preferences. To capture the user's search interests and implicit purpose, we use an LDA-based approach to model each user; it does not merely simulate the search behaviour but also takes the search sessions of the task into account.

4 Experiments and Results

4.1 Performance of Each Query Session

Table 1 shows the performance of each query session; there are 14 session IDs in total. "Sum Up" denotes the performance of the single query generated from all queries in a session, and "Single Query" denotes the mean performance of the individual queries in a session. From this table we observe that performance varies widely across sessions. The best result is obtained for session 156, whose query category is "Travel". For this session the user gives higher relevance scores (the score denotes the relevance of a document to the topic: 1 off-topic, 2 not relevant, 3 somewhat relevant, 4 relevant), and describes the task as follows: "The relevant documents are documents that list as many historical and popular places in venice. I don't want to see other documents that talk about other related places. In addition to that I am not interested about accomodation during my search."

Table 1: Performance of Each Query Session

                        Sum Up                             Single Query
Session ID   MAP     NDCG    P@5     P@10       MAP     NDCG    P@5     P@10
107          0.1581  0.5546  0.2000  0.3000     0.1069  0.2241  0.4222  0.3666
154          0.0000  0.0000  0.0000  0.0000     0.0000  0.0000  0.0000  0.0000
156          0.7588  0.9294  1.0000  1.0000     0.3878  0.5295  0.5000  0.5000
161          0.1321  0.4625  0.2000  0.3000     0.0839  0.1985  0.2800  0.2600
162          0.1621  0.5695  0.2000  0.3000     0.0702  0.2152  0.3333  0.3667
172          1.0000  1.0000  0.2000  0.1000     0.2000  0.2000  0.0400  0.0200
173          0.0890  0.3940  0.4000  0.4000     0.0329  0.1035  0.2375  0.1625
175          0.0120  0.1947  0.0000  0.0000     0.0523  0.1751  0.1666  0.1666
176          0.0000  0.0000  0.0000  0.0000     0.0000  0.0000  0.0000  0.0000
201          0.0990  0.3851  0.2000  0.1000     0.0893  0.2334  0.2000  0.1400
202          0.3711  0.6418  0.2000  0.1000     0.3711  0.6418  0.2000  0.1000
203          0.4052  0.6232  0.6000  0.4000     0.4052  0.6232  0.6000  0.4000
204          0.0000  0.0000  0.0000  0.0000     0.0000  0.0000  0.0000  0.0000
205          0.0060  0.1706  0.0000  0.0000     0.1145  0.1448  0.0800  0.1000
mean         0.2281  0.4232  0.2285  0.2142     0.1367  0.2349  0.2185  0.1844

4.2 Performance Comparison

We further compare the baselines with our methods in Table 2. In the query-level evaluation, our methods perform worse than the baseline. In the session-level evaluation, the relevance feedback method outperforms the baseline because the users' interests can be obtained from the relevant documents; however, the pseudo relevance feedback and LDA methods perform worse than the baseline.

Table 2: Performance Comparison

Methods            MAP     NDCG    P@5     P@10    P@20
Query-Baseline     0.2216  0.3239  0.1922  0.1688  0.0929
Query-RF           0.1084  0.1908  0.1013  0.0000  0.1908
Query-PRF          0.1460  0.2300  0.1169  0.0935  0.0623
Session-Baseline   0.2283  0.4232  0.2286  0.2143  0.1619
Session-RF         0.2743  0.4251  0.4000  0.2714  0.2190
Session-PRF        0.0695  0.1549  0.1286  0.1071  0.0929
Session-LDA        0.1549  0.3005  0.1714  0.1786  0.1357

5 Discussion

It is within our expectation that all query-level methods perform worse than the query-level baseline. We think this is caused by the lack of relevant documents: in the provided data, each user labels only about twenty documents whose original ranks range from 0 to 100, and few of them are relevant.
Assuming that each user receives 100 documents per query, roughly eighty percent of the relevance information is lost. In this scenario, any document not occurring in the labelled list is treated as irrelevant, which means that a document the user would actually consider relevant may be counted as irrelevant. Insufficient relevance information even makes it hard to evaluate certain queries: in sessions 154, 176 and 204 all labelled documents are irrelevant, so even if we retrieve potentially relevant documents we cannot know whether the users would judge them relevant.

We also think the evaluation process should be improved. Unlike existing search tasks such as the TREC tracks, personalized information retrieval focuses more on individual differences. In PIR-CLEF 2018, some users receive the same task but submit different queries; for example, users 8, 11 and 12 all receive the travel task, but their queries are about Dublin, Tokyo and Barcelona. The individual differences are expressed only through the queries, so we consider this task to still be an ad-hoc retrieval task. If the focus is to be on individual differences, more users need to take part in the data collection, and we suggest that the complete logs of the users be provided.

By analyzing the relevant documents annotated by the users, we obtain an improvement at the session level, as listed in Table 2. However, our method can still be improved. In this task we are provided with user actions such as opening a document and submitting a query, but these data are not sufficient: only part of the actions are provided, so we cannot analyze the users' preferences from their actions.

In conclusion, we put forward three suggestions. First, more complete relevance labels should be provided. Second, more participants should join the data collection to provide more personalized data. Third, detailed user actions could help improve performance.

6 Conclusions

We have proposed a view of the PIR task which implies that personalization should be performed with respect not only to the context but also to the various kinds of information that people use during the course of an information search session. We focus on taking the user's feedback into account and propose two extended models: a query expansion method and a topic-sensitive user model. We first conduct experiments on each query session, and the results show that performance varies widely across sessions. We then compare the baselines with the extended models. Noting that the topic-sensitive strategy does not work well, we conclude that insufficient relevance information has a negative impact both on the performance of the models and on the evaluation process. In future work we will extract more useful features and focus on learning-to-rank approaches.

References

1. E. Ali. Dynamic personalized ranking of facets for exploratory search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1379–1379. ACM, 2017.
2. Z. Carevic, M. Lusky, W. van Hoek, and P. Mayr. Investigating exploratory search activities based on the stratagem level in digital libraries. International Journal on Digital Libraries, pages 1–21, 2017.
3. Z. Dou, R. Song, and J. R. Wen. A large-scale evaluation and analysis of personalized search strategies. In International Conference on World Wide Web, pages 581–590, 2007.
4. W. Guo, S. Wu, L. Wang, and T. Tan. Multiple Attribute Aware Personalized Ranking. Springer International Publishing, 2015.
5. W. Guo, S. Wu, L. Wang, and T. Tan.
Personalized ranking with pairwise factorization machines. Neurocomputing, 214(C):191–200, 2016.
6. M. Mitsui, J. Liu, N. J. Belkin, and C. Shah. Predicting information seeking intentions from search behaviors. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1121–1124. ACM, 2017.
7. F. Qiu and J. Cho. Automatic identification of user interest for personalized search. In Proceedings of the 15th International Conference on World Wide Web, pages 727–736. ACM, 2006.
8. J. Teevan, S. T. Dumais, and E. Horvitz. Personalizing search via automated analysis of interests and activities. In ACM SIGIR Forum, volume 51, pages 10–17. ACM, 2018.
9. W. van Hoek and Z. Carevic. Building user groups based on a structural representation of user search sessions. In International Conference on Theory and Practice of Digital Libraries, pages 459–470. Springer, 2017.
10. G. H. Yang, X. Dong, J. Luo, and S. Zhang. Session search modeling by partially observable Markov decision process. Information Retrieval Journal, 21(1):56–80, 2018.