LIG at CLEF 2015 SBS Lab

Nawal Ould-Amer1, Philippe Mulhem1, and Mathias Géry2

1 Univ. Grenoble Alpes, LIG, F-38000 Grenoble, France
CNRS, LIG, F-38000 Grenoble, France
2 Université de Lyon, F-42023 Saint-Étienne, France
CNRS, UMR 5516, Laboratoire Hubert Curien, F-42000 Saint-Étienne, France
Université de Saint-Étienne, Jean-Monnet, F-42000 Saint-Étienne, France
{Nawal.Ould-Amer, Philippe.Mulhem}@imag.fr, mathias.gery@univ-st-etienne.fr

Abstract. This paper describes the work achieved by the MRIM research group of Grenoble, using some data from the LaHC of Saint-Étienne, to test personalized retrieval of books for the Social Book Search Lab of CLEF 2015. Our proposal relies on a biased fusion of content-only retrieval, using the BM25F and LGD retrieval models, a non-social user profile based on the catalog of the requester, and social profiles using user/user links generated from their catalogs and ratings on books. The official results obtained show a clear positive impact of the user profile, and a small positive impact of the social elements we used. Post-official results obtained with unbiased fusion scores are also presented.

Keywords: Fusion of scores, user profile, social links

1 Introduction

This paper describes our participation in the INEX Social Book Search Suggestion Track challenge. The goal of this challenge is to evaluate approaches for supporting users in searching collections of books based on book metadata and associated user-generated content [4]. The work described here focuses on several aspects of personalized information retrieval that integrate social network information. Our objectives during this participation were twofold: a) to rely as much as possible on existing Information Retrieval systems to handle non-social and social profiles, and b) to provide a simple integration of the three elements (i.e., content, non-social profiles, social profiles) through a linear combination of scores. Relying heavily on existing, tested IR tools allows us to focus on experimenting with ideas. Proposing simple score fusions allows us to analyze more easily how configurations behave. More precisely, our experiments conducted for SBS 2015 focus on:

– Studying the impact of biased linear fusions of scores compared to classical weighted linear fusions;
– Studying the impact of fusing several content-only results;
– Studying the impact of using a simple user profile as a query extension (non-social profile);
– Studying the impact of generated friend relations on the quality of the results (social profile).

From the data provided by the SBS 2015 dataset, the following elements were used at one time or another:

– the fields title, summary, content and tags from the documents: all concatenated for unstructured retrieval, and kept separate for field-based retrieval using BM25F;
– the fields title, mediated query and narrative from the topics;
– the documents and ratings from the "topic users" (a topic user is the description of the user that asks a query): used to compute "friendship" relationships between users;
– the documents and ratings from the profiles of the non-topic users: used to compute "friendship" relationships between users.

The IR processes were run with the Terrier system [5] (http://www.terrier.org). Section 2 describes the score fusion that we exploit; one original point relates to the biases that we propose. Section 3 tackles multiple content-only matching for the documents, as we found out that such integration is beneficial.
Then, we introduce in section 4 the use of non-social profiles, and we detail how we defined friendship relations between users, using their catalogs and ratings, as well as the way we used the profiles of such friends when processing queries. Additional processing must be applied to the SBS data to obtain results; we discuss some of these elements in section 5, before presenting in section 6 the official results, as well as some non-official ones, and concluding in section 7.

2 Biased linear fusions of scores

Fusing the scores of several IR systems is a nontrivial problem. In our case, as described in the introduction, we propose to use biased linear fusions of scores, as an extension of the "Zero-one" normalization used by Wu, Crestani and Bi in [8]. Let us focus on the fusion of two result lists $L_1$ and $L_2$, composed of (doc, rsv) couples. To be realistic, $L_1$ and $L_2$ are limited to the top n results. Assume that a document d has a score value $score(d, L_i)$ in $L_i$, with $i \in \{1, 2\}$, and is at rank $rank(d, L_i)$ in $L_i$; that $b_i$ is the bias of $L_i$; and that $h_i$ is the horizon (a rank position) beyond which we do not consider documents in the result list $L_i$. The normalized score of d in $L_i$ is then:

$$
f(d, L_i) =
\begin{cases}
\left(1 - \dfrac{vmax(L_i) - score(d, L_i)}{vmax(L_i) - vmin(L_i)}\right) + b_i & \text{if } rank(d, L_i) \leq h_i \\
0 & \text{otherwise}
\end{cases}
$$

with $vmin$ and $vmax$ the minimal and maximal score values in a result list. Compared to the fitting used by [8], our idea is that we allow different search results to fit into different intervals. This is independent of the way the different result lists are combined afterward: it is a kind of "boost" that forces the final score values of a result list to lie in $\{0\} \cup [b_i, b_i + 1]$ (the value 0 denotes that the document does not occur in the top $h_i$ elements of the list). If we draw a parallel with the general fitting proposed by [8], our proposal allows an independent scaling for each fused list. The overall fusion then computes a weighted average of the normalized scores (COMB-sum from [7]), using a parameter α that denotes the relative importance of $L_1$ over $L_2$, and reranks the results according to the new fused scores. Compared to a usual weighted average, the difference here comes mainly from the $b_i$: assigning 0 to all $b_i$ leads to a usual weighted-average COMB-sum. We discuss the impact of such biases in the section dedicated to the experiments.

3 Fusion of content-only scores (run LIG 1)

In experiments conducted on the SBS 2014 dataset, we noticed that fusing several content-only runs had a positive impact, with a relative nDCG@10 improvement larger than 10%. That is why we propose to fuse one result list coming from a BM25F run [6] (parameter values taken from [3]) and one coming from a log-logistic (LGD) model [2]. A grid optimization of the parameters on the SBS 2014 data led to the parameters used for the official run tagged LIG 1, described in Table 1. The fusion score is computed as follows:

$$
RSV_{LIG1}(Q, d) = \alpha_{BM25F}(Score_{BM25F}(Q, d) + b_{BM25F}) + \alpha_{LGD}(Score_{LGD}(Q, d) + b_{LGD}) \quad (1)
$$

where $\alpha_{BM25F}$ and $\alpha_{LGD}$ are the relative importance of the BM25F and LGD scores respectively, and $b_{BM25F}$ and $b_{LGD}$ are the biases of each result list. The normalized scores use a horizon h of 1000 documents.

Table 1. Parameters for the content-only run LIG 1

          α     b     h
BM25F     0.4   0.5   1000
LGD       0.6   0.4   1000
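The following Python snippet is a minimal sketch, with names of our own (biased_normalize, fuse are not part of any SBS or Terrier tooling), of the biased "Zero-one" normalization of section 2 and of the COMB-sum fusion of equation (1); the official runs were produced from Terrier result lists, not with this code.

```python
def biased_normalize(results, bias, horizon=1000):
    """f(d, L_i): zero-one normalization of a ranked list of (doc_id, score)
    pairs, shifted by the bias b_i; documents ranked beyond the horizon h_i
    are dropped (they implicitly get a score of 0 in the fusion)."""
    kept = results[:horizon]
    scores = [s for _, s in kept]
    vmin, vmax = min(scores), max(scores)
    span = (vmax - vmin) or 1.0                    # guard against constant scores
    return {d: (s - vmin) / span + bias for d, s in kept}

def fuse(lists, alphas, biases, horizon=1000):
    """Biased COMB-sum: weighted sum of the biased normalized scores of several
    result lists; a document absent from a list contributes 0 for that list."""
    normalized = [biased_normalize(l, b, horizon) for l, b in zip(lists, biases)]
    docs = set().union(*normalized)
    fused = {d: sum(a * n.get(d, 0.0) for a, n in zip(alphas, normalized))
             for d in docs}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

# With the LIG 1 parameters of Table 1 (bm25f_results and lgd_results being
# hypothetical ranked lists of (doc_id, score) pairs):
# run_lig1 = fuse([bm25f_results, lgd_results], alphas=[0.4, 0.6], biases=[0.5, 0.4])
```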
4 Personalized IR exploiting profiles

4.1 Non-social user profile (run LIG 2)

What we depict here as "non-social" corresponds to individual user data. In our case, these data refer to the catalog of the user. We assume that $Cat_u$ denotes the catalog (list of books) of a given user u (from the corpus U of users). To construct a user profile, we take inspiration from Cai and Li, who consider each user profile as a vector of tags and use an L1-normalized term frequency (NTF) to denote the preference degree of a user for a tag [1]. Similarly, we describe the profile $Prof_u$ of a user u as a weighted vector based on $Cat_u$, where each term is weighted by its NTF. To keep only the major interests of a user, we consider only the top n terms according to their weights; in our runs, we keep the top n = 100 terms of a profile. Such a profile is used as an expansion of the initial query. To reflect the relative importance of a term in a profile, we define a function Exp(Q, u) that expands the query by a fixed number of term occurrences, the relative importance of each profile term being reflected by the corresponding number of occurrences of this term in the expanded query. For instance, if the weight of the term t in a user profile accounts for 40% of the term occurrences of the profile, and if we fix the number of occurrences added to the query to 100, then the query is expanded by 0.4 * 100 = 40 occurrences of the term t. A BM25 retrieval with the expanded query is then run on the document corpus. We noticed on SBS 2014 data that such an expansion does not provide good results by itself, but that fusing the results of such expanded queries with the usual content-only results leads to better results. That is why we experimented with such a fusion, using the parameters defined in Table 2, where NSProf denotes the parameters related to the non-social profile list. The overall score for LIG 2 is:

$$
RSV_{LIG2}(Q, d, u) = \alpha_{BM25F}(Score_{BM25F}(Q, d) + b_{BM25F}) + \alpha_{LGD}(Score_{LGD}(Q, d) + b_{LGD}) + \alpha_{NSProf}(Score_{BM25}(Exp(Q, u), d) + b_{NSProf}) \quad (2)
$$

where $\alpha_{BM25F}$, $\alpha_{LGD}$ and $\alpha_{NSProf}$ are the relative importance of the BM25F result list, the LGD result list and the non-social profile list respectively, and $b_{BM25F}$, $b_{LGD}$ and $b_{NSProf}$ are the respective biases of each result list. Moreover, the normalized scores use a horizon h of 1000, as presented in Table 2.

Table 2. Parameters for the content + non-social profile run LIG 2

          α     b     h
BM25F     0.4   0.5   1000
LGD       0.5   0.5   1000
NSProf    0.1   0.5   1000

4.2 Friendship link generation

For the social user profile, we choose to generate "friendship" links between the topic users and the non-topic users provided by SBS. To achieve that, we assume that what makes (topic or non-topic) users similar to each other is their catalog and the ratings they provide. We then represent each non-topic user as a text document corresponding to the concatenation of the document ids of the user catalog. We include the ratings (integer values) by using the rating of a book as the tf value, i.e., the number of occurrences of its document id. To find the non-topic users similar to the topic users, we describe the topic users in the same way as the non-topic users described above. We then use the topic-user descriptions as queries on the corpus of non-topic users, using a classical BM25 matching.
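As an illustration of the two profile-related building blocks above, the following Python sketch shows the NTF-based profile and query expansion Exp(Q, u) of section 4.1 and the user-as-document representation of section 4.2, where ratings act as term frequencies. All names (build_profile, expand_query, user_as_document) are ours and only illustrate the idea; the actual indexing and matching was done with Terrier.

```python
from collections import Counter

def build_profile(catalog_texts, n_terms=100):
    """Non-social profile of a user: top-n terms of the catalog, weighted by
    L1-normalized term frequency (NTF), so the kept weights sum to 1."""
    counts = Counter()
    for text in catalog_texts:                # one text per book of the catalog
        counts.update(text.lower().split())
    top = counts.most_common(n_terms)
    total = sum(freq for _, freq in top)
    return {term: freq / total for term, freq in top}

def expand_query(query, profile, n_occurrences=100):
    """Exp(Q, u): append each profile term a number of times proportional to its
    NTF weight (e.g. a weight of 0.4 with n_occurrences=100 adds 40 occurrences)."""
    added = []
    for term, weight in profile.items():
        added.extend([term] * round(weight * n_occurrences))
    return query + " " + " ".join(added)

def user_as_document(ratings):
    """Representation of a user for friendship generation: each book id of the
    catalog is repeated as many times as its integer rating (rating used as tf)."""
    return " ".join(" ".join([book_id] * rating)
                    for book_id, rating in ratings.items())

# A topic user encoded with user_as_document() is then used as a BM25 query
# against the corpus of non-topic users encoded in the same way.
```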
For this first experiment, we filter the relationships to keep, for each topic user, the top 2 most similar non-topic users; we plan to experiment with the top k similar users in future work.

4.3 Usage of "friends"

Once the 2 closest friends of a topic user are obtained, we apply a process similar to the one of section 4.1 to generate the non-social profiles of these friends, and we then match the topic query expanded with the friends' profiles against the documents. The matching is computed as follows:

$$
RSV_{LIG3}(Q, d, u) = \alpha_{BM25F}(Score_{BM25F}(Q, d) + b_{BM25F}) + \alpha_{LGD}(Score_{LGD}(Q, d) + b_{LGD}) + \alpha_{NSProf}(Score_{Prof}(Q, d, u) + b_{NSProf}) + \alpha_{Fri1}(Score_{BM25}(Exp(Q, Fri1), d) + b_{Fri1}) + \alpha_{Fri2}(Score_{BM25}(Exp(Q, Fri2), d) + b_{Fri2}) \quad (3)
$$

with $Fri1$ and $Fri2$ the two friends of u, and $Score_{Prof}(Q, d, u)$ the non-social profile score of equation (2). The fusion parameters used for the officially submitted run LIG 3 are given in Table 3.

Table 3. Parameters for the content + non-social profile + friends profiles run LIG 3

          α      b     h
BM25F     0.35   0.5   1000
LGD       0.45   0.5   1000
NSProf    0.1    0.6   1000
Fri1      0.05   0.6   1000
Fri2      0.05   0.5   1000

5 Documents given as "examples" (runs LIG 4, LIG 5 and LIG 6)

One important point to notice is that the post-processing of the obtained results has a dramatic impact on the final results. For instance, since the initial corpus ids (ISBNs) are not the ones on which the results are evaluated (LibraryThing ids), and because of the duplicates potentially generated, the translation is not obvious to handle. Our approach for such duplicate removal was the same as the one provided by the SBS organizers.

Additionally, for our SBS 2015 runs, we focused on integrating the user examples to post-process the queries. Our idea was that a user might be interested if he finds among the answers documents that he has read and appreciated, as this would be an indicator that the system is providing relevant documents to him. We instantiated this hypothesis in two ways:

Reranking: Achieve a reranking where the documents a user likes are boosted, and the documents he dislikes are removed from the result. After a "Zero-one" score normalization [8] of the overall score between 0 and 1, we add 1 to the score of the documents that the user likes and set the score of the documents that he dislikes to 0. This process is thus a post-processing step run after the fusion, and it produces our official run LIG 4. It is worth noting that, if several retrieved documents are liked by the user, their relative initial ranking is preserved;

Relevance Feedback: Define a relevance feedback, positive for the documents that the topic user likes, and negative for the documents he does not like. We applied such relevance feedback to the log-logistic (LGD) run for our official run LIG 5, and to both content runs, i.e., BM25F and log-logistic, for our official run LIG 6. The relevance feedback uses all the positive documents and selects the top 10 terms according to the default selection of Terrier [5].
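To make the reranking rule of run LIG 4 concrete, here is a minimal Python sketch of this post-processing step, under the assumption that the liked and disliked document ids of the topic user are available as sets; the function name and data layout are ours.

```python
def rerank_with_examples(results, liked, disliked):
    """Post-process a fused result list (list of (doc_id, score) pairs):
    zero-one normalize the scores, add 1 for liked documents (they move to the
    top while keeping their relative order) and set disliked documents to 0."""
    scores = [s for _, s in results]
    vmin, vmax = min(scores), max(scores)
    span = (vmax - vmin) or 1.0                  # guard against constant scores
    reranked = []
    for doc_id, score in results:
        norm = (score - vmin) / span             # "Zero-one" normalization [8]
        if doc_id in disliked:
            norm = 0.0                           # pushed out of the top of the list
        elif doc_id in liked:
            norm += 1.0                          # boosted above all other documents
        reranked.append((doc_id, norm))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)
```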
6 Results

We present here two elements. First, we list the official results obtained for our 6 runs officially submitted to SBS 2015. Second, we discuss additional results generated after the release of the SBS 2015 qrels, presenting the impact of the "biases" we used (see section 2) compared to "unbiased" results.

6.1 Official Results

We comment here mainly on the runs LIG 1, LIG 2 and LIG 3 of Table 4, which use neither reranking nor relevance feedback. We notice that the impact of the non-social user profile is clearly beneficial; however, the p-value of a two-sided paired Student t-test on LIG 1 versus LIG 2 equals 5.45%, so with a significance threshold of 5% this difference is not statistically significant. This value is even larger between LIG 1 and LIG 3. We notice a slight improvement of the nDCG@10 results when integrating the 2 best "friends", but our generation or usage of relationships between users does not seem to be effective enough. According to what we defined for our fusion (see section 2), we also notice that the weights assigned to the friends are very small (0.05); with higher relative values the results degrade. So we conclude for now that our proposal does not outperform the integration of non-social user information alone.

Table 4. Official results for the 6 LIG runs

Rank  Run    nDCG@10  MRR    MAP    R@1000  Profiles
6     LIG 3  0.098    0.189  0.069  0.514   yes
7     LIG 2  0.096    0.185  0.069  0.514   no
8     LIG 4  0.095    0.181  0.068  0.514   yes
13    LIG 5  0.093    0.179  0.067  0.515   yes
14    LIG 6  0.092    0.174  0.067  0.513   yes
15    LIG 1  0.090    0.173  0.063  0.508   no

As we see in Table 4, the runs with reranking or relevance feedback lower the quality of the results, but this is related to our interpretation of the catalogs and examples, which is incompatible with the interpretation of the SBS organizers. In fact, our interpretation was somewhat the exact opposite of what the SBS organizers decided (they chose that the catalog + examples must not be part of the result), which explains why these additional runs behave worse than our initial runs.

6.2 Impact of biases

We describe in Table 5 the impact of using the biases defined in section 2. To be fair with respect to the official results, we choose to only remove the biases from the configurations used for the official runs, and to compare the relative gain or loss (between parentheses) with respect to the corresponding biased runs of Table 4. In this table, the values are given with 3-digit precision, whereas the percentages are computed on 4-digit precision values. We notice that the effect of the biases is positive for all measures for the runs LIG 2 and LIG 3, and that they have almost no effect on the content-only run LIG 1. So, the effect of the biases seems to be more positive as more lists are fused.

Table 5. Unofficial results without bias

Run            nDCG@10         MRR             MAP             R@1000
LIG 1 no bias  0.090 (0.0 %)   0.173 (-1.2 %)  0.063 (0.0 %)   0.507 (-0.1 %)
LIG 2 no bias  0.096 (-0.8 %)  0.182 (-1.6 %)  0.068 (-1.0 %)  0.513 (-0.2 %)
LIG 3 no bias  0.097 (-0.7 %)  0.181 (-4.2 %)  0.068 (-3.0 %)  0.508 (-1.2 %)

7 Conclusion

We presented in this paper the experiments conducted for the participation of LIG in the SBS 2015 lab evaluation. Our main finding, regarding our integration of non-social and social profiles, is that the use of the non-social profile has a clear positive impact on the quality of the retrieval, whereas the integration of generated friendship relationships does not really increase the quality of the system. One important conclusion that we draw from the SBS experiments is that the post-processing of the results has a dramatic impact on their quality, and must therefore be carefully studied. The experiments reported here depict our first steps in grasping the complexity of personalized information retrieval in a social context, and further efforts will focus on refining and characterizing the numerous elements involved in such a retrieval process.
Acknowledgment

This work is supported by Région Rhône-Alpes through the ReSPIr project.

References

1. Cai, Y., Li, Q.: Personalized search by tag-based user profile and resource profile in collaborative tagging systems. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 969–978. CIKM '10, ACM, New York, NY, USA (2010)
2. Clinchant, S., Gaussier, E.: A Log-Logistic Model for Information Retrieval. In: 18th ACM Conference on Information and Knowledge Management, CIKM '09. vol. 14, pp. 5–25. Hong Kong, China (2009)
3. Hafsi, M., Géry, M., Beigbeder, M.: LaHC at INEX 2014: Social Book Search track. In: Working Notes for CLEF 2014 Conference. pp. 514–520 (2014)
4. Koolen, M., Bogers, T., Kamps, J., Kazai, G., Preminger, M.: Overview of the INEX 2014 Social Book Search track. In: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014. pp. 462–479 (2014)
5. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: A High Performance and Scalable Information Retrieval Platform. In: Proceedings of ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006) (2006)
6. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009)
7. Shaw, J.A., Fox, E.A.: Combination of multiple searches. In: The Second Text REtrieval Conference (TREC-2). pp. 243–252 (1994)
8. Wu, S., Crestani, F., Bi, Y.: Evaluating score normalization methods in data fusion. In: Ng, H., Leong, M.K., Kan, M.Y., Ji, D. (eds.) Information Retrieval Technology - Third Asia Information Retrieval Symposium, AIRS 2006. vol. 4182, pp. 642–648. Springer Berlin Heidelberg (2006)