<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SOCIAL BOOK SEARCH TRACK: ISM@CLEF'16 SUGGESTION TASK</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ritesh Kumar</string-name>
          <email>ritesh4rmrvs@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guggilla Bhanodai</string-name>
          <email>bhanodaig@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajendra Pamula</string-name>
          <email>rajendrapamula@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Indian School of Mines Dhanbad</institution>
          ,
          <addr-line>826004</addr-line>
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes the work that we did at the Indian School of Mines for the Social Book Search Track at CLEF 2016. As required by CLEF 2016, we submitted six runs to its Suggestion Task. In our runs we investigated the individual effects of the title, group and request fields of the topics, as well as the combined effect of the title, request and group fields. For all the runs we used the language modeling technique with Dirichlet smoothing. The run using the combined title, request and group fields was our best. Overall, our performance is reasonable but needs improvement; our scores are encouraging enough to work towards better results in the future.</p>
      </abstract>
      <kwd-group>
        <kwd>Book Search</kwd>
        <kwd>Social Book Search</kwd>
        <kwd>Language modeling</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>re-ranking</kwd>
        <kwd>Normalization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the growing number of online portals and book catalogues, the way we acquire, share and use books is evolving rapidly. To enable users to search for relevant books, the Social Book Search Track at CLEF [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] provides an experimental platform for investigating techniques for searching and navigating professional metadata, provided by publishers and booksellers, together with user-generated content from social media [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. At the CLEF 2016 Social Book Search Lab, three different tracks were offered: the Suggestion Track, the Interactive Track and the Mining Track. We participated in the Suggestion Track, where the task is to recommend books based on a user's request and her personal catalogue data (the list of books, with ratings and tags, maintained for the user on the social cataloguing site). We were also provided with a large set of anonymised user profiles from LibraryThing forum members, consisting of 93,976 profiles with over 33 million cataloguing transactions. Each user request is provided in the form of a topic containing different fields such as title, request, group, examples and catalogue information.
      </p>
      <p>Our goal is to investigate the contribution of different topic fields, as well as the combined effect of some fields, for book recommendation. We considered only the title, request and group fields from each topic. We did not consider the topic creator's catalogue information, nor did we consult the user profiles.</p>
      <p>
        We submitted six runs (ISMD16allfields, ISMD16titlefield, ISMD16requestfield,
ISMD16titlewithoutreranking, similaritytitlefieldreranked, ISMD16groupfield) to
the Suggestion Task. For all the runs, language modeling with Dirichlet
smoothing was used in Lemur's Indri search system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The rest of the paper is organized as follows. Section 2 describes the
dataset. In Section 3 we describe our methodology: the field categories and indexing, and which
document and topic fields we used for retrieval. Section 4 describes the
approaches we used, Section 5 reports and analyses the results, and finally we
conclude in Section 6 with directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>
        The test collection provided by the CLEF 2016 SBS organizers for the Suggestion Task
consists of a document collection and a topic set. The document collection contains
2.8 million book descriptions with metadata from Amazon and LibraryThing. From
Amazon there is formal metadata such as book title, author, publisher, publication
year, library classification codes, Amazon categories and similar-product information,
as well as user-generated content in the form of user ratings and reviews. From LibraryThing,
there are user tags and user-provided metadata on awards, book characters,
locations and blurbs. There are additional records from the British Library and
the Library of Congress. The entire collection is 7.1 GB in size [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The topic set contains 120 topics, each describing a user's request for
book suggestions. Each topic has a set of fields such as title, request, group, example
and the user's personal catalogue at the time of topic creation. The catalogue
contains a list of book entries with information such as the LibraryThing id of the book,
its entry date, rating and tags.</p>
      <p>The organizers also supplied about 94,000 anonymised user profiles from
LibraryThing.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>3.1 Field categories and Indexing</p>
      <p>We were provided with the Amazon/LibraryThing data collection (corpus), which
consists of 2.8 million book descriptions with metadata. The corpus contains many fields;
we selected some of them for indexing, as follows (a small extraction sketch is given after this list):</p>
      <p>Metadata In our metadata index, we used these metadata fields: &lt;title&gt;,
&lt;creator&gt;, &lt;firstwords&gt;, &lt;lastwords&gt;.</p>
      <p>Content In our content index, we used the &lt;content&gt; field of the
provided corpus, containing &lt;blurbs&gt;, &lt;epigraph&gt; and &lt;quotation&gt;.</p>
      <p>Tags In our tags index, we used the &lt;tags&gt; field for indexing.</p>
      <p>Reviews In our reviews index, we used the &lt;reviews&gt; field from the corpus.</p>
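      <p>To make the field grouping concrete, the following Python sketch (our own illustration, not part of the indexing toolkit) gathers the text of each field group from one book record. The element names follow the lists above, but the exact layout of the Amazon/LibraryThing XML may differ.</p>
      <preformat>
import xml.etree.ElementTree as ET

# Field groups used for our four indexes, as listed above.
FIELD_GROUPS = {
    "metadata": ["title", "creator", "firstwords", "lastwords"],
    "content":  ["content", "blurbs", "epigraph", "quotation"],
    "tags":     ["tags"],
    "reviews":  ["reviews"],
}

def extract_index_fields(book_xml: str) -> dict:
    """Return {index_name: concatenated text} for one book record."""
    root = ET.fromstring(book_xml)
    out = {}
    for index_name, field_names in FIELD_GROUPS.items():
        pieces = []
        for name in field_names:
            # itertext() flattens nested elements (e.g. the individual reviews)
            pieces.extend("".join(el.itertext()).strip() for el in root.iter(name))
        out[index_name] = " ".join(p for p in pieces if p)
    return out
      </preformat>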
      <p>3.2 Topics</p>
      <p>This year's Suggestion Task provided 120 topics. With the help of these we built
four sets of queries (illustrated in the sketch after this list), which are:</p>
      <p>Topic-Title: Only the &lt;title&gt; field of each topic.</p>
      <p>Topic-Request: It contains only the &lt;request&gt; field.</p>
      <p>Topic-Group: Only the &lt;group&gt; field.</p>
      <p>Topic-All-Fields: It contains the &lt;title&gt;, &lt;request&gt; and &lt;group&gt; fields.</p>
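      <p>The sketch below shows how the four query sets can be assembled from the topic file. It assumes each topic is a topic element with an id attribute and title, request and group children; these element and attribute names are assumptions based on the field names above, and the function is our own illustration.</p>
      <preformat>
import xml.etree.ElementTree as ET

def build_query_sets(topics_xml_path: str) -> dict:
    """Return {query_set: {topic_id: free-text query}} for the four query sets."""
    queries = {"title": {}, "request": {}, "group": {}, "all-fields": {}}
    for topic in ET.parse(topics_xml_path).getroot().iter("topic"):
        tid = topic.get("id")
        title = (topic.findtext("title") or "").strip()
        request = (topic.findtext("request") or "").strip()
        group = (topic.findtext("group") or "").strip()
        queries["title"][tid] = title
        queries["request"][tid] = request
        queries["group"][tid] = group
        # combined query for the Topic-All-Fields set
        queries["all-fields"][tid] = " ".join(t for t in (title, request, group) if t)
    return queries
      </preformat>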
    </sec>
    <sec id="sec-4">
      <title>Approach</title>
      <p>In our approach we analyzed two methods: first, content-based retrieval, and second,
a re-ranking approach applied after rank normalization of the scores of the
retrieved documents. For both retrieval approaches we used language
modeling with Dirichlet smoothing. Stopwords were removed from the provided document
collection using the SMART stop word list, and the collection was then stemmed with the
Krovetz stemmer. We did not remove stopwords from the provided topics. For indexing and
retrieval we used the Lemur 5.9 search system. We also removed punctuation marks
from all the textual content of these fields and used only free-text queries in all
the runs. We did not consider any other information, such as catalogue information
and user profiles, during retrieval. For each topic, we submitted up to 1000 book
suggestions in the form of ISBNs.</p>
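      <p>For reference, the query-likelihood scoring with Dirichlet smoothing that Indri applies can be sketched as below. Indri computes this internally, so the toy function (with term-frequency dictionaries and an illustrative smoothing parameter mu) is only meant to make the scoring formula explicit, not to reproduce our runs.</p>
      <preformat>
import math

def dirichlet_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, mu=2500):
    """log P(query | document) under a Dirichlet-smoothed language model."""
    score = 0.0
    for term in query_terms:
        p_coll = coll_tf.get(term, 0) / coll_len            # collection model
        p_doc = (doc_tf.get(term, 0) + mu * p_coll) / (doc_len + mu)
        if p_doc > 0:
            score += math.log(p_doc)
    return score
      </preformat>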
      <p>4.1 Content Based Retrieval</p>
      <p>
        During retrieval, we tried to see the effect of each component of a topic
one by one, as well as the combined contribution of all topic fields except the &lt;example&gt;
field. This is simply ad hoc retrieval; the results are given in Table 1.
      </p>
      <p>
        Our re-ranking method is inspired by the Social Feature Re-ranking Method proposed
by Toine Bogers in 2012 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In order to improve the initial ranking, we
perform re-ranking with two different strategies derived from the structure of the XML:
Item-Rerank (I) and RatingReview-Rerank (R). For re-ranking we used the
following stages:
      </p>
      <p>Similarity Calculation: The similarity of two documents i and j based on feature I
is calculated by equation (1):</p>
      <p>sim_{ij}(I) = 1 if i is j's similar product or j is i's similar product, and sim_{ij}(I) = 0 otherwise.   (1)</p>
      <p>score'(i) = \lambda \cdot score(i) + (1 - \lambda) \sum_{j=1,\, j \neq i}^{N} sim_{ij} \cdot score(j)   (2)</p>
      <p>Re-Ranking: We re-rank the top-1000 list of the initial ranking for the
above-mentioned features using Equation (2).</p>
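      <p>A minimal sketch of the Item-Rerank step defined by equations (1) and (2) is given below. It assumes a dictionary scores of normalized retrieval scores per ISBN and a dictionary similar mapping each ISBN to the set of ISBNs listed as its similar products; the names and the dictionary representation are our own.</p>
      <preformat>
def item_rerank(scores: dict, similar: dict, lam: float = 0.96) -> dict:
    """Re-score books with equations (1) and (2); scores are already in [0, 1]."""
    def sim(i, j):
        # equation (1): 1 if either book lists the other as a similar product
        return 1.0 if j in similar.get(i, set()) or i in similar.get(j, set()) else 0.0

    reranked = {}
    for i, s_i in scores.items():
        neighbour_sum = sum(sim(i, j) * s_j for j, s_j in scores.items() if j != i)
        reranked[i] = lam * s_i + (1.0 - lam) * neighbour_sum   # equation (2)
    return reranked
      </preformat>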
      <p>
        For feature R, we use Equation (3) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
      </p>
      <p>score'(i) = \lambda \cdot score(i) + (1 - \lambda) \cdot \log(|reviews(i)|) \cdot \frac{\sum_{r \in R_i} r}{|reviews(i)|} \cdot score(i)   (3)</p>
      <p>
        Before re-ranking we apply rank normalization to the retrieved results to map
the scores into the range [0, 1] [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The balance between the original retrieval score,
score(i), and the contributions of the other books in the result list is controlled
by the parameter λ, which takes values in the range [0, 1]; in our experiments
we used the fixed value λ = 0.96. Due to lack of time, we could not try any
other value.
      </p>
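      <p>Putting the normalization and the RatingReview-Rerank together, the sketch below first maps scores into [0, 1] and then applies equation (3) with λ = 0.96. The choice of min-max normalization and the ratings dictionary (ISBN to list of review ratings) are illustrative assumptions.</p>
      <preformat>
import math

def normalize(scores: dict) -> dict:
    """Map raw retrieval scores into [0, 1] (min-max normalization, our assumption)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {i: (s - lo) / span for i, s in scores.items()}

def review_rerank(scores: dict, ratings: dict, lam: float = 0.96) -> dict:
    """RatingReview-Rerank, equation (3): boost books with many, highly rated reviews."""
    reranked = {}
    for i, s_i in scores.items():
        revs = ratings.get(i, [])
        boost = math.log(len(revs)) * (sum(revs) / len(revs)) * s_i if revs else 0.0
        reranked[i] = lam * s_i + (1.0 - lam) * boost
    return reranked
      </preformat>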
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>
        The scores obtained by our six runs are given in Table 2. The official evaluation
measure used by CLEF'16 is nDCG@10 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The runs are listed in order of their official rank. Our best run is
ISMD16allfields, in which we use the title, request and group fields together. For the sake of
comparison, we also show the best score in the task, achieved by the run
run1.keyQuery active combineRerank(*).
      </p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Official rank and nDCG@10 score of our six runs, together with the best run in the task.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Run</th>
              <th>Rank</th>
              <th>nDCG@10</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>ISMD16allfields</td><td>24</td><td>0.1722</td></tr>
            <tr><td>ISMD16titlefield</td><td>28</td><td>0.1197</td></tr>
            <tr><td>ISMD16requestfield</td><td>29</td><td>0.1454</td></tr>
            <tr><td>ISMD16titlewithoutreranking</td><td>33</td><td>0.1114</td></tr>
            <tr><td>similaritytitlefieldreranked</td><td>35</td><td>0.0966</td></tr>
            <tr><td>ISMD16groupfield</td><td>43</td><td>0.0527</td></tr>
            <tr><td>best*</td><td>1</td><td>0.5247</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Although our performance is not up to the mark, there are a few take-home lessons.
In the runs ISMD16allfields, ISMD16titlefield, ISMD16requestfield and
ISMD16groupfield, we re-ranked the retrieved scores based on reviews (R),
taking λ = 0.96.</p>
      <p>In our top-scoring run, ISMD16allfields, we took the combination of the title, request
and group fields of the topic (all fields except the example field). In ISMD16titlefield
we took only the title field, in ISMD16requestfield only the request field of the topic,
and in ISMD16groupfield only the group field. For the run
ISMD16titlewithoutreranking we simply used content-based retrieval.
For the run similaritytitlefieldreranked we used the similarity feature as well as
re-ranking, taking λ = 0.96.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        This year we participated in the Suggestion Task of Social Book Search. We tried
to see the individual effects as well as the combined effect of different topic fields on
book recommendation. We considered only a handful of fields, such as request,
title and group, from the topics. While there is no denying that our
overall performance is average, the initial results suggest what should
be done next. We need to consult other fields, such as the book catalogue of the topic
creators and the ratings of the books in the catalogue, during retrieval. We also need to
take into account the profiles of other users. It is also worth investigating learning
to rank for the different fields, and tuning the parameter λ over the range [0, 1];
this time we took the fixed value λ = 0.96. We will also use other fields from user
catalogues and user profiles. We shall be exploring some of these tasks in the
coming days.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Marijn</given-names>
            <surname>Koolen</surname>
          </string-name>
          , Gabriella Kazai, Jaap Kamps, Michael Preminger,
          <article-title>Antoine Doucet and Monica Landoni, Overview of the INEX 2012 Social Book Search Track</article-title>
          . INEX'12 Workshop Pre-proceedings, Shlomo Geva, Jaap Kamps, Ralf Schenkel (editors),
          <source>September 17-20</source>
          ,
          <year>2012</year>
          , Rome , Italy.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. INEX,
          <article-title>Initiative for the Evaluation of XML Retrieval</article-title>
          . https://inex.mmci.uni-saarland.de/data/documentcollection.jsp
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. INDRI:
          <article-title>Language modeling meets inference networks</article-title>
          , Available at http://www.lemurproject.org/indri/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jarvelin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kekalainen</surname>
          </string-name>
          , J.:
          <article-title>Cumulated Gain-based Evaluation of IR Techniques</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          <volume>20</volume>
          (
          <issue>4</issue>
          ) (
          <year>2002</year>
          )
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. CLEF,
          <article-title>Conference and labs of the Evaluation Forum</article-title>
          . http://clef2016.clef-initiative.eu/index.php
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>T.</given-names>
            <surname>Bogers</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          . RSLIS at INEX 2012:
          <article-title>Social book search track</article-title>
          .
          <source>In INEX'12 Workshop</source>
          Pre-proceedings, pages
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>L.</given-names>
            <surname>Bonnefoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Bellot</surname>
          </string-name>
          .
          <article-title>Do social information help book search</article-title>
          ? In INEX'12 Workshop Pre-proceedings, pages
          <fpage>109</fpage>
          -
          <lpage>113</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Renda</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straccia</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          : Web Metasearch:
          <article-title>Rank vs. Score-based Rank Aggregation Methods</article-title>
          .
          <source>In: SAC 03: Proceedings of the 2003 ACM Symposium on Applied Computing</source>
          , New York, NY, USA, ACM (
          <year>2003</year>
          )
          <fpage>841</fpage>
          -
          <lpage>846</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>