Social Book Search: The Impact of Professional and User-Generated Content on Book Suggestions

Marijn Koolen¹  Jaap Kamps¹  Gabriella Kazai²
¹ University of Amsterdam, The Netherlands
² Microsoft Research, Cambridge, UK

ABSTRACT
The Web and social media give us access to a wealth of information, not only different in quantity but also in character—traditional descriptions from professionals are now supplemented with user-generated content. This challenges modern search systems based on the classical model of topical relevance and ad hoc search. We compare classical IR with social book search in the context of the LibraryThing discussion forums, where members ask for book suggestions. This paper is a compressed version of [2].

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process
General Terms: Experimentation, Measurement, Performance
Keywords: Book search, User-generated content, Evaluation

1. INTRODUCTION
The web gives access to a wealth of information that is different from traditional collections both in quantity and in character. Especially through social media, there is more subjective and opinionated data, which gives rise to different tasks in which users are looking not only for facts but also for views and interpretations, and which may require different notions of relevance. In this paper we look at how search has changed by directly comparing classical IR and social search in the context of the LibraryThing (LT) discussion forums, where members ask for book suggestions. We use a large collection of book descriptions from Amazon and LT, which contains both professional metadata and user-generated content (UGC), and compare book suggestions on the forum with Mechanical Turk judgements on topical relevance and recommendation for the evaluation of retrieval systems. Searchers not only consider the topical relevance of a book, but also care about how interesting, well-written, recent, fun, educational or popular it is. Such affective aspects may be mentioned in reviews, but Amazon, LT and many similar sites do not include UGC in the main search index. Our main research question is:

• How does social book search compare to traditional search tasks?

For this study, we set up the Social Search for Best Books (SB) task as part of the INEX 2011 Books and Social Search Track.¹ We want to find out whether the suggestions are complete and reliable enough for retrieval evaluation, and how social book search is related to traditional search tasks. We also want to know whether users prefer professional metadata or UGC for judging topical relevance and for recommendation, and how standard IR models cope with UGC.

¹ https://inex.mmci.uni-saarland.de/tracks/books/

2. SOCIAL SEARCH FOR BEST BOOKS
In this section we detail the collection and the LT forum topics.

Collection. The Amazon/LT collection [1] consists of 2.8 million book records from Amazon, identified by ISBN, extended with social metadata from LT, and marked up in XML. These records contain title information, Dewey classification codes and subject headings supplied by Amazon. The reviews and tags were limited to the first 50 reviews and 100 tags per book during crawling. The professional metadata is more evenly distributed than the UGC: books have a single classification code and most have one or two subject headings, although a small fraction has no professional metadata at all. Typical of UGC, popular books have many tags and reviews while many others have few or none. The median numbers of reviews and tags are 0 and 5 respectively; that is, the majority of books has no reviews but at least a handful of tags.
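To make the shape of these records concrete, the sketch below reads the main fields of a single book record. It is a minimal illustration only: the element names and the toy record are our assumptions, and the actual XML schema of the Amazon/LT collection may differ.

    # Minimal sketch of an Amazon/LT-style book record. The element names
    # are hypothetical; the collection's actual markup may differ.
    import xml.etree.ElementTree as ET

    record = ET.fromstring("""
    <book>
      <isbn>0141439513</isbn>
      <title>Pride and Prejudice</title>
      <dewey>823</dewey>
      <subject>England -- Fiction</subject>
      <review>A witty comedy of manners, still fresh today.</review>
      <tag count="132">classic</tag>
      <tag count="87">romance</tag>
    </book>""")

    # Professional metadata: one Dewey code, a few subject headings.
    print(record.findtext("dewey"), [s.text for s in record.findall("subject")])
    # UGC: zero or more reviews and tags, capped at 50 and 100 during crawling.
    print(len(record.findall("review")), "reviews,",
          len(record.findall("tag")), "tags")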
Topics. LibraryThing users discuss their books in forums dedicated to certain topics. Many topic threads are started with a request from a member for interesting, fun new books to read. Other members often reply with links to works catalogued on LT, which we connected to books in our collection through their ISBNs. These requests for recommendations are natural expressions of information needs over a large collection of online book records, and the book suggestions are human recommendations from members interested in the same topic. For the Social Search for Best Books task we selected a set of 211 topics, some focused on fiction and some on non-fiction books. For the Mechanical Turk experiment we focus on a subset of 24 topics.

MTurk Judgements. We compare the LT forum suggestions against traditional judgements of topical relevance, as well as against recommendation judgements. We set up an experiment on Amazon Mechanical Turk to obtain judgements on document pools based on top-10 pooling of the 22 runs submitted by the 4 participating groups. We designed a task that asks Mechanical Turk workers to judge the relevance of 10 books for a given book request. Apart from a question on topical relevance (Q1), we also asked whether the worker would recommend the book to the requester (Q3), and which part of the metadata—curated or user-generated—was more useful for determining topical relevance and for recommendation. We included quality assurance and control measures to deter spammers and sloppy workers. Averaged over workers, the agreement with the LT forum suggestions is 0.52.

3. SYSTEM-CENTERED ANALYSIS
We compare system rankings of the 22 official runs based on the forum suggestions and on the MTurk relevance judgements. The Kendall's τ system ranking correlation between the forum suggestions for the 211 topics and the MTurk judgements on the 24 topics is 0.36. This is not due to the difference between the 211 topics of the forum suggestions and the subset of 24 topics selected for MTurk, as the correlation between the forum suggestions of the 211 and 24 topic sets is τ = 0.90.
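As a concrete illustration of this comparison, the sketch below computes Kendall's τ between the rankings of the same runs induced by two judgement sets. The effectiveness scores are invented for illustration; only the procedure mirrors the comparison above.

    # Minimal sketch: Kendall's tau rank correlation between system rankings
    # induced by two judgement sets. The scores below are made up.
    from scipy.stats import kendalltau

    # Hypothetical effectiveness of five runs under each judgement set.
    score_forum = [0.25, 0.21, 0.18, 0.14, 0.09]  # LT forum suggestions
    score_mturk = [0.43, 0.58, 0.37, 0.61, 0.12]  # MTurk relevance judgements

    tau, p = kendalltau(score_forum, score_mturk)
    print(f"Kendall's tau = {tau:.2f} (p = {p:.3f})")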
One possible explanation for this low correlation is that the forum suggestions are highly incomplete. Most topics have few suggestions (the median is 7). If the suggestions are only a small fraction of all relevant books, good and bad systems alike will perform poorly, as the chance of ranking the few suggested books above the other relevant books is small. However, the highest MRR score among the 22 runs is 0.481, which means that on average, over the 211 topics, this system returns a suggested book in the top 2. If this occurred for only a few topics it could be ascribed to coincidence, but over 211 topics such a high average is unlikely to be due to chance. Based on this, we argue that the forum suggestions are relatively complete, but represent a different task from the ad hoc task modelled by the topical relevance judgements from MTurk. In [2] we also show that the forum suggestions behave differently from known-item topics.

Next, we created a number of our own runs to compare the forum suggestions against the MTurk judgements. For indexing we use Indri's language model, with Krovetz stemming, stopword removal and default smoothing (Dirichlet, µ = 2,500). The titles of the forum topics are used as queries. In our base index, each XML element is indexed in a separate field, to allow search on individual fields.
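The retrieval model behind these runs is standard query likelihood with Dirichlet smoothing. The sketch below is a minimal toy version of that scoring formula, not Indri's actual implementation; the documents and query are invented.

    # Minimal sketch of Dirichlet-smoothed query likelihood,
    #   P(t|d) = (tf(t,d) + mu * P(t|C)) / (|d| + mu),
    # with mu = 2,500 as in our runs. A toy re-implementation, not Indri.
    import math
    from collections import Counter

    MU = 2500

    def log_query_likelihood(query, doc, ctf, clen):
        """log P(q|d); ctf is a Counter of collection term frequencies,
        clen the total number of term occurrences in the collection."""
        tf, dlen = Counter(doc), len(doc)
        score = 0.0
        for t in query:
            if ctf[t] == 0:
                continue  # term unseen in the collection; skip it
            p_tc = ctf[t] / clen  # background model P(t|C)
            score += math.log((tf[t] + MU * p_tc) / (dlen + MU))
        return score

    # Toy example: two single-field documents and a two-word topic title.
    # With documents this short and mu this large, smoothing dominates,
    # so the matching document wins by only a small margin.
    docs = {"d1": "wizard school fantasy adventure magic".split(),
            "d2": "cooking recipes kitchen basics".split()}
    ctf = Counter(t for d in docs.values() for t in d)
    clen = sum(ctf.values())
    query = "fantasy wizard".split()
    for name, doc in docs.items():
        print(name, round(log_query_likelihood(query, doc, ctf, clen), 4))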
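Table 1 below reports nDCG@10 and recall@1000 for these runs under each set of judgements. For reference, the sketch beneath shows how the two measures can be computed; it assumes binary gains, whereas the official evaluation may use graded gains.

    # Minimal sketch of nDCG@10 and recall@1000 with binary gains
    # (an assumption; the official evaluation may differ).
    import math

    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    def ndcg_at_10(ranking, relevant):
        gains = [1.0 if d in relevant else 0.0 for d in ranking[:10]]
        ideal = [1.0] * min(len(relevant), 10)
        return dcg(gains) / dcg(ideal) if relevant else 0.0

    def recall_at_1000(ranking, relevant):
        return len(set(ranking[:1000]) & set(relevant)) / len(relevant)

    ranking = ["b3", "b7", "b1", "b9", "b4"]   # a system's ranked book ids
    relevant = {"b1", "b4", "b8"}              # judged relevant books
    print(ndcg_at_10(ranking, relevant), recall_at_1000(ranking, relevant))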
Table 1: MTurk and LT Forum evaluation (nDCG@10 and recall@1000) of runs over different index fields

             MTurk-Rel      MTurk-Rec      MTurk-Rel&Rec   LT-Sug
    Field    nDCG  recall   nDCG  recall   nDCG  recall    nDCG  recall
    Title    0.212  0.601   0.260  0.545   0.172  0.591    0.055  0.350
    Dewey    0.000  0.009   0.003  0.007   0.000  0.005    0.001  0.022
    Subject  0.016  0.008   0.021  0.010   0.016  0.009    0.003  0.009
    Review   0.579  0.720   0.786  0.756   0.542  0.783    0.251  0.680
    Tag      0.368  0.694   0.435  0.665   0.320  0.718    0.216  0.602

Generally, systems perform better on the recommendation judgements (MTurk-Rec in Table 1) than on the topical relevance judgements (MTurk-Rel) and their combination (MTurk-Rel&Rec), and worst on the forum suggestions (LT-Sug). The suggestions seem harder to retrieve than books that are topically relevant. The Title field is the most effective of the non-UGC fields: it gives better precision and recall than the Dewey and Subject fields across all sets of judgements. The Review field is more effective than the Tag field. Note that all runs use the same queries. Even though book titles alone provide little information about books, with the Title field the majority of the judged topically relevant books can be found in the top 1,000, but only a third of the suggestions. The Review and Tag fields have high recall@1000 scores for all four sets of judgements. There is something about suggestions that goes beyond topical relevance, which the UGC fields are better able to capture. Furthermore, the retrieval system is a standard language model, developed to capture topical relevance; apparently such models can also deal with other aspects of relevance. It also shows how ineffective book search systems are if they ignore reviews. Even though there are many short, vague and unhelpful reviews, there seems to be enough useful content to substantially improve retrieval. This is different from general web search, where low-quality and spam documents need to be dealt with.

4. USER-CENTERED ANALYSIS
The MTurk workers answered questions on which part of the metadata is more useful to determine topical relevance, and which part to determine whether to recommend a book. Workers could also indicate that a description does not have enough information to answer question Q1 (topical relevance) or Q3 (recommendation). Table 2 shows the fraction of books for which workers did not have enough information, split over descriptions with no reviews (column 2), at least one review (column 3), no tags (column 4) and at least 10 distinct tags (column 5). First, without reviews, workers indicate that they do not have enough information to determine whether a book is topically relevant in 37% of the cases, and label the book as relevant in 30% of the cases. When there is at least one review, workers have too little information to determine topical relevance in only 1% of the cases, and label the book as relevant in 54% of the cases. Reviews thus contain important information for topical relevance. The presence of tags seems to have no effect, as the fractions are stable across books with different numbers of tags. We see a similar pattern for the recommendation question (Q3). In summary, the presence of reviews is important for both topical relevance and recommendation, while the presence and quantity of tags play almost no role.

Table 2: Impact of the presence of reviews and tags on judgements

                                       Reviews          Tags
                                    0 rev.  ≥1 rev.  0 tags  ≥10 tags
    Top. Rel. (Q1)   Not enough info.  0.37    0.01    0.09    0.09
                     Relevant          0.30    0.54    0.49    0.48
    Recommend. (Q3)  Not enough info.  0.53    0.01    0.14    0.12
                     Rel. + Rec.       0.22    0.51    0.46    0.45

5. CONCLUSIONS
In this paper we ventured into relatively unknown territory by studying the domain of social book search, in which traditional metadata is complemented by a wealth of user-generated descriptions. We focused on book requests that users post in real life and on the social recommendations they receive on the forums. We observe that the forum suggestions are complete enough to be used for evaluation, but that they are different in nature from traditional judgements for known-item, ad hoc and recommendation tasks. Even though most online book search systems ignore UGC, our experiments show that this content can improve both traditional ad hoc retrieval effectiveness and book suggestions, and that standard language models seem to deal well with this type of data. Our results highlight the relative importance of professional metadata and UGC, both for traditional known-item and ad hoc search and for book suggestions.

Acknowledgments
This research was supported by the Netherlands Organization for Scientific Research (NWO projects 612.066.513, 639.072.601 and 640.005.001) and by the European Community's Seventh Framework Programme (FP7 2007/2013, Grant Agreement 270404).

REFERENCES
[1] T. Beckers, N. Fuhr, N. Pharo, R. Nordlie, and K. N. Fachry. Overview and results of the INEX 2009 Interactive Track. In ECDL 2010, volume 6273 of LNCS, pages 409–412. Springer, 2010.
[2] M. Koolen, J. Kamps, and G. Kazai. Social book search: Comparing topical relevance judgements and book suggestions for evaluation. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM 2012). ACM Press, New York NY, 2012.