Social Book Search: The Impact of Professional and User-Generated Content on Book Suggestions

Marijn Koolen¹  Jaap Kamps¹  Gabriella Kazai²
¹ University of Amsterdam, The Netherlands
² Microsoft Research, Cambridge, UK

ABSTRACT
The Web and social media give us access to a wealth of information, not only different in quantity but also in character—traditional descriptions from professionals are now supplemented with user-generated content. This challenges modern search systems based on the classical model of topical relevance and ad hoc search. We compare classical IR with social book search in the context of the LibraryThing discussion forums, where members ask for book suggestions. This paper is a compressed version of [2].

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process
General Terms: Experimentation, Measurement, Performance
Keywords: Book search, User-generated content, Evaluation

1. INTRODUCTION
The web gives access to a wealth of information that is different from traditional collections both in quantity and in character. Especially through social media, there is more subjective and opinionated data, which gives rise to different tasks in which users are looking not only for facts but also for views and interpretations, and which may require different notions of relevance. In this paper we look at how search has changed by directly comparing classical IR and social search in the context of the LibraryThing (LT) discussion forums, where members ask for book suggestions. We use a large collection of book descriptions from Amazon and LT, which contains both professional metadata and user-generated content (UGC), and compare book suggestions on the forum with Mechanical Turk judgements on topical relevance and recommendation for the evaluation of retrieval systems. Searchers not only consider the topical relevance of a book, but also care about how interesting, well-written, recent, fun, educational or popular it is. Such affective aspects may be mentioned in reviews, but Amazon, LT and many similar sites do not include UGC in the main search index. Our main research question is:

• How does social book search compare to traditional search tasks?

For this study, we set up the Social Search for Best Books (SB) task as part of the INEX 2011 Books and Social Search Track.¹ We want to find out whether the suggestions are complete and reliable enough for retrieval evaluation, and how social book search is related to traditional search tasks. We also want to know whether users prefer professional metadata or UGC for judging topical relevance and for recommendation, and how standard IR models cope with UGC.

¹ https://inex.mmci.uni-saarland.de/tracks/books/

2. SOCIAL SEARCH FOR BEST BOOKS
In this section we detail the collection and the LT forum topics.

Collection. The Amazon/LT collection [1] consists of 2.8 million book records from Amazon, identified by ISBN, extended with social metadata from LT, and marked up in XML. These records contain title information, Dewey classification codes and subject headings supplied by Amazon. The reviews and tags were limited to the first 50 reviews and 100 tags per book during crawling. The professional metadata is more evenly distributed than the UGC: books have a single classification code and most have one or two subject headings, although a small fraction has no professional metadata at all. Typical of UGC, popular books have many tags and reviews while many others have few or none. The median numbers of reviews and tags are 0 and 5 respectively; that is, the majority of books has no reviews but at least a handful of tags.
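To make the shape of these records concrete, the sketch below reads the main fields of a single book record. It is a minimal illustration only: the element names and the toy record are our assumptions, and the actual XML schema of the Amazon/LT collection may differ.

    # Minimal sketch of an Amazon/LT-style book record. The element names
    # are hypothetical; the collection's actual markup may differ.
    import xml.etree.ElementTree as ET

    record = ET.fromstring("""
    <book>
      <isbn>0141439513</isbn>
      <title>Pride and Prejudice</title>
      <dewey>823</dewey>
      <subject>England -- Fiction</subject>
      <review>A witty comedy of manners, still fresh today.</review>
      <tag count="132">classic</tag>
      <tag count="87">romance</tag>
    </book>""")

    # Professional metadata: one Dewey code, a few subject headings.
    print(record.findtext("dewey"), [s.text for s in record.findall("subject")])
    # UGC: zero or more reviews and tags, capped at 50 and 100 during crawling.
    print(len(record.findall("review")), "reviews,",
          len(record.findall("tag")), "tags")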
Topics. LibraryThing users discuss their books in forums dedicated to certain topics. Many topic threads are started with a request from a member for interesting, fun new books to read. Other members often reply with links to works catalogued on LT, which we connected to books in our collection through their ISBNs. These requests for recommendations are natural expressions of information needs over a large collection of online book records, and the book suggestions are human recommendations from members interested in the same topic. For the Social Search for Best Books task we selected a set of 211 topics, some focused on fiction and some on non-fiction books. For the Mechanical Turk experiment we focus on a subset of 24 topics.

MTurk Judgements. We compare the LT forum suggestions against traditional judgements of topical relevance, as well as against recommendation judgements. We set up an experiment on Amazon Mechanical Turk to obtain judgements on document pools based on top-10 pooling of the 22 runs submitted by the 4 participating groups. We designed a task that asks Mechanical Turk workers to judge the relevance of 10 books for a given book request. Apart from a question on topical relevance (Q1), we also asked whether the worker would recommend the book to the requester (Q3), and which part of the metadata—curated or user-generated—was more useful for determining topical relevance and for recommendation. We included quality assurance and control measures to deter spammers and sloppy workers. Averaged over workers, the agreement with the LT forum suggestions is 0.52.

3. SYSTEM-CENTERED ANALYSIS
We compare system rankings of the 22 official runs based on the forum suggestions and on the MTurk relevance judgements. The Kendall's τ system ranking correlation between the forum suggestions for the 211 topics and the MTurk judgements on the 24 topics is 0.36. This is not due to the difference between the 211 topics of the forum suggestions and the subset of 24 topics selected for MTurk, as the correlation between the forum suggestions of the 211 and 24 topic sets is τ = 0.90.
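As a concrete illustration of this comparison, the sketch below computes Kendall's τ between the rankings of the same runs induced by two judgement sets. The effectiveness scores are invented for illustration; only the procedure mirrors the comparison above.

    # Minimal sketch: Kendall's tau rank correlation between system rankings
    # induced by two judgement sets. The scores below are made up.
    from scipy.stats import kendalltau

    # Hypothetical effectiveness of five runs under each judgement set.
    score_forum = [0.25, 0.21, 0.18, 0.14, 0.09]  # LT forum suggestions
    score_mturk = [0.43, 0.58, 0.37, 0.61, 0.12]  # MTurk relevance judgements

    tau, p = kendalltau(score_forum, score_mturk)
    print(f"Kendall's tau = {tau:.2f} (p = {p:.3f})")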
One possible explanation for this low correlation is that the forum suggestions are highly incomplete. Most topics have few suggestions (the median is 7). If the suggestions are only a small fraction of all relevant books, good and bad systems alike will perform poorly, as the chance of ranking the few suggested books above the other relevant books is small. However, the highest MRR score among the 22 runs is 0.481, which means that on average, over the 211 topics, this system returns a suggested book in the top 2. If this occurred for only a few topics it could be ascribed to coincidence, but over 211 topics such a high average is unlikely to be due to chance. Based on this, we argue that the forum suggestions are relatively complete, but represent a different task from the ad hoc task modelled by the topical relevance judgements from MTurk. In [2] we also show that the forum suggestions behave differently from known-item topics.

Next, we created a number of our own runs to compare the forum suggestions against the MTurk judgements. For indexing we use Indri's language model, with Krovetz stemming, stopword removal and default smoothing (Dirichlet, µ = 2,500). The titles of the forum topics are used as queries. In our base index, each XML element is indexed in a separate field, to allow search on individual fields.
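The retrieval model behind these runs is standard query likelihood with Dirichlet smoothing. The sketch below is a minimal toy version of that scoring formula, not Indri's actual implementation; the documents and query are invented.

    # Minimal sketch of Dirichlet-smoothed query likelihood,
    #   P(t|d) = (tf(t,d) + mu * P(t|C)) / (|d| + mu),
    # with mu = 2,500 as in our runs. A toy re-implementation, not Indri.
    import math
    from collections import Counter

    MU = 2500

    def log_query_likelihood(query, doc, ctf, clen):
        """log P(q|d); ctf is a Counter of collection term frequencies,
        clen the total number of term occurrences in the collection."""
        tf, dlen = Counter(doc), len(doc)
        score = 0.0
        for t in query:
            if ctf[t] == 0:
                continue  # term unseen in the collection; skip it
            p_tc = ctf[t] / clen  # background model P(t|C)
            score += math.log((tf[t] + MU * p_tc) / (dlen + MU))
        return score

    # Toy example: two single-field documents and a two-word topic title.
    # With documents this short and mu this large, smoothing dominates,
    # so the matching document wins by only a small margin.
    docs = {"d1": "wizard school fantasy adventure magic".split(),
            "d2": "cooking recipes kitchen basics".split()}
    ctf = Counter(t for d in docs.values() for t in d)
    clen = sum(ctf.values())
    query = "fantasy wizard".split()
    for name, doc in docs.items():
        print(name, round(log_query_likelihood(query, doc, ctf, clen), 4))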
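Table 1 below reports nDCG@10 and recall@1000 for these runs under each set of judgements. For reference, the sketch beneath shows how the two measures can be computed; it assumes binary gains, whereas the official evaluation may use graded gains.

    # Minimal sketch of nDCG@10 and recall@1000 with binary gains
    # (an assumption; the official evaluation may differ).
    import math

    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    def ndcg_at_10(ranking, relevant):
        gains = [1.0 if d in relevant else 0.0 for d in ranking[:10]]
        ideal = [1.0] * min(len(relevant), 10)
        return dcg(gains) / dcg(ideal) if relevant else 0.0

    def recall_at_1000(ranking, relevant):
        return len(set(ranking[:1000]) & set(relevant)) / len(relevant)

    ranking = ["b3", "b7", "b1", "b9", "b4"]   # a system's ranked book ids
    relevant = {"b1", "b4", "b8"}              # judged relevant books
    print(ndcg_at_10(ranking, relevant), recall_at_1000(ranking, relevant))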
Table 1: MTurk and LT Forum evaluation (nDCG@10 and recall@1000) of runs over different index fields

             MTurk-Rel      MTurk-Rec      MTurk-Rel&Rec   LT-Sug
    Field    nDCG  recall   nDCG  recall   nDCG  recall    nDCG  recall
    Title    0.212  0.601   0.260  0.545   0.172  0.591    0.055  0.350
    Dewey    0.000  0.009   0.003  0.007   0.000  0.005    0.001  0.022
    Subject  0.016  0.008   0.021  0.010   0.016  0.009    0.003  0.009
    Review   0.579  0.720   0.786  0.756   0.542  0.783    0.251  0.680
    Tag      0.368  0.694   0.435  0.665   0.320  0.718    0.216  0.602

Generally, systems perform better on the recommendation judgements (MTurk-Rec in Table 1) than on the topical relevance judgements (MTurk-Rel) and their combination (MTurk-Rel&Rec), and worst on the forum suggestions (LT-Sug). The suggestions seem harder to retrieve than books that are topically relevant. The Title field is the most effective of the non-UGC fields: it gives better precision and recall than the Dewey and Subject fields across all sets of judgements. The Review field is more effective than the Tag field. Note that all runs use the same queries. Even though book titles alone provide little information about books, with the Title field the majority of the judged topically relevant books can be found in the top 1,000, but only a third of the suggestions. The Review and Tag fields have high recall@1000 scores for all four sets of judgements. There is something about suggestions that goes beyond topical relevance, which the UGC fields are better able to capture. Furthermore, the retrieval system is a standard language model, developed to capture topical relevance; apparently such models can also deal with other aspects of relevance. It also shows how ineffective book search systems are if they ignore reviews. Even though there are many short, vague and unhelpful reviews, there seems to be enough useful content to substantially improve retrieval. This is different from general web search, where low-quality and spam documents need to be dealt with.

4. USER-CENTERED ANALYSIS
The MTurk workers answered questions on which part of the metadata is more useful to determine topical relevance, and which part to determine whether to recommend a book. Workers could also indicate that a description does not have enough information to answer question Q1 (topical relevance) or Q3 (recommendation). Table 2 shows the fraction of books for which workers did not have enough information, split over descriptions with no reviews (column 2), at least one review (column 3), no tags (column 4) and at least 10 distinct tags (column 5). First, without reviews, workers indicate that they do not have enough information to determine whether a book is topically relevant in 37% of the cases, and label the book as relevant in 30% of the cases. When there is at least one review, workers have too little information to determine topical relevance in only 1% of the cases, and label the book as relevant in 54% of the cases. Reviews thus contain important information for topical relevance. The presence of tags seems to have no effect, as the fractions are stable across books with different numbers of tags. We see a similar pattern for the recommendation question (Q3). In summary, the presence of reviews is important for both topical relevance and recommendation, while the presence and quantity of tags play almost no role.

Table 2: Impact of the presence of reviews and tags on judgements

                                       Reviews          Tags
                                    0 rev.  ≥1 rev.  0 tags  ≥10 tags
    Top. Rel. (Q1)   Not enough info.  0.37    0.01    0.09    0.09
                     Relevant          0.30    0.54    0.49    0.48
    Recommend. (Q3)  Not enough info.  0.53    0.01    0.14    0.12
                     Rel. + Rec.       0.22    0.51    0.46    0.45

5. CONCLUSIONS
In this paper we ventured into relatively unknown territory by studying the domain of social book search, in which traditional metadata is complemented by a wealth of user-generated descriptions. We focused on book requests that users post in real life and on the social recommendations they receive on the forums. We observe that the forum suggestions are complete enough to be used for evaluation, but that they are different in nature from traditional judgements for known-item, ad hoc and recommendation tasks. Even though most online book search systems ignore UGC, our experiments show that this content can improve both traditional ad hoc retrieval effectiveness and book suggestions, and that standard language models seem to deal well with this type of data. Our results highlight the relative importance of professional metadata and UGC, both for traditional known-item and ad hoc search and for book suggestions.

Acknowledgments
This research was supported by the Netherlands Organization for Scientific Research (NWO projects 612.066.513, 639.072.601 and 640.005.001) and by the European Community's Seventh Framework Programme (FP7 2007/2013, Grant Agreement 270404).

REFERENCES
[1] T. Beckers, N. Fuhr, N. Pharo, R. Nordlie, and K. N. Fachry. Overview and results of the INEX 2009 Interactive Track. In ECDL 2010, volume 6273 of LNCS, pages 409–412. Springer, 2010.
[2] M. Koolen, J. Kamps, and G. Kazai. Social book search: Comparing topical relevance judgements and book suggestions for evaluation. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM 2012). ACM Press, New York NY, 2012.