Comparing Topic Representations for Social Book Search

Marijn Koolen (1), Hugo Huurdeman (1,2), and Jaap Kamps (1,2,3)
(1) Institute for Logic, Language and Computation, University of Amsterdam
(2) Archives and Information Studies, Faculty of Humanities, University of Amsterdam
(3) ISLA, Faculty of Science, University of Amsterdam

Abstract. In this paper we describe our participation in the INEX 2013 Social Book Search Track. We compare the impact of different query representations for book search topics derived from the LibraryThing discussion forums, including the title and full narrative provided by the topic creator, the name of the discussion group in which the topic was posted, and a mediated search query provided by a trained annotator. Our findings are that 1) the mediated queries are short and do not improve performance over the titles, but combining titles and mediated queries does, 2) the discussion group name adds relevant new terms to the representation and further improves performance, but adding the narrative is not effective, and 3) for the majority of topics retrieval effectiveness is the same across all topic representations. Our findings suggest that writing a good search query for the complex information needs in social book search is far from trivial, even for trained annotators.

1 Introduction

For the INEX 2013 Social Book Search Track we focused our attention on query representations. The search topics in this track are based on discussion threads from the LibraryThing (LT) discussion forums and contain the title of the topic thread, the narrative in the first message of the thread and a mediated query created by a trained annotator. The latter is provided by the track organisers to compensate for non-representative thread titles for some of the forum topics.

The topic statements of the SBS Track contain rich representations of the book search information needs. The LT member who starts the topic thread describes her information need both in the thread title and in detail in the first message of the thread. In addition, she chooses a discussion group in which to start the thread, which broadly categorises her information need, with the aim to attract responses from LT members who are knowledgeable about relevant books and can recommend the best ones.

These different representations may each reflect different aspects of the information need. In our participation we investigate how these representations affect retrieval. Specifically, we want to know:

– How different are the thread title and the mediated query, and how does that affect retrieval performance?
– What is the importance of the detailed narrative, which explains the information need in detail, for representing the information need?
– What is the role of the discussion group name in representing the information need?

In addition, we experiment with a document prior based on the book ratings of LT members. We crawled a large set of user profiles from LT that include which books each member added to her catalogue and the ratings she assigned to them. The average rating of a book may reflect its overall quality, in which case it could be used to push low-quality and non-rated (and therefore unpopular) books down the ranking.

The paper is structured as follows. We first discuss the different topic representations that are available in this year's topic set in Section 2. Then, we describe our experimental setup in Section 3 and discuss results in Section 4. Next, in Section 5 we present a per-topic analysis. In Section 6, we discuss our findings and draw conclusions.
2 Topic Representations

The topics for the SBS task are based on topic threads on the LT discussion forums. Each thread starts with a message from the topic creator and is posted in one of the thousands of discussion groups. The 2013 topic set only contains topic threads that are started with a book search information need. The thread has a title and the first message can be seen as a narrative of the information need. For instance, topic 25244 has the title "Why Republic vs. Democracy" and is posted in the Political Conservatives discussion group. The narrative explains that the user wants to know more about forms of government and the logic behind choosing one or the other.

What is a good topic representation to use as a search query? The title is often a concise summary of the information need, but is not always comprehensive, especially for very detailed needs. Sometimes titles are conversational and reveal nothing about the topic of the information need, such as topic 45940 with the title "Request for recommendations...". The narrative explains that the user is looking for books about the miracles of Jesus that are not based on the Bible. The title is a bad representation of the information need, while the narrative contains much more than just the information need. Because these titles and narratives are not intended as search engine queries, this year the task organisers provided a mediated search query with each topic, created by a trained annotator. This query is meant to be both concise and comprehensive. We want to investigate the value of this mediated query with respect to the thread title and the narrative. Does it provide a better representation than the thread title? Does it cover all the fine-grained aspects expressed in the narrative? And what is the role of the Group name of the discussion group that the user selected? This group broadly categorises the information need with, we assume, the aim to find LT members who are knowledgeable on books about the subject. But it may also be useful as an additional representation of the topic.

The topic set contains 386 topics and each topic has five fields: title (T), query (Q), group (G), member and narrative (N). We ignore the member field, which contains the name of the topic creator and is probably not useful for representing the information need.

To understand some of the differences between these fields as possible topic representations, we analyse them in terms of the number of query terms they contain. Table 1 shows statistics on the number of query terms in (combinations of) fields, based on the text in those fields after parsing, stopword removal and Krovetz stemming; a sketch of this counting procedure is given after the table. This processing corresponds to the way documents are processed before indexing. Columns 3–7 show the total number of content terms and columns 8–12 show the number of distinct terms.

Table 1: Statistics on the number of words per topic representation for different combinations of topic fields (T = thread title, Q = mediated query, G = group name and N = narrative).

                      Total words                      Distinct words
Fields  # tpcs  min.  max.  med.   mean  std.dev  min.  max.  med.   mean  std.dev
T          386     0    12     4   3.90     2.07     0    12     4   3.88     2.04
Q          386     1    10     3   3.61     1.55     1     9     3   3.56     1.47
TQ         386     2    19     7   7.51     3.00     1    15     5   5.77     2.46
TQG        386     4    20     9   9.46     2.86     2    16     7   7.19     2.39
TQN        386     4   257    43  53.68    39.20     3   179    32  39.93    27.96
TQGN       386     6   259    44  55.63    39.04     4   179    32  41.03    27.69
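As an aside, the counting procedure behind Table 1 can be sketched as follows. This is a minimal sketch only: it uses NLTK's Porter stemmer as a stand-in for the Krovetz stemmer used in the paper, and the topic dictionaries and field names ('title', 'query', 'group', 'narrative') are hypothetical.

```python
import re
from statistics import mean, median, pstdev

from nltk.corpus import stopwords   # requires the NLTK stopwords corpus to be downloaded
from nltk.stem import PorterStemmer  # stand-in for the Krovetz stemmer used in the paper

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def content_terms(text):
    """Tokenise, lowercase, remove stopwords and stem, mirroring document indexing."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP]

def field_statistics(topics, fields):
    """Total and distinct content-term counts for a combination of topic fields.

    `topics` is a list of dicts with keys such as 'title', 'query', 'group'
    and 'narrative' (hypothetical field names, one dict per topic).
    """
    totals, distincts = [], []
    for topic in topics:
        terms = []
        for field in fields:
            terms.extend(content_terms(topic.get(field, "")))
        totals.append(len(terms))
        distincts.append(len(set(terms)))

    def describe(values):
        return min(values), max(values), median(values), mean(values), pstdev(values)

    return describe(totals), describe(distincts)

# Example: statistics for the TQ combination
# total_stats, distinct_stats = field_statistics(topics, ["title", "query"])
```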
The title field (T) has a mean (median) of 3.90 (4) content terms. The number of distinct terms is very similar, showing that content terms are rarely repeated in the title. There is one topic, number 28304, with zero content terms; its thread title is "Who am I? Why am I here?". This topic is posted in the Amateur Historians group and asks about books on exploration. Apart from containing only highly frequent words, the title also does not reflect the information need at all. Here the mediated query, "exploration books", improves the query representation. The query field (Q) is in general somewhat shorter (the median is 3), but there is always at least one content term. This raises the question of whether the mediated query is more comprehensive than the title, reflecting aspects of the narrative not covered in the title. Again, terms in the field are rarely repeated. The combination of the T and Q fields results in an almost doubling of the number of content terms. The number of distinct terms is lower than the total, but still higher than the number of distinct terms in either the title or the query field. This means that many, but not all, of the content terms in the title and query overlap. It is plausible that the most relevant terms from the title are repeated in the query, which results in higher term frequencies for the most important terms. This might be beneficial for retrieval.

Next, we add the group and narrative fields to the combined title and query fields. The group adds only one or two terms on average, while the narrative adds dozens of content terms, with some repeated terms. However, the narrative usually contains some conversational language, with many content terms not directly related to the information need. It is not clear to what extent the possibly larger number of relevant content terms can increase performance and to what extent its conversational distractor terms hurt performance. In Section 4 we discuss how these different fields affect retrieval effectiveness.

3 Experimental Setup

We used Indri [4] for indexing, removed stopwords and stemmed terms using the Krovetz stemmer. Based on the results from the 2011 Social Search for Best Books task [1] we include all the social metadata. From the Amazon/LibraryThing (A/LT) collection we use the book title, author name, subject headings, LT tags and Amazon user reviews for indexing. In addition, we use the Library of Congress Subject Headings (LCSH) from the catalogue records of the British Library and the Library of Congress. These subject headings are less noisy than the headings from Amazon, and there are more headings per book.

The topics are taken from the LibraryThing discussion groups and contain a title field with the title of a topic thread, a group field with the discussion group name and a narrative field with the first message from the topic thread. New this year is a mediated query field, which is provided by the organisers as an additional representation of the information need and is meant to be a more precise expression of it than the thread title.

In our experiments we used different combinations of topic fields as queries. For the language model our baseline has default settings for Indri (Dirichlet smoothing with µ = 2500); a sketch of this scoring function is given after the run list. We created six base runs:

– T: a standard LM run using only the Title field of the topic.
– Q: a standard LM run using only the Query field of the topic.
– TQ: a standard LM run using the Title and Query fields of the topic.
– TQG: a standard LM run using the Title, Query and Group fields of the topic.
– TQN: a standard LM run using the Title, Query and Narrative fields of the topic.
– TQGN: a standard LM run using the Title, Query, Group and Narrative fields of the topic.
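As a rough illustration of what these runs compute, the sketch below scores one document for one query under a query-likelihood language model with Dirichlet smoothing (µ = 2500), the default smoothing Indri applies; the term-frequency data structures are simplified assumptions and do not reflect Indri's actual index format.

```python
import math
from collections import Counter

MU = 2500  # Dirichlet smoothing parameter, the Indri default used in the paper

def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len):
    """Log query likelihood of a document under Dirichlet smoothing.

    query_terms / doc_terms: lists of (already stemmed) content terms.
    collection_tf: Counter of term frequencies over the whole collection.
    collection_len: total number of term occurrences in the collection.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term, q_tf in Counter(query_terms).items():
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # a term unseen in the collection contributes nothing here
        p_smoothed = (doc_tf.get(term, 0) + MU * p_coll) / (doc_len + MU)
        score += q_tf * math.log(p_smoothed)
    return score

# A TQ run simply concatenates the title and mediated-query terms, e.g. using the
# content_terms helper from the earlier sketch:
# query_terms = content_terms(topic["title"]) + content_terms(topic["query"])
```

Note that when a term occurs in both the title and the mediated query, its query-side frequency q_tf doubles, which is one way the TQ combination can give extra weight to the terms the two representations share.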
Last year we crawled a large set of user profiles from LT members and used member catalogues and book ratings to rerank retrieval results based on nearest-neighbourhood recommendation. This year, we use the Bayesian average of the book ratings as a document prior. That is, books that received ratings from LT members are boosted up the ranking with respect to books that received no ratings, and books with high ratings are boosted more than books with low ratings.

To normalise the ratings, we compute the Bayesian average over all the book ratings in the top 1000 results per topic. The Bayesian Average (BA) takes into account how many users have rated a work. As more users rate the same work, the average becomes more reliable and less sensitive to outliers. We make the BA dependent on the query, such that the BA of a book is based on books related to the query. The BA of a book b is computed as:

    BA(b) = (n̂ · m̂ + Σ_{r ∈ R(b)} r) / (n + n̂)    (1)

where R(b) is the set of ratings for b and n the number of ratings in R(b), m̂ is the average unweighted rating over all books in the top 1000 results and n̂ is the average number of ratings over all the books in the top 1000. Individual ratings range from 0.5 up to 5, with increments of 0.5, so BA(b) also lies between 0.5 and 5. For books with no rating we use BA(b) = 0. Each rating could be turned into a prior probability by dividing BA by the maximum rating BA_max = 5, but for books with no rating this would result in a prior probability of zero. To avoid multiplying by zero, we use the Add-One smoothing method, which gives unrated books a base score of 1 and rated books a score of 1 + BA, and compute the prior as:

    P_BA(d) = (1 + BA(d)) / (1 + BA_max)    (2)

The final document score is then:

    S_BA(d) = P(d|q) · P_BA(d)    (3)
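The reranking step can be summarised in a short sketch that follows Equations 1–3 directly. The data structures (a list of (book_id, score) results and a dict of LibraryThing ratings per book) are hypothetical, the reading of m̂ as the mean of per-book average ratings is our interpretation, and the sketch assumes the retrieval scores behave like probabilities (e.g. exponentiated language model scores), as Equation 3 presupposes.

```python
BA_MAX = 5.0  # maximum LibraryThing rating

def bayesian_average(ratings, avg_rating, avg_num_ratings):
    """Equation 1: Bayesian average of a book's ratings.

    ratings: list of LT ratings for this book (empty if unrated).
    avg_rating (m-hat): mean rating over all books in the top 1000 results.
    avg_num_ratings (n-hat): mean number of ratings over those books.
    """
    if not ratings:
        return 0.0
    return (avg_num_ratings * avg_rating + sum(ratings)) / (len(ratings) + avg_num_ratings)

def rating_prior(ba):
    """Equation 2: Add-One smoothed prior, so unrated books keep a non-zero prior."""
    return (1.0 + ba) / (1.0 + BA_MAX)

def rerank(results, book_ratings):
    """Equation 3: multiply each retrieval score by the rating prior and re-sort.

    results: list of (book_id, retrieval_score) pairs for the top 1000 of one topic.
    book_ratings: dict mapping book_id -> list of LT ratings (hypothetical structure).
    """
    rated = [book_ratings.get(book_id, []) for book_id, _ in results]
    rated_only = [rs for rs in rated if rs]
    # m-hat: one reasonable reading is the mean of per-book average ratings
    m_hat = (sum(sum(rs) / len(rs) for rs in rated_only) / len(rated_only)) if rated_only else 0.0
    # n-hat: average number of ratings over all books in the result list
    n_hat = sum(len(rs) for rs in rated) / len(rated)
    reranked = []
    for (book_id, score), ratings in zip(results, rated):
        ba = bayesian_average(ratings, m_hat, n_hat)
        reranked.append((book_id, score * rating_prior(ba)))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)
```

Because the prior ranges from 1/6 for unrated books up to 1 for a book with BA = 5, it can change relative scores by at most a factor of six.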
We submitted six runs:

– inex13SBS.ti qu: the TQ run.
– inex13SBS.ti qu gr na: the TQGN run.
– inex13SBS.ti.bayes avg.LT rating: the T run with the Bayesian average LT rating prior.
– inex13SBS.qu.bayes avg.LT rating: the Q run with the Bayesian average LT rating prior.
– inex13SBS.ti qu.bayes avg.LT rating: the TQ run with the Bayesian average LT rating prior.
– inex13SBS.ti qu gr na.bayes avg.LT rating: the TQGN run with the Bayesian average LT rating prior.

In the next section we discuss the evaluation results of the official submissions and, separately, of all our own runs.

4 Results

We first show the evaluation results over the whole topic set. Then we present a per-topic analysis of the differences in performance between the different topic representations.

This year, eight groups participated in the track, submitting a total of 32 runs. Our official submissions are all among the top 10 systems, as shown in Table 2. The top four systems are close together in terms of performance, as are the systems on ranks five up to nine. Our systems perform on par with the best other systems.

Table 2: Evaluation results of the top 10 runs of the INEX 2013 SBS task. Our runs are marked with an asterisk.

Rank  Run ID                                            ndcg@10    P@10     mrr     map
 1    run3.all-plus-query.all-doc-fields                 0.1361  0.0653  0.2286  0.0861
 2    inex13SBS.ti qu gr na.bayes avg.LT rating *        0.1331  0.0771  0.2342  0.0788
 2    inex13SBS.ti qu.bayes avg.LT rating *              0.1331  0.0771  0.2342  0.0788
 4    run1.all-topic-fields.all-doc-fields               0.1295  0.0647  0.2190  0.0797
 5    inex13SBS.ti qu gr na *                            0.1184  0.0555  0.2075  0.0790
 6    inex13SBS.ti qu *                                  0.1163  0.0647  0.2091  0.0665
 7    run ss bsqstw stop words free member free 2013     0.1150  0.0479  0.1839  0.0800
 8    run ss bsqstw stop words free 2013                 0.1147  0.0468  0.1843  0.0798
 8    inex13SBS.qu.bayes avg.LT rating *                 0.1147  0.0661  0.1997  0.0656
10    inex13SBS.ti.bayes avg.LT rating *                 0.1095  0.0634  0.2005  0.0630

We show the evaluation results of our own runs in Table 3. Significant differences are tested using the bootstrap method (one-tailed, with 100,000 samples; a sketch of this test is given below). Significance levels are 0.05 (◦), 0.01 (◦•) and 0.001 (•).

Table 3: Evaluation results of our runs in the INEX 2013 SBS task. Significance levels are 0.05 (◦), 0.01 (◦•) and 0.001 (•). Percentages denote the change relative to the title-only run (T in the top half, T_BA in the bottom half).

Run id     ndcg@10       %     P@10        %     mrr          %     map         %
T          0.094               0.053             0.190              0.066
Q          0.097      2.6%     0.054     1.3%    0.187     -1.3%    0.065    -1.7%
TQ         0.116◦•   23.3%     0.065•   21.6%    0.220◦•   15.9%    0.082◦•  24.2%
TQG        0.120•    27.4%     0.068•   27.1%    0.225◦•   18.7%    0.084◦•  28.3%
TQN        0.115◦    21.6%     0.052    -3.0%    0.204      7.9%    0.089◦•  35.8%
TQGN       0.118◦    25.6%     0.056     4.3%    0.217     14.5%    0.093◦•  40.9%
T_BA       0.110               0.063             0.209              0.077
Q_BA       0.115      4.8%     0.066     4.3%    0.209      0.1%    0.079     2.9%
TQ_BA      0.133•    21.6%     0.077•   21.6%    0.244◦•   16.7%    0.095•   23.4%
TQG_BA     0.135•    23.4%     0.077•   22.1%    0.247◦•   18.3%    0.096•   24.1%
TQN_BA     0.132◦    20.5%     0.063    -0.3%    0.237     13.6%    0.102◦•  31.9%
TQGN_BA    0.132◦    20.6%     0.067     5.4%    0.235     12.7%    0.100◦•  29.1%

In the top half of the table we see the base runs without the Bayesian average rating prior; significance tests are with respect to the title-only (T) run. Somewhat surprisingly, the title-only (T) and query-only (Q) representations lead to similar performance. The mediated query does not improve the representation of the information need. However, the combination of title and mediated query (TQ) gives significantly better performance than either in isolation. This reflects the fact that the query is not simply a copy of the thread title, but either adds complementary relevant terms or gives more weight to the most relevant terms by repeating them, or both.

Adding the group name to the title and query (TQG) further improves performance, reflecting the users' ability to pick relevant discussion groups for their needs. However, adding the more detailed narrative hurts performance for early precision (nDCG@10, P@10 and Mean Reciprocal Rank (MRR)) while improving Mean Average Precision (MAP). It seems the narrative is not focused enough to precisely pinpoint the suggested books, but its larger set of query terms does lead to better recall.

In the bottom half of Table 3 we see the six runs with the Bayesian average rating prior. Again, significant differences are with respect to the title-only run T_BA. The rating priors lead to improvements on all reported measures for all six baseline runs. Among the runs with rating priors we see the same patterns as among the baseline runs. The T and Q representations lead to similar performance, but their combination leads to better performance. The group name improves the topic representation, but the narrative hurts early precision while improving MAP. We also tested the improvements of the rating-prior runs over their baseline forms and found that all improvements are significant at p < 0.001, except for the TQGN run, where the improvement is significant at p < 0.05. This shows the reliability of the rating priors.
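For reference, the significance test mentioned above can be sketched as follows. This is a minimal paired bootstrap over per-topic score differences (the shift method), assuming per-topic ndcg@10 values keyed by topic id; the exact bootstrap variant used for the official figures may differ.

```python
import random

def bootstrap_p(run_scores, baseline_scores, samples=100_000, seed=42):
    """One-tailed paired bootstrap test on per-topic scores (e.g. ndcg@10).

    run_scores / baseline_scores: dicts mapping topic id -> score.
    Returns the estimated p-value for 'run improves over the baseline'.
    """
    topics = sorted(run_scores)
    diffs = [run_scores[t] - baseline_scores[t] for t in topics]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(samples):
        resample = rng.choices(diffs, k=len(diffs))
        # subtracting the observed mean recentres the resample on the null
        # hypothesis of no difference between the two runs
        if sum(resample) / len(resample) - observed >= observed:
            extreme += 1
    return extreme / samples

# Example: is the TQ run significantly better than the T run?
# p = bootstrap_p(ndcg_tq, ndcg_t)   # p < 0.01 would correspond to the '◦•' marker
```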
In sum, the title and query representations are equally effective but complementary to each other. The group name can further improve performance, while the narrative seems to add too many partly relevant and irrelevant terms. The LT ratings, if normalised by taking the Bayesian average, form a reliable document prior probability of relevance.

5 Per-Topic Analysis

We show the per-topic differences in ndcg@10 between pairs of runs in Table 4; a sketch of how these counts are computed is given at the end of this section.

Table 4: Per-topic differences in ndcg@10 between runs.

                         # topics
                      ↓      =      ↑
S(Q) − S(T)          74    237     69
S(TQ) − S(T)         50    256     74
S(TQ) − S(Q)         49    255     76
S(TQG) − S(TQ)       53    257     70
S(TQN) − S(TQ)       84    222     74
S(TQGN) − S(TQ)      81    220     79

The Q run has lower scores than the T run for 74 topics (column 2), higher scores for 69 topics and the same score for 237 topics. These two runs are balanced, which explains why they lead to similar average scores, but the large number of topics for which the two runs get the same score suggests that in most cases the mediated query is very similar to the thread title. It also suggests that creating an effective representation of the information need is far from trivial, even for trained annotators. Some of the mediated queries improve upon thread titles that do not or only partly reflect the often complex information needs in social book search [3]. But even more mediated queries express the search topic less well than the title created by the topic creator.

Next we compare the TQ run with the T and Q runs (rows 3 and 4). These are less balanced: TQ outperforms T on 74 topics and Q on 76 topics, while T outperforms TQ on only 50 topics and Q outperforms TQ on 49 topics. This explains why the combination of the two representations scores higher on average than either on its own. Because T and Q are often very similar, their combination also often results in the same score.

Finally, we compare the per-topic scores of the TQ representation with the richer representations TQG, TQN and TQGN. The TQG run improves performance on more topics than it decreases performance on, which corresponds with an improvement in the average score. The representations that include the narrative, TQN and TQGN, both worsen performance with respect to the TQ representation on more topics than they improve it on, corresponding to a drop in ndcg@10. What is surprising is that including the much longer narrative in the representation does not affect the per-topic score for the majority of topics. There are several possible explanations for this. It could be that the additional terms often provide the same relevance signal as the TQ terms, or introduce only random noise. Another explanation is that the TQ terms are frequently repeated in the narrative and therefore have a dominant impact on the retrieval score.

To summarise, the different query representations often carry the same signal, which may be because the same content terms dominate in the representations. It seems hard to improve upon the title created by the topic starter, but combining the concise representations of the topic starter and the annotator more often results in an improved representation than in a worse one.
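The per-topic comparison in Table 4 amounts to counting, for each pair of runs, the topics on which ndcg@10 decreases, stays the same or increases. A minimal sketch, assuming the same per-topic score dictionaries as in the earlier bootstrap sketch:

```python
def per_topic_comparison(run_a, run_b, tolerance=1e-6):
    """Count topics where run_b scores lower than, equal to, or higher than run_a.

    run_a / run_b: dicts mapping topic id -> ndcg@10.
    Returns (down, equal, up), i.e. one row of Table 4 for the comparison S(b) - S(a).
    """
    down = equal = up = 0
    for topic in run_a:
        delta = run_b[topic] - run_a[topic]
        if delta < -tolerance:
            down += 1
        elif delta > tolerance:
            up += 1
        else:
            equal += 1
    return down, equal, up

# Example: the S(Q) - S(T) row of Table 4
# print(per_topic_comparison(ndcg_t, ndcg_q))
```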
6 Conclusion

In this paper we discussed our participation in the INEX 2013 Social Book Search Track, in which we focused on the impact of different query representations of the information need on retrieval effectiveness. The LT members who start a topic thread to ask for book suggestions on the discussion forums provide multiple perspectives on their information needs. The thread title is a short summary, the first message in the thread is a detailed description, and the choice of a particular discussion group reveals the relevant general category of books for which they hope to find knowledgeable members. In addition, the task organisers provided mediated queries that aim to be both concise and comprehensive expressions of the information need, and that are suitable as search engine queries.

The mediated query is in general slightly shorter than the thread title, and typically contains a few overlapping terms and one or a few different content terms. By combining the representations, the overlapping terms in the title and query, which we assume are the most relevant terms, receive extra weight. The group name is short but also tends to add a few new terms to the representation with respect to the title and query. The narrative is much longer and adds many terms, relevant or not, to the representation.

In terms of the impact of representations on retrieval effectiveness, the title and mediated query are equally effective. Their combination, however, leads to significant improvements over using the title alone, which is either due to the higher frequency of the most important terms or to the complementary content terms. Still, for most topics, the title, the query and their combination lead to the same retrieval performance. Adding the group name improves performance, indicating that the user selected a relevant discussion group for her information need. Adding the narrative degrades performance slightly, which may be because of the addition of irrelevant or only partly relevant terms that broaden the scope of the query. These findings suggest that creating a comprehensive and effective topic representation that identifies all the important relevance aspects of social book search information needs is not easy, even for trained annotators. Such topics often contain complex, multi-faceted aspects, which may be the reason why users turn to the forum in the first place, as current book search systems provide limited options to express complex needs.

We also experimented with reranking results by combining the retrieval score with a prior probability based on the Bayesian average of a book's LibraryThing ratings. These average ratings provide a reliable probability of relevance and lead to significant improvements in performance.

In future work we will look in more detail at the overlap and complementarity of the title and mediated query and at the role of term frequencies in topic representations of the complex information needs in social book search. We will also study the role of the detailed narrative and experiment with extracting the most salient additional terms to improve the topic representations. One way would be to use parsimonious language models [2] to remove common conversational terms.

Acknowledgments. This research was supported by the Netherlands Organization for Scientific Research (NWO projects # 612.066.513, 639.072.601, and 640.005.001) and by the European Community's Seventh Framework Programme (FP7 2007/2013, Grant Agreement 270404).

Bibliography

[1] F. Andriaans, M. Koolen, and J. Kamps. The importance of document ranking and user-generated content for faceted search and book suggestions. In S. Geva, J. Kamps, and R. Schenkel, editors, Focused Retrieval of Content and Structure: 10th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2011), volume 7424 of LNCS. Springer, 2012.
[2] D. Hiemstra, S. Robertson, and H. Zaragoza. Parsimonious language models for information retrieval. In Proceedings of SIGIR 2004, pages 178–185. ACM Press, New York, NY, 2004.
[3] M. Koolen, J. Kamps, and G. Kazai. Social book search: The impact of professional and user-generated content on book suggestions. In Proceedings of the International Conference on Information and Knowledge Management (CIKM 2012). ACM, 2012.
[4] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: A language-model based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis, 2005.