
Comparing Topic Representations for Social Book Search

Marijn Koolen

Hugo Huurdeman

0 2

Jaap Kamps

0 Archives and Information Studies, Faculty of Humanities, University of Amsterdam
1 ISLA, Faculty of Science, University of Amsterdam
2 Institute for Logic, Language and Computation, University of Amsterdam

In this paper we describe our participation in the INEX 2013 Social Book Search Track. We compare the impact of different query representations for book search topics derived from the LibraryThing discussion forums, including the title and full narrative provided by the topic creator, the name of the discussion group in which the topic was posted, and a mediated search query provided by a trained annotator. Our findings are that 1) the mediated queries are short and do not improve performance over the titles, but combining titles and mediated queries does, 2) the discussion group name adds relevant new terms to the representation and further improves performance, but adding the narrative is not effective, and 3) for the majority of topics retrieval effectiveness is the same across all topic representations. Our findings suggest that writing a good search query for the complex information needs in social book search is far from trivial, even for trained annotators.

For the INEX 2013 Social Book Search Track we focused our attention on query representations. The search topics in this track are based on discussion threads from the LibraryThing (LT) discussion forums and contain the title of the topic thread, the narrative in the first message of the thread and a mediated query created by a trained annotator. The latter is provided by the track organisers to compensate for non-representative thread titles for some of the forum topics.

The topic statements of the SBS Track contain rich representations of the book search information needs. The LT member who starts the topic thread describes her information need both in the thread title and in detail in the first message of the thread. In addition, she chooses a discussion group in which to start the thread, which broadly categorises her information need, with the aim of attracting responses from LT members who are knowledgeable about relevant books and can recommend the best ones.

These different representations may each reflect different aspects of the information need. In our participation we investigate how these representations affect retrieval. Specifically, we want to know:

- How different are the thread title and the mediated query, and how does that affect retrieval performance?
- What is the importance of the detailed narrative, which explains the information need in detail, for representing the information need?
- What is the role of the discussion group name in representing the information need?

In addition, we experiment with a document prior based on the book ratings of LT members. We crawled a large set of user profiles from LT that includes which books each member added to her catalogue and the rating she assigned to each. The average rating of a book may reflect its overall quality, in which case it could be used to push low-quality and non-rated (and therefore unpopular) books down the ranking.

The paper is structured as follows. We first discuss the different topic representations that are available in this year's topic set in Section 2. Then, we describe our experimental setup in Section 3 and discuss results in Section 4. Next, in Section 5 we present a per-topic analysis. In Section 6, we discuss our findings and draw conclusions.

2 Topic Representations

The topics for the SBS task are based on topic threads on the LT discussion forums. Each thread starts with a message from the topic creator and is posted in one of the thousands of discussion groups. The 2013 topic set only contains topic threads that are started with a book search information need. The thread has a title and the first message can be seen as a narrative of the information need. For instance, topic 25244 has title Why Republic vs. Democracy and is posted in the Political Conservatives discussion group. The narrative explains that the user wants to know more about forms of government and the logic behind choosing one or the other.

What is a good topic representation to use as a search query? The title is often a concise summary of the information need, but is not always comprehensive, especially for very detailed needs. Sometimes titles are conversational and reveal nothing about the topic of the information need, such as topic 45940 with title Request for recommendations.... The narrative explains that the user is looking for books about the miracles of Jesus that are not based on the Bible. Here the title is a bad representation of the information need, while the narrative contains much more than just the information need. Because these titles and narratives are not intended as search engine queries, this year the task organisers provided a mediated search query with each topic, created by a trained annotator. This query is meant to be both concise and comprehensive.

We want to investigate the value of this mediated query with respect to the thread title and the narrative. Does it provide a better representation than the thread title? Does it cover all the fine-grained aspects expressed in the narrative? And what is the role of the name of the discussion group that the user selected? This group broadly categorises the information need with, we assume, the aim to find LT members who are knowledgeable on books about the subject. But it may also be useful as an additional representation of the topic.

The topic set contains 386 topics and each topic has five fields: title (T), query (Q), group (G), member and narrative (N). We ignore the member field, which contains the name of the topic creator and is probably not useful for representing the information need. To understand some of the differences between these fields as possible topic representations, we analyse them in terms of the number of query terms they contain.

In Table 1 we see statistics on the number of query terms in (combinations of) fields, based on the text in those fields after parsing, stopword removal and Krovetz stemming. This processing corresponds to the way documents are processed before indexing. Columns 3–7 show the total number of content terms and columns 8–12 show the number of distinct terms. The title field (T) has a mean (median) of 3.90 (4) content terms. The number of distinct terms is very similar, showing that content terms are rarely repeated in the title. There is one topic, number 28304, whose thread title, Who am I? Why am I here?, has zero content terms. This is a topic posted in the Amateur Historians group asking about books on exploration. Apart from containing only highly frequent words, the title does not reflect the information need at all. Here the mediated query, exploration books, improves the query representation.

The query field (Q) is in general somewhat shorter (the median is 3), but there is always at least one content term. This raises the question whether the mediated query is more comprehensive than the title, reflecting aspects from the narrative not covered in the title. Again, terms in the field are rarely repeated.

The combination of the T and Q fields almost doubles the number of content terms. The number of distinct query terms is lower, but still higher than the number of distinct terms in either the title or the query field. This means that many, but not all, of the content terms in the title and query overlap. It is plausible that the most relevant terms from the title are repeated in the query, which results in higher term frequencies for the most important terms. This might be beneficial for retrieval.

Next, we add the group and narrative fields to the combined title and query field. The group adds only one or two terms on average, while the narrative adds dozens of content terms, with some repeated terms. However, the narrative usually contains some conversational language, with many content terms not directly related to the information need. It is not clear to what extent the possibly larger number of relevant content terms can increase performance and to what extent its conversational distractor terms hurt performance.
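To make the term statistics concrete, the following is a minimal sketch of how total and distinct content terms per field combination could be counted. It is not the code used for Table 1: the stopword list is a small hypothetical one, Krovetz stemming is omitted, and the mediated query of the second example topic is invented for illustration (only the titles and the query exploration books come from the text above).

```python
from statistics import mean, median

# Hypothetical, tiny stopword list; the list used in the experiments is not given.
STOPWORDS = {"a", "am", "i", "who", "why", "here", "the", "of", "for",
             "and", "on", "in", "to", "is"}


def content_terms(text):
    """Lowercase, split on non-alphanumeric characters, drop stopwords.

    Stemming is left out of this sketch.
    """
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return [t for t in cleaned.split() if t not in STOPWORDS]


def field_stats(topics, fields):
    """Mean and median of total and distinct content terms for a field combination."""
    totals, distincts = [], []
    for topic in topics:
        terms = [t for f in fields for t in content_terms(topic.get(f, ""))]
        totals.append(len(terms))
        distincts.append(len(set(terms)))
    return {
        "total_mean": mean(totals), "total_median": median(totals),
        "distinct_mean": mean(distincts), "distinct_median": median(distincts),
    }


topics = [
    {"title": "Who am I? Why am I here?", "query": "exploration books"},
    # The query below is a made-up example, not the annotator's query.
    {"title": "Why Republic vs. Democracy", "query": "republic democracy government"},
]

print(field_stats(topics, ["title"]))
print(field_stats(topics, ["title", "query"]))
```

With this toy stopword list the first title yields no content terms at all, which mirrors the situation described for topic 28304, and the gap between total and distinct counts for the combined fields shows the overlap between title and query.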

In Section 4 we discuss how these different fields affect retrieval effectiveness.

3 Experimental Setup

We used Indri [4] for indexing, removed stopwords and stemmed terms using the Krovetz stemmer. Based on the results from the 2011 Social Search for Best Books task [1] we include all the social metadata. From the Amazon/LibraryThing (A/LT) collection we use the book title, author name, subject headings, LT tags and Amazon user reviews for indexing. In addition, we use the Library of Congress Subject Headings (LCSH) from the catalogue records of the British Library and the Library of Congress. These subject headings are less noisy than the headings from Amazon, and there are more headings per book.

The topics are taken from the LibraryThing discussion groups and contain a title field which contains the title of a topic thread, a group field which contains the discussion group name and a narrative field which contains the first message from the topic thread. New this year is a mediated query field, which is provided by the organisers as an additional representation of the information need and is meant to be a more precise expression of it than the thread title.

In our experiments we used different combinations of topic fields as queries. For the language model our baseline has default settings for Indri (Dirichlet smoothing with μ = 2500). We created six base runs:

- T: a standard LM run using only the Title field of the topic.
- Q: a standard LM run using only the Query field of the topic.
- TQ: a standard LM run using the Title and Query fields of the topic.
- TQG: a standard LM run using the Title, Query and Group fields of the topic.
- TQN: a standard LM run using the Title, Query and Narrative fields of the topic.
- TQGN: a standard LM run using the Title, Query, Group and Narrative fields of the topic.
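As an illustration of what these runs compute, the sketch below scores documents with a Dirichlet-smoothed query likelihood and builds the query by concatenating topic fields. It is a simplified stand-in for Indri, not its implementation: the function names, the toy documents and the decision to skip terms unseen in the collection are assumptions made for this example.

```python
import math
from collections import Counter


def build_query(topic, fields):
    """Concatenate the terms of the chosen topic fields, e.g. ("title", "query") for TQ."""
    return [term for f in fields for term in topic[f].lower().split()]


def dirichlet_score(query_terms, doc_terms, coll_tf, coll_len, mu=2500.0):
    """Log query likelihood with Dirichlet smoothing.

    Sums log((tf(t, d) + mu * P(t|C)) / (|d| + mu)) over all query terms,
    so a term that occurs in both title and query is counted twice.
    """
    doc_tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = coll_tf.get(t, 0) / coll_len
        if p_coll == 0.0:
            continue  # assumption: terms unseen in the collection are skipped
        score += math.log((doc_tf[t] + mu * p_coll) / (len(doc_terms) + mu))
    return score


# Toy collection and topic, invented for illustration.
docs = {
    "d1": "books about exploration and discovery".split(),
    "d2": "cooking recipes for beginners".split(),
}
coll_tf = Counter(t for terms in docs.values() for t in terms)
coll_len = sum(coll_tf.values())

topic = {"title": "exploration history", "query": "exploration books"}
tq = build_query(topic, ("title", "query"))
ranking = sorted(docs, key=lambda d: dirichlet_score(tq, docs[d], coll_tf, coll_len),
                 reverse=True)
print(tq, ranking)
```

Because the TQ query simply repeats terms shared by title and query, those terms contribute twice to the score, which is the term-frequency effect discussed in Section 2.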

Last year we crawled a large set of user profiles from LT members and used member catalogues and book ratings to rerank retrieval results based on nearest-neighbourhood recommendation. This year, we use the Bayesian average book ratings as document priors. That is, books that received ratings from LT members are boosted up the ranking with respect to books that received no ratings, and books with high ratings are boosted more than books with low ratings.

To normalise the ratings, we compute the Bayesian average of all the book ratings in the top 1000 results per topic. The Bayesian Average (BA) takes into account how many users have rated a work. As more users rate the same work, the average becomes more reliable and less sensitive to outliers. We make the BA dependent on the query, such that the BA of a book is based on books related to the query. The BA of a book b is computed as:

\[
BA(b) = \frac{\hat{n} \cdot \hat{m} + \sum_{r \in R(b)} r}{n + \hat{n}}
\]

where $R(b)$ is the set of ratings for $b$, $n = |R(b)|$ is the number of those ratings, $\hat{m}$ is the average unweighted rating over all books in the top 1000 results and $\hat{n}$ is the average number of ratings over all the books in the top 1000.

A rating for book b can range from 0.5 up to 5, in increments of 0.5. For books with no rating we use BA = 0. Each BA can be turned into a prior probability by dividing it by the maximum rating BA_max = 5. For books with no rating this would result in a prior probability of zero. To avoid multiplying by zero, we use the Add-One smoothing method, so that books with no rating get a base score of 1 and books with ratings get 1 + BA, and compute the prior as:

\[
P_{BA}(d) = \frac{1 + BA(d)}{1 + BA_{max}}
\]

The final document score is then:

\[
S_{BA}(d) = P(d \mid q) \cdot P_{BA}(d)
\]
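The rating prior can be sketched in a few lines of code. This is an illustrative sketch rather than the code used for the submitted runs. It assumes that the average rating is the mean of per-book mean ratings over the rated books in the result list, that the average number of ratings is taken over all books in the list, and that the retrieval score is available as a probability rather than a log probability. The function names and the toy data are hypothetical.

```python
def bayesian_average(ratings, m_hat, n_hat):
    """BA(b) = (n_hat * m_hat + sum of ratings) / (n + n_hat); 0 for unrated books."""
    if not ratings:
        return 0.0
    return (n_hat * m_hat + sum(ratings)) / (len(ratings) + n_hat)


def rerank_with_prior(results, ratings_by_book, ba_max=5.0):
    """Rerank one topic's results, given as (book_id, P(d|q)) pairs."""
    per_book = [ratings_by_book.get(book, []) for book, _ in results]
    rated = [r for r in per_book if r]
    # Assumption: m_hat is the mean of per-book mean ratings over rated books,
    # n_hat is the mean number of ratings over all books in the result list.
    m_hat = sum(sum(r) / len(r) for r in rated) / len(rated) if rated else 0.0
    n_hat = sum(len(r) for r in per_book) / len(per_book) if per_book else 0.0
    rescored = []
    for (book, p_dq), ratings in zip(results, per_book):
        # Add-one smoothed prior: unrated books get 1 / (1 + ba_max), never zero.
        prior = (1.0 + bayesian_average(ratings, m_hat, n_hat)) / (1.0 + ba_max)
        rescored.append((book, p_dq * prior))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy example: book "c" is retrieved first but has no ratings.
results = [("c", 0.03), ("b", 0.02), ("a", 0.02)]
ratings_by_book = {"a": [5.0, 5.0], "b": [1.0]}
reranked = rerank_with_prior(results, ratings_by_book)
print(reranked)
```

In the toy example the unrated book drops below both rated books, and the highly rated book overtakes the poorly rated one even though their retrieval scores are equal.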

We submitted six runs:

- inex13SBS.ti_qu: the TQ run.
- inex13SBS.ti_qu_gr_na: the TQGN run.
- inex13SBS.ti.bayes_avg.LT_rating: the T run with the Bayes LT rating prior.
- inex13SBS.qu.bayes_avg.LT_rating: the Q run with the Bayes LT rating prior.
- inex13SBS.ti_qu.bayes_avg.LT_rating: the TQ run with the Bayes LT rating prior.
- inex13SBS.ti_qu_gr_na.bayes_avg.LT_rating: the TQGN run with the Bayes LT rating prior.

In the next section we discuss the evaluation results of the official submissions and, separately, of all our own runs.

4 Results

We first show the evaluation results over the whole topic set. Then we present a per-topic analysis of the differences in performance between the different topic representations.

Table 2: Top 10 runs in the official evaluation

Rank  Run
1     run3.all-plus-query.all-doc-fields
2     inex13SBS.ti_qu_gr_na.bayes_avg.LT_rating
2     inex13SBS.ti_qu.bayes_avg.LT_rating
4     run1.all-topic-fields.all-doc-fields
5     inex13SBS.ti_qu_gr_na
6     inex13SBS.ti_qu
7     run_ss_bsqstw_stop_words_free_member_free_2013
8     run_ss_bsqstw_stop_words_free_2013
8     inex13SBS.qu.bayes_avg.LT_rating
10    inex13SBS.ti.bayes_avg.LT_rating

This year, eight groups participated in the track, submitting a total of 32 runs. Our official submissions are all among the top 10 systems, as shown in Table 2. The top four systems are close together in terms of performance, as are the systems on ranks five up to nine. Our systems perform on par with the best other systems.

We show the evaluation results of our own runs in Table 3. Significant differences are tested using the bootstrap method (one-tailed with 100,000 samples) at significance levels 0.05, 0.01 and 0.001. In the top half of the table we see the base runs without Bayesian average rating priors. Significance tests are with respect to the title-only (T) run. Somewhat surprisingly, the title-only (T) and query-only (Q) representations lead to similar performance. The mediated query does not improve the representation of the information need. However, the combination of title and mediated query (TQ) gives significantly better performance than either in isolation. This reflects the fact that the query is not simply a copy of the thread title, but either adds complementary relevant terms or gives more weight to the most relevant terms by repeating them, or both.
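For readers unfamiliar with bootstrap significance testing, here is a minimal sketch of one common one-tailed paired variant. The exact variant used in the experiments is not specified beyond being one-tailed with 100,000 samples, so the shift-to-null resampling scheme, the function name and the toy scores below are assumptions.

```python
import random


def paired_bootstrap_p(baseline, system, samples=100_000, seed=0):
    """One-tailed paired bootstrap p-value for 'system is better than baseline'.

    baseline and system hold per-topic scores in the same topic order.
    """
    diffs = [s - b for b, s in zip(baseline, system)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Shift the differences to mean zero so that resampling simulates the null.
    centred = [d - observed for d in diffs]
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        resample_mean = sum(rng.choice(centred) for _ in range(n)) / n
        if resample_mean >= observed:
            hits += 1
    return hits / samples


# Toy per-topic scores, invented for illustration.
t_run = [0.10, 0.20, 0.30, 0.10, 0.20, 0.40, 0.00, 0.30]
gains = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10, 0.13, 0.07]
tq_run = [score + gain for score, gain in zip(t_run, gains)]

# Fewer samples than in the paper, to keep the example fast.
print(paired_bootstrap_p(t_run, tq_run, samples=2000))
```

A pure Python loop over 100,000 resamples of several hundred topics is slow, so a vectorised implementation would be preferable in practice.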

Adding the group name to the title and query (TQG) further improves performance, reflecting the users' ability to pick relevant discussion groups for their needs. However, adding the more detailed narrative hurts performance for early precision (nDCG@10, P@10 and Mean Reciprocal Rank (MRR)) while improving Mean Average Precision (MAP). It seems the narrative is not focused enough to precisely pinpoint the suggested books, but its larger set of query terms does lead to better recall.
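The contrast between early precision and MAP is easier to see with the measures written out. The sketch below gives textbook definitions of precision at k, reciprocal rank and average precision for a single topic under binary relevance. This is a simplification for illustration (nDCG@10, which uses graded relevance, is left out), and the book identifiers are invented.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


ranking = ["b1", "b2", "b3", "b4"]
relevant = {"b2", "b4"}
print(precision_at_k(ranking, relevant), reciprocal_rank(ranking, relevant),
      average_precision(ranking, relevant))
```

Average precision rewards every relevant book found anywhere in the ranking, whereas P@10 and reciprocal rank only look at the top, so a longer query that retrieves more relevant books at lower ranks can raise MAP while lowering early precision.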

In the bottom half of Table 3 we show the six runs with Bayesian average rating priors. Again, significant differences are with respect to the title-only run, here with the rating prior (T_BA). The rating priors lead to improvements on all reported measures for all six baseline runs. Among the runs with rating priors we see the same patterns as among the baseline runs. The T and Q representations lead to similar performance but their combination leads to better performance. The group name improves the topic representation but the narrative hurts early precision while improving MAP. We also tested the improvements of the rating prior runs over their baseline forms and found that all improvements are significant at p < 0.001, except for the TQGN run, where the improvements are significant at p < 0.05. This shows the reliability of the rating priors.

In sum, the title and query representations are equally effective but complementary to each other. The group name can further improve performance while the narrative seems to add too many partly relevant and irrelevant terms. The LT ratings, if normalised by taking the Bayesian average, form a reliable document prior probability of relevance.

5 Per-Topic Analysis

We show the per-topic differences between two runs for nDCG@10 in Table 4. The Q run has lower scores for 74 topics compared to the T run (column 2), higher scores for 69 topics and the same scores as the T run for 237 topics. These two runs are balanced, which explains why they lead to similar average scores, but the large number of topics for which the two runs get the same score suggests that in most cases the mediated query is very similar to the thread title. It also suggests that creating an effective representation of the information need is far from trivial, even for trained annotators. Some of their mediated queries improve upon thread titles that do not or only partly reflect the often complex information needs in social book search [3]. But even more mediated queries express the search topic less well than the title created by the topic creator.

Next we compare the TQ with the T and Q runs (rows 3 and 4). These are less balanced, with TQ outperforming T on 74 topics and Q on 76 topics while T outperforms TQ for only 50 topics and Q outperforms TQ for 49 topics. This explains why the combination of the two representations scores higher on average than either on its own. Because T and Q are often very similar, their combination also often results in the same score.

Finally, we compare the per-topic scores of the TQ representation with the richer representations TQG, TQN and TQGN. The TQG run improves performance on more topics than it hurts, which corresponds with an improvement in the average score. The representations that include the narrative, TQN and TQGN, both worsen performance with respect to the TQ representation on more topics than they improve, corresponding to a drop in nDCG@10. What is surprising is that including the much longer narrative in the representation does not affect the per-topic score for the majority of topics. There are several possible explanations for this. It could be that the additional terms often provide the same relevance signal as the TQ terms, or introduce random noise. Another explanation is that the TQ terms are frequently repeated in the narrative and therefore have a dominant impact on the retrieval score.
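The per-topic comparison amounts to counting, for a pair of runs, the topics on which one run scores lower than, the same as, or higher than the other. A minimal sketch follows, with hypothetical topic identifiers and scores and a small tolerance for floating-point equality.

```python
def compare_runs(scores_a, scores_b, eps=1e-9):
    """Count topics where run B scores lower than, the same as, or higher than run A.

    scores_a and scores_b map topic ids to a per-topic measure such as nDCG@10.
    """
    counts = {"lower": 0, "same": 0, "higher": 0}
    for topic, a in scores_a.items():
        b = scores_b[topic]
        if abs(a - b) <= eps:
            counts["same"] += 1
        elif b < a:
            counts["lower"] += 1
        else:
            counts["higher"] += 1
    return counts


# Invented per-topic scores for a T run and a Q run.
t_scores = {"t1": 0.30, "t2": 0.00, "t3": 0.15, "t4": 0.50}
q_scores = {"t1": 0.30, "t2": 0.10, "t3": 0.05, "t4": 0.50}
print(compare_runs(t_scores, q_scores))
```

A balanced lower/higher count with many ties, as in this toy output, corresponds to the situation described for the T and Q runs.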

To summarise, the different query representations often carry the same signal, which may be because the same content terms dominate in the representations. It seems hard to improve upon the title created by the topic starter, but combining the concise representations of topic starter and annotator more often results in an improved representation than in a worse one.

6 Conclusion

In this paper we discussed our participation in the INEX 2013 Social Book Search Track, in which we focus on the impact of different query representations of the information needs on retrieval effectiveness. The LT members who start a topic thread to ask for book suggestions on the discussion forums provide multiple perspectives on their information needs. The thread title is a short summary, the first message in the thread is a detailed description and the choice of the particular discussion group reveals the relevant general category of books for which they hope to find knowledgeable members. In addition, the task organisers provided mediated queries that aim to be both concise and comprehensive expressions of the information need, and that are suitable as search engine queries.

The mediated query is in general slightly shorter than the thread title, and typically contains a few overlapping terms and one or a few different content terms. By combining the representations, the overlapping terms in the title and query, which we assume are the most relevant terms, receive extra weight.

The group name is short but also tends to add a few new terms to the representation with respect to the title and query. The narrative is much longer and adds many terms, relevant or not, to the representation.

In terms of the impact of representations on retrieval effectiveness, the title and mediated query are equally effective. Their combination, however, leads to significant improvements over using the title alone, which is either due to the higher frequency of the most important terms or to the complementary content terms. However, for most topics, the title, query and their combination lead to the same retrieval performance. Adding the group name improves performance, indicating that the user selected a relevant discussion group for her information need. Adding the narrative degrades performance slightly, which may be because of the addition of irrelevant or partly relevant terms that broaden the scope of the query. These findings suggest that creating comprehensive and effective topic representations that identify all the important relevance aspects in social book search information needs is not easy, even for trained annotators. Such topics often contain complex, multi-faceted aspects, which may be the reason why users turn to the forum in the first place, as current book search systems provide limited options to express complex needs.

We also experimented with reranking results by combining the retrieval score with a prior probability based on the Bayesian average of a book's LibraryThing ratings. These average ratings provide a reliable probability of relevance and lead to significant improvements in performance.

In future work we will look in more detail at the overlap and complementarity of the title and mediated query and the role of term frequencies in topic representations of the complex information needs in social book search. We will also study the role of the detailed narrative and experiment with extracting the most salient additional terms to improve the topic representations. One way would be to use parsimonious language models [2] to remove common conversational terms.
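To indicate what such a step could look like, the sketch below estimates a parsimonious language model for a narrative with the EM procedure described in [2]. It illustrates a possible direction and is not an experiment reported in this paper: the mixing weight, the pruning threshold, the toy narrative and the background probabilities are all assumed values.

```python
from collections import Counter


def parsimonious_lm(doc_terms, coll_prob, lam=0.1, iterations=20, threshold=1e-4):
    """Estimate P(t|D) so that terms well explained by the collection model lose mass.

    E-step: e_t = tf(t, D) * lam * P(t|D) / ((1 - lam) * P(t|C) + lam * P(t|D))
    M-step: P(t|D) = e_t / sum of e_t, after which low-probability terms are pruned.
    """
    tf = Counter(doc_terms)
    total = sum(tf.values())
    p_doc = {t: count / total for t, count in tf.items()}  # maximum likelihood start
    for _ in range(iterations):
        expected = {}
        for t, p in p_doc.items():
            # Assumption: terms missing from coll_prob get a tiny background probability.
            background = (1.0 - lam) * coll_prob.get(t, 1e-9)
            expected[t] = tf[t] * lam * p / (background + lam * p)
        norm = sum(expected.values())
        kept = {t: e / norm for t, e in expected.items() if e / norm >= threshold}
        kept_norm = sum(kept.values())
        p_doc = {t: p / kept_norm for t, p in kept.items()}
    return p_doc


# Toy narrative and made-up background probabilities.
narrative = ("i am looking for books about the miracles of jesus "
             "i would love any suggestions for books").split()
coll_prob = {"i": 0.05, "am": 0.02, "looking": 0.01, "for": 0.04, "books": 0.01,
             "about": 0.02, "the": 0.08, "of": 0.05, "would": 0.01, "love": 0.005,
             "any": 0.01, "suggestions": 0.002, "miracles": 0.00001, "jesus": 0.0001}
model = parsimonious_lm(narrative, coll_prob)
print(sorted(model.items(), key=lambda item: item[1], reverse=True)[:3])
```

Terms that are frequent in the background model, such as conversational function words, end up with little or no probability mass, while topical terms such as miracles gain mass relative to their maximum likelihood estimate.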

Acknowledgments This research was supported by the Netherlands Organization for Scientific Research (NWO projects # 612.066.513, 639.072.601, and 640.005.001) and by the European Community's Seventh Framework Program (FP7 2007/2013, Grant Agreement 270404).

[1] F. Adriaans, M. Koolen, and J. Kamps. The importance of document ranking and user-generated content for faceted search and book suggestions. In S. Geva, J. Kamps, and R. Schenkel, editors, Focused Retrieval of Content and Structure: 10th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2011), volume 7424 of LNCS. Springer, 2012.

[2] D. Hiemstra, S. Robertson, and H. Zaragoza. Parsimonious language models for information retrieval. In Proceedings SIGIR 2004, pages 178–185. ACM Press, New York NY, 2004.

[3] M. Koolen, J. Kamps, and G. Kazai. Social Book Search: The Impact of Professional and User-Generated Content on Book Suggestions. In Proceedings of the International Conference on Information and Knowledge Management (CIKM 2012). ACM, 2012.

[4] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: a language-model based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis, 2005.