BM25 for Non-Textual Modalities in Social Book Search

Melanie Imhof1,2

1 Université de Neuchâtel, Neuchâtel, Switzerland
2 Zurich University of Applied Sciences, Winterthur, Switzerland
imhf@zhaw.ch

Abstract. The Social Book Search (SBS) lab at CLEF 2016 provides a complex test collection that gives the opportunity to experiment with retrieval methods that combine various modalities in order to achieve the best possible ranked list. We show how the idea of being "characteristic", which is the core concept of most weighting schemes used for textual modalities, can be applied to non-textual modalities. Our approach re-defines BM25 for the three non-textual modalities found in the SBS collection: ratings, price, and number of pages. A fuzzy query is constructed from the user preferences inferred from the user's catalog. The results are used to re-rank a textual baseline, which significantly improves the retrieval effectiveness.

Keywords: BM25, non-textual modalities, fuzzy query.

1 Introduction

The suggestion track of the INEX Social Book Search (SBS) lab at CLEF 2016 allows researchers to evaluate their methods on a multimodal collection with queries constructed from real LibraryThing user requests. For the books in the collection, not only the book meta-information from Amazon (description, binding, number of pages, price, etc.) is available, but also user-generated information such as book ratings. In addition, the personal catalogs of the users are given and can be used to infer user preferences.

In our SBS 2015 participation [2], we found that the user preferences can be used to improve the retrieval effectiveness by incorporating the books read by the users into a random forest based learning-to-rank approach. In this participation, we focus on taking the user preferences into account with a different approach. BM25 is a well-known weighting scheme that has been widely used in text retrieval. It was originally developed for the English language, but it has proven to be useful for other languages as well as for image retrieval [3]. We show how BM25 can be applied to the modalities ratings, price, and number of pages. The BM25 scores of these non-textual modalities are then used to re-rank the textual baseline and significantly improve it.

2 Retrieval Models

2.1 Textual Models

Similar to our participation in 2015, we employ a textual baseline [2] as the basis for our methods. For the textual score, we merge all textual fields of a document into a single textual index field and construct queries from the two topic fields title and request, which are analogously merged into a single textual representation. Further, we use the example books mentioned by the topic creators to expand the queries with the 35 most characteristic terms, where the most characteristic terms of the example books are computed with BM25. Additionally, we filter the books already read by the topic creator from the final ranked list, since this is a hard criterion in the relevance assessments [1]; the read books are determined from the catalog of the topic creator.
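The query expansion step can be illustrated with a minimal sketch. This is not the implementation used for the official runs (which relies on Lucene); the function names, the whitespace tokenization, and the flat list of example-book terms are simplifying assumptions made purely for illustration.

```python
from collections import Counter
from math import log


def bm25_term_weights(doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Weight every term of the merged example-book text with BM25,
    i.e. the product of the ff saturation and the idf component."""
    ff = Counter(doc_terms)
    doc_len = len(doc_terms)
    weights = {}
    for term, f in ff.items():
        tf_part = f / (k1 * ((1 - b) + b * doc_len / avg_len) + f)
        idf_part = log((0.5 + n_docs - doc_freq.get(term, 0)) / (0.5 + doc_freq.get(term, 0)))
        weights[term] = tf_part * idf_part
    return weights


def expand_query(title, request, example_book_terms, doc_freq, n_docs, avg_len, n_expand=35):
    """Merge the topic fields title and request into one query and append the
    35 most characteristic terms of the example books mentioned by the topic creator."""
    query_terms = (title + " " + request).lower().split()
    weights = bm25_term_weights(example_book_terms, doc_freq, n_docs, avg_len)
    top_terms = sorted(weights, key=weights.get, reverse=True)[:n_expand]
    return query_terms + top_terms
```

In the actual system, tokenization and stemming are handled by Lucene's EnglishAnalyzer (see Section 3.1).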
2.2 BM25 Model for Non-Textual Modalities

BM25 can be described in terms of how it combines three components: the feature frequency (ff), the document frequency (df), and the document length normalization component [5]. Although it was originally developed for retrieval on English-language text, it has generalized well to many related tasks, such as multilingual retrieval and multimedia retrieval. The ff and the df make sure that "characteristic" features are weighted heavily, where a characteristic feature is one that appears frequently in the document under consideration (ff) and rarely in the remainder of the collection (df). This concept of "being characteristic" is quite general and therefore applicable to other non-textual modalities [4], e.g., bags of visual words in image retrieval, locations in geographical IR, or timestamps in time-aware IR.

The retrieval status value (RSV) of document $d_j$ with respect to query $q$ when using BM25 is defined as

$$w(\varphi_k, d_j) := \frac{\mathrm{ff}(\varphi_k, d_j)}{k_1\left((1-b) + b\,\frac{l_j}{\Delta}\right) + \mathrm{ff}(\varphi_k, d_j)} \qquad (1)$$

$$w(\varphi_k, q) := \mathrm{ff}(\varphi_k, q) \cdot \log\frac{0.5 + N - \mathrm{df}(\varphi_k)}{0.5 + \mathrm{df}(\varphi_k)} \qquad (2)$$

$$\mathrm{RSV}_{\mathrm{BM25}}(q, d_j) := \sum_{\varphi_k \in \Phi(q) \cap \Phi(d_j)} w(\varphi_k, d_j) \cdot w(\varphi_k, q), \qquad (3)$$

where $k_1$ is the ff saturation parameter and $b$ is the document length normalization parameter. The $k_1$ parameter controls how much an increase in ff contributes to the score. The notation used for BM25 and its non-textual adaptations is described in Table 1. For the suggestion track of the SBS lab at CLEF 2016, we adapt BM25 to three non-textual modalities, namely the ratings, the price, and the number of pages, and use it to re-rank the textual baseline.

Table 1. Notation used for BM25 for textual and non-textual modalities.

  D             set of documents
  N             number of documents
  d_j           single document
  q             single query
  Φ             indexing vocabulary
  φ_k           single indexing feature
  l_j           length of document d_j
  Δ             average document length in number of tokens
  Φ(d_j)        set of features representing document d_j
  Φ(q)          set of features representing query q
  w(φ_k, d_j)   weight of feature φ_k for document d_j
  w(φ_k, q)     weight of feature φ_k for query q
  ff(φ_k, d_j)  frequency of feature φ_k in document d_j
  df(φ_k)       document frequency of feature φ_k

Ratings

For the ratings, we do not have per-user query information, but we assume that, in general, users prefer books with higher ratings. Therefore, we define the query in the following way:

$$\Phi(q) := \{1, 2, 3, 4, 5\} \qquad (4)$$

$$\mathrm{ff}(\varphi_k, q) := \varphi_k. \qquad (5)$$

With this definition, each possible rating (1-5) is part of the query; however, a rating of 5 is weighted five times more heavily than a rating of 1. The definition of the feature frequencies ff(φ_k, d_j), the document frequencies df(φ_k), and the document lengths l_j is analogous to the definition used for text. Hence, the ff is the number of times a given rating appears in a document, the df is the number of documents that contain a given rating, and the document length is the number of ratings of a document.

Price

For the price, we use the average price Δ_p(q) of the books that the user has already read as the query information. Since an exact match of the price is not meaningful, we use a fuzzy search with Δ_p(q) as the search parameter. We assume that a user would also like books that are at most 20% cheaper and at most 30% more expensive than the average price of the books in their library. Although we assume that a cheaper book is generally acceptable, we still set a lower bound, because people tend to like similar kinds of books, which are usually in the same price range. The query's set of features and feature frequencies are defined as

$$\Phi(q) := \left]\,0.8 \cdot \Delta_p(q),\ 1.3 \cdot \Delta_p(q)\,\right[ \qquad (6)$$

$$\mathrm{ff}(\varphi_k, q) := \begin{cases} \dfrac{1.3 \cdot \Delta_p(q) - \varphi_k}{0.3 \cdot \Delta_p(q)} & \text{if } \varphi_k \geq \Delta_p(q) \\[2ex] \dfrac{\varphi_k - 0.8 \cdot \Delta_p(q)}{0.2 \cdot \Delta_p(q)} & \text{if } \varphi_k < \Delta_p(q). \end{cases} \qquad (7)$$

For the definition of the df, we bin the prices into bins with a quadratically increasing width, as shown in Figure 1. The bin index for a price $p$ is defined as

$$\mathrm{bin}(p) = \left\lfloor \frac{\sqrt{p}}{2} \right\rfloor. \qquad (8)$$

This is based on the assumption that, with increasing prices, the tolerance for two book prices to be considered comparable grows. The df is then defined as the number of documents with a price in a given bin. Since a book only has a single value for the price, the ff and the document length are always 1.

[Fig. 1. Binning of prices to compute the document frequencies, as well as the fuzzy query range around the average price Δ_p(q) of the books of the topic creator (bin index plotted over prices from 0 to 500).]
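The fuzzy price query and the quadratic price binning of Eqs. (6)-(8) can be sketched as follows. This is an illustrative reading of the formulas, not the code used for the runs; the function names and the plain dictionaries are assumptions made for the sketch.

```python
import math


def price_bin(price):
    """Bin index with quadratically increasing bin width (Eq. 8)."""
    return int(math.sqrt(price) / 2)


def price_query_ff(price, avg_price):
    """Fuzzy query weight for a candidate book price (Eqs. 6-7): weight 1 at the
    average price of the user's catalog, falling to 0 at 0.8*avg and 1.3*avg."""
    if not (0.8 * avg_price < price < 1.3 * avg_price):
        return 0.0  # outside the fuzzy query range, the price is not part of the query
    if price >= avg_price:
        return (1.3 * avg_price - price) / (0.3 * avg_price)
    return (price - 0.8 * avg_price) / (0.2 * avg_price)


def price_df(collection_prices):
    """Document frequency per price bin: number of books whose price falls into
    each bin (for the price, the ff and the document length are always 1)."""
    df = {}
    for p in collection_prices:
        df[price_bin(p)] = df.get(price_bin(p), 0) + 1
    return df
```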
Number of Pages

For the number of pages of the books, we define the ff, the df, the document length, and the query analogously to the price.

3 Experimental Evaluation

Our goal in the experiments is to use the information present in the non-textual modalities to refine the result lists so that they reflect the users' preferences.

3.1 Experimental Setup

For the textual baseline, we used Lucene (https://lucene.apache.org/core/) for indexing and searching. For all modalities, we used BM25 with a document length normalization parameter b = 0.75 and an ff saturation parameter k1 = 1.2. For the textual modalities, we used the built-in EnglishAnalyzer, which removes a small set of stopwords and stems terms using the Porter stemming algorithm. For the re-ranking, we used a linear combination of the scores from the different modalities, where the weights of the linear combination sum up to one:

$$\mathrm{RSV}_{\mathrm{BM25}} = \alpha \cdot \mathrm{RSV}^{\mathrm{text}}_{\mathrm{BM25}} + \beta \cdot \mathrm{RSV}^{\mathrm{rating}}_{\mathrm{BM25}} + \gamma \cdot \mathrm{RSV}^{\mathrm{price}}_{\mathrm{BM25}} + \delta \cdot \mathrm{RSV}^{\mathrm{pages}}_{\mathrm{BM25}} \qquad (9)$$
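A minimal sketch of this linear score combination is shown below. The dictionary-based score containers, and defaulting a missing non-textual score to zero, are assumptions made for the sketch; the default weights correspond to Run6 listed below.

```python
def rerank(text_scores, rating_scores, price_scores, pages_scores,
           alpha=0.265, beta=0.045, gamma=0.225, delta=0.465):
    """Re-rank the books retrieved by the textual baseline with the linear
    combination of Eq. (9); each *_scores argument maps a book id to its BM25 score."""
    combined = {}
    for book_id, text_score in text_scores.items():
        combined[book_id] = (alpha * text_score
                             + beta * rating_scores.get(book_id, 0.0)
                             + gamma * price_scores.get(book_id, 0.0)
                             + delta * pages_scores.get(book_id, 0.0))
    # return the book ids sorted by decreasing combined retrieval status value
    return sorted(combined, key=combined.get, reverse=True)
```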
In order to validate the effectiveness of our approaches and to find the optimal re-ranking parameters, we used the topics and relevance assessments from SBS 2015. For our participation in the INEX SBS 2016 track, we built six runs by applying different configurations (re-ranking parameters equal to zero are omitted):

– Run1: Textual baseline using BM25 with example-based relevance feedback using 35 terms and read-book filtering, with re-ranking parameters: α = 1.
– Run2: Textual baseline re-ranked with a query-independent BM25 model for the ratings, with re-ranking parameters: α = 0.7818, β = 0.2182.
– Run3: Textual baseline re-ranked with a user-catalog-based BM25 model for the number of pages, with re-ranking parameters: α = 0.3118, δ = 0.6882.
– Run4: Textual baseline re-ranked with a user-catalog-based BM25 model for the price, with re-ranking parameters: α = 0.2332, γ = 0.7668.
– Run5: Textual baseline re-ranked with a user-catalog-based BM25 model for the price and the number of pages, with re-ranking parameters: α = 0.2225, γ = 0.3033, δ = 0.4742.
– Run6: Textual baseline re-ranked with a query-independent BM25 model for the ratings and a user-catalog-based BM25 model for the price and the number of pages, with re-ranking parameters: α = 0.265, β = 0.045, γ = 0.225, δ = 0.465.

In the next section, we discuss the evaluation results of our official submission.

3.2 Results and Discussion

Table 2 summarizes our official results at SBS 2016, evaluated using nDCG@10 (Normalized Discounted Cumulative Gain), MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), and R@1000 (Recall), with nDCG@10 being the official evaluation measure. The submitted run using all non-textual modalities to re-rank the textual baseline (Run6), the run using the price and the number of pages (Run5), as well as the run using the number of pages (Run3) significantly improve the nDCG@10 over the textual baseline (Run1). The significance is computed using a paired randomization test [7] with significance level α = 5%. Using just the number of pages (Run3) leads to the highest nDCG@10 among our submitted runs. Using just the ratings for the re-ranking (Run2) increases the nDCG@10 over the textual baseline, although not significantly. Our re-ranking with the scores calculated based on the price (Run4) does not help to find a better ranked list.

Table 2. Official results at SBS 2016. The runs are ranked according to nDCG@10. Statistically significant differences in nDCG@10 with respect to the textual baseline (Run1), according to a paired randomization test [7] with significance level α = 5%, are marked with *.

Rank  Run   Features                      nDCG@10   MRR     MAP     R@1000
25    Run3  text, pages                   0.0674*   0.1512  0.0472  0.2556
26    Run6  text, price, pages, ratings   0.0667*   0.1499  0.0462  0.2556
27    Run5  text, price, pages            0.0665*   0.1442  0.0461  0.2556
30    Run2  text, ratings                 0.0584    0.1332  0.0419  0.2556
31    Run1  text                          0.0561    0.1251  0.0396  0.2556
32    Run4  text, price                   0.0542    0.1114  0.0386  0.2556

To further analyze the results, we also evaluated the performance of the non-textual modalities on their own. To this end, we took the documents retrieved with the textual baseline and ranked them only based on the score of each non-textual modality. This does not lead to a ranked list that is fully independent of the textual baseline (e.g., the recall does not change), but it gives an indication of how well the modalities would perform on their own. Using this approach, the nDCG@10 for the ratings is 0.0206, for the price it is 0.0258, and for the number of pages it is 0.0135. Surprisingly, the price on its own results in the highest nDCG@10, although this is not reflected in the runs that combine the non-textual modalities with the textual baseline. We also trained the weights for the modalities with the relevance assessments of the 2016 task and found that, with the optimal weights, the textual baseline can also be improved by taking the price into account. Hence, the weights chosen based on the 2015 task are not optimal. Nevertheless, the nDCG@10 for the runs using the number of pages (0.0706) and the ratings (0.0647) with optimal weights is still higher than for the run using the price (0.0596). This shows that either there is a higher information overlap between the price and the textual modality than between the other modalities and the text, or the linear combination merging is not as effective for the price as for the others.
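The paired randomization test used for the significance statements above can be sketched as follows. This is a generic two-sided test over per-topic nDCG@10 values in the spirit of [7], not the exact implementation used for the official evaluation; the function and parameter names are illustrative.

```python
import random


def paired_randomization_test(scores_a, scores_b, n_permutations=100_000, seed=42):
    """Two-sided paired randomization test on per-topic effectiveness scores
    (e.g., nDCG@10) of two runs; returns the estimated p-value."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # under the null hypothesis, the assignment of the two per-topic scores
        # to the runs can be swapped independently for every topic
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations  # significant at alpha = 5% if below 0.05
```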
4 Conclusions

In this paper, we described our participation in the suggestion track of the INEX SBS 2016 lab. We investigated how the weighting scheme BM25 can be applied to non-textual, continuous modalities. To this end, we proposed a method to discretize the continuous modalities in order to define a document frequency, together with a fuzzy query that takes into account that the query does not require an exact match. By applying our approach to the ratings, the prices, and the number of pages, we showed that the effectiveness of the system can be significantly increased over the textual baseline using a simple linear score combination. However, the performance of our random forest based learning-to-rank approach from 2015 cannot be reached.

Our experiments have shown that merging the price scores with the textual scores leads to a smaller improvement than could be expected based on the performance of the non-textual modalities individually. So far, we have not investigated this merging in more depth. It is possible that a different merging method could improve the combination with the price. For example, we could use a non-linear combination of the scores, or a per-query normalization strategy such as the z-score [6], to avoid the per-query optimal weights being far apart. Further, we would like to investigate whether the function used for the fuzzy search is the best possible; for example, we could use different parameters or a non-linear falloff of the weighting. So far, we approximated the user preferences by the average price and the average number of pages of the books read by the user. However, it would also be possible to construct the query such that each price and each number of pages is part of the query, so that the loss of information due to averaging is avoided.

References

1. Bogers, T., Koolen, M., Kamps, J., Kazai, G., Preminger, M.: Overview of the INEX 2014 Social Book Search track. In: Conference and Labs of the Evaluation Forum, pp. 462–479 (2014)
2. Imhof, M., Badache, I., Boughanem, M.: Multimodal social book search. In: Sixth International Conference of the CLEF Association, CLEF 2015 (2015)
3. Moulin, C., Barat, C., Ducottet, C.: Fusion of tf.idf weighted bag of visual features for image classification. In: 2010 International Workshop on Content-Based Multimedia Indexing (CBMI), pp. 1–6. IEEE (2010)
4. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers (2009)
5. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523 (1988)
6. Savoy, J., Berger, P.Y.: Monolingual, bilingual, and GIRT information retrieval at CLEF-2005. In: Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Vienna, Austria, September 21–23, 2005, Revised Selected Papers, pp. 131–140. Springer, Berlin, Heidelberg (2006)
7. Smucker, M.D., Allan, J., Carterette, B.: A comparison of statistical significance tests for information retrieval evaluation. In: CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 623–632. ACM, New York, NY, USA (2007)