Cross-Document Search Engine For Book Recommendation

Chahinez Benkoussas
Aix-Marseille Université, CNRS, LSIS UMR 7296, 13397 Marseille, France — chahinez.benkoussas@lsis.org
Aix-Marseille Université, CNRS, CLEO OpenEdition UMS 3287, 13451 Marseille, France — chahinez.benkoussas@openedition.org

Patrice Bellot
Aix-Marseille Université, CNRS, LSIS UMR 7296, 13397 Marseille, France — patrice.bellot@lsis.org
Aix-Marseille Université, CNRS, CLEO OpenEdition UMS 3287, 13451 Marseille, France — patrice.bellot@openedition.org

ABSTRACT
A new combination of multiple Information Retrieval approaches is proposed for book recommendation based on complex user queries. We used different theoretical retrieval models — a probabilistic model, InL2 (a Divergence From Randomness model), and language models — and tested their interpolated combination. We also applied a graph-based algorithm in a new retrieval approach over a network of related documents built from social links. We call this network, constructed from documents and the social information attached to each of them, a Directed Graph of Documents (DGD). Specifically, this work tackles the problem of book recommendation in the context of the CLEF Labs, precisely the Social Book Search track. After analyzing users' needs, we separated the query set into two genres, "Analogue" and "Non-Analogue", and established a specific search strategy for each. A series of reranking experiments demonstrates that combining retrieval models and exploiting linked documents yields significant improvements in terms of standard ranked retrieval metrics. These results extend the applicability of link analysis algorithms to new environments.

Keywords
Document retrieval, InL2, language model, book recommendation, PageRank, graph modeling, Social Book Search.

1. INTRODUCTION
There has been much work in both industry and academia on developing new approaches to improve the performance of retrieval and recommendation systems over the last decade. The aim is to help users deal with information overload and to provide recommendations for books, restaurants or movies. Some vendors, such as Amazon, have incorporated recommendation capabilities into their commerce services.

Existing document retrieval approaches need to be improved to satisfy users' information needs. Most systems use classic information retrieval models, such as language models or probabilistic models. Language models have been applied with a high degree of success in information retrieval applications [29–31]. They were first introduced by Ponte and Croft [27], who proposed a method to score documents, called query likelihood, in two steps: estimate a language model for each document, then rank documents according to the likelihood scores produced by the estimated language models. The Markov Random Field model, proposed by Metzler and Croft [19], considers query term proximity in documents by estimating term dependencies in the context of the language modeling approach. Alternatively, the Divergence From Randomness model, proposed by Amati and Van Rijsbergen [2], measures the global informativeness of a term in the document collection. It is based on the idea that "the more the term occurrences diverge from random throughout the collection, the more informative the term is" [28]. One limit of such models is that the distance between query terms in documents is not considered.
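To make the query-likelihood idea concrete, the following is a minimal sketch of the two-step scoring described above, using the Dirichlet smoothing that this paper later applies with Indri (µ = 1500). It is an illustration, not the code of any of the cited systems; the token lists and the collection term-frequency table are assumed inputs.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, coll_tf, coll_len, mu=1500):
    """Dirichlet-smoothed log query likelihood of one document."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = coll_tf.get(t, 0) / coll_len
        if p_coll == 0.0:
            continue  # term unseen in the whole collection: skip it here
        # step 1: per-document language model, smoothed by collection stats
        score += math.log((tf[t] + mu * p_coll) / (dlen + mu))
    return score

def rank(query_terms, docs, coll_tf, coll_len):
    # step 2: rank documents by the likelihood their model gives the query
    return sorted(docs, reverse=True,
                  key=lambda d: query_likelihood(query_terms, docs[d],
                                                 coll_tf, coll_len))
```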
Users' queries differ by their type of needs. In book recommendation, we identified two genres of queries, "Analogue" and "Non-Analogue", which we describe in the following sections. In this paper, the first proposed approach combines probabilistic and language models to improve retrieval performance, and we show that the two models perform much better together in the context of book recommendation.

In recent years, an important innovation in information retrieval has been the exploitation of relationships between documents, e.g. Google's PageRank [25]. It has been successful in Web environments, where the relationships are provided by hyperlinks between documents. We present a new approach for linking documents in order to construct a graph structure that is then used in the retrieval process, exploiting the PageRank algorithm for ranking documents with respect to users' queries. In the absence of manually created hyperlinks, we use social information to create a Directed Graph of Documents (DGD) and argue that it can be treated in the same manner as hyperlink graphs. Our experiments show that incorporating graph analysis algorithms in document retrieval improves performance in terms of standard ranked retrieval metrics.

Our work focuses on search in the book recommendation domain, in the context of the CLEF Labs Social Book Search track. We tested our approaches on a collection containing Amazon/LibraryThing book descriptions and a set of queries, called topics, extracted from the LibraryThing discussion forums.

CBRecSys 2015, September 20, 2015, Vienna, Austria. Copyright remains with the authors and/or original copyright holders.

2. RELATED WORK
This work is first related to the area of document retrieval models, more specifically language models and probabilistic models. Unigram language models are most often used for ad hoc Information Retrieval, but several researchers have explored the use of language modeling for capturing higher-order dependencies between terms. Bouchard and Nie [8] showed significant improvements in retrieval effectiveness with a new statistical language model for the query, based on completing the query with terms from the user's domain of interest, reordering the retrieval results, or expanding the query using lexical relations extracted from the user's domain of interest. Divergence From Randomness (DFR) is one of several probabilistic models that we have used in our work. Abolhassani and Fuhr [1] investigated several possibilities for applying Amati's DFR model [2] to content-only search in XML documents.

There has also been increasing use of techniques based on graphs constructed from implicit relationships between documents. Kurland and Lee [14] performed structural reranking based on centrality measures in graphs of documents generated from relationships induced by language models. In [16], Lin demonstrated the possibility of exploiting document networks defined by automatically generated content-similarity links for document retrieval in the absence of explicit hyperlinks. He integrated PageRank scores with a standard retrieval score and showed a significant improvement in ranked retrieval performance. His work focused on search in the biomedical domain, in the context of the PubMed search engine. Perhaps the main contrast with our work is that his links were not induced by generation probabilities or linguistic items.

3. INEX SOCIAL BOOK SEARCH TRACK AND TEST COLLECTION
The Social Book Search (SBS) task1 aims to evaluate the value of professional and user-generated metadata for book search on the Web. Its main goal is to exploit search techniques to deal with complex information needs and complex information sources that include user profiles, personal catalogs, and book descriptions.

The SBS task provides a collection of 2.8 million book descriptions crawled by the University of Duisburg-Essen from Amazon2 [4] and enriched with content from LibraryThing3, an online service that helps people catalog their books easily. Books are stored in XML files and identified by an ISBN. They contain information such as title information, the Dewey Decimal Classification (DDC) code (for 61% of the books), category, and the Amazon product description. Amazon records also contain social information generated by users, such as tags, reviews and ratings (see Figure 1). For each book, Amazon suggests a set of "Similar Products", the result of a computed similarity based on content information and user behavior (purchases, likes, reviews, etc.) [13].

Figure 1: Example of book from the Amazon/LibraryThing collection in XML format

The SBS task also provides a set of queries, called topics, in which users describe what they are looking for (books of a particular genre, books by particular authors, books similar to those they have already read, etc.). These requests for recommendations are natural expressions of information needs over a large collection of online book records. The topics are crawled from the LibraryThing discussion forums. The topic set consists of 680 topics in 2014. Each topic has a narrative description of the information need and other fields, as illustrated in Figure 2.

Figure 2: Example of topic, composed of multiple fields to describe the user's need(s)

1 http://social-book-search.humanities.uva.nl/
2 http://www.amazon.com/
3 http://www.librarything.com/

4. RETRIEVAL MODELS
This section describes the retrieval models we used for book recommendation, and their combination.

4.1 InL2 of Divergence From Randomness
We used InL2, the Inverse document frequency model with Laplace after-effect and normalization 2. This model has been used with success in different works [3, 6, 10, 26]. InL2 is a DFR-based model (Divergence From Randomness) built on the geometric distribution and the Laplace law of succession.

4.2 Sequential Dependence Model of Markov Random Field
Language models are widely used in document retrieval for book recommendation [6, 7]. Metzler and Croft proposed the Markov Random Field (MRF) model [18, 20], which integrates multi-word phrases in the query. Specifically, we used the Sequential Dependence Model (SDM), a special case of MRF in which the co-occurrence of query terms is taken into consideration. SDM builds on this idea by considering combinations of query terms with proximity constraints: single term features (standard unigram language model features, f_T), exact phrase features (words appearing in sequence, f_O) and unordered window features (words required to be close together, but not necessarily in an exact sequence order, f_U).

Finally, documents are ranked according to the following scoring function:

    SDM(Q, D) = λ_T · Σ_{q∈Q} f_T(q, D)
              + λ_O · Σ_{i=1}^{|Q|−1} f_O(q_i, q_{i+1}, D)
              + λ_U · Σ_{i=1}^{|Q|−1} f_U(q_i, q_{i+1}, D)

where the feature weights are set according to the authors' recommendation in [7] (λ_T = 0.85, λ_O = 0.1, λ_U = 0.05). f_T, f_O and f_U are the log maximum likelihood estimates of query terms in document D, computed over the target collection using Dirichlet smoothing. We applied this model to the queries using the Indri4 Query Language5.

4 http://www.lemurproject.org/indri/
5 http://www.lemurproject.org/lemur/IndriQueryLanguage.php
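The paper does not show a generated query, but under the stated weights the standard Indri rendering of SDM for a hypothetical three-term topic would look like the following (the terms are invented; #1(...) matches an exact ordered phrase and #uw8(...) an unordered window of width 8):

```
#weight( 0.85 #combine( dark tower gunslinger )
         0.10 #combine( #1( dark tower ) #1( tower gunslinger ) )
         0.05 #combine( #uw8( dark tower ) #uw8( tower gunslinger ) ) )
```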
4.3 Combining Search Systems
Combining the output of several search systems, rather than using just one, improves retrieval effectiveness, as shown in [5], where Belkin combined the results of probabilistic and vector space models. On this basis, we combined the probabilistic model InL2 with the language model SDM. This combination takes into account both the informativeness of query terms and their dependencies in the document collection. Each retrieval model uses a different weighting scheme, so the scores must be normalized. We used the maximum and minimum scores according to Lee's formula [15]:

    normalizedScore = (oldScore − minScore) / (maxScore − minScore)

It has been shown in [6] that the InL2 and SDM models have different levels of retrieval effectiveness; it is therefore necessary to weight the individual model scores depending on their overall performance. We used an interpolation parameter (α) that we varied to improve retrieval effectiveness.

5. GRAPH MODELING
In [17], the author exploited networks defined by automatically generated content-similarity links for document retrieval. We analyzed our documents to find a new way of linking them. In our case, we exploited a special type of similarity based on several factors. This similarity is provided by Amazon and corresponds to the "Similar Products" generally given for each book. The degree of similarity depends on social information, such as numbers of clicks or purchases, and on content-based information, such as book attributes (book description, book title, etc.). The exact formula used by Amazon to combine social and content-based information into a similarity is proprietary. The idea behind this linking method is that when documents are linked by this type of similarity, the probability that they belong to the same context is higher than if they are not connected.

To model the data as a DGD, we extracted the "Similar Products" links between documents in order to construct the graph structure, and then used it to enrich the results of the retrieval models, in the same spirit as pseudo-relevance feedback. Each node in the DGD represents a document (the Amazon description of a book) and has a set of properties:

• ID: the book's ISBN
• content: the book description, which includes many other properties (title, product description, author(s), users' tags, content of reviews, etc.)
• MeanRating: the average of the ratings attributed to the book
• PR: the book's PageRank

Edges in the DGD are directed and correspond to Amazon similarity: given nodes {A, B} ∈ S, if A points to B, then B is suggested as a Similar Product of A. Figure 3 shows an example of a DGD network of documents. The DGD network contains 1,645,355 nodes (89.86% of the nodes are within the collection, the rest are outside) and 6,582,258 edges.

Figure 3: Example of Directed Graph of Documents
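The paper does not give code for the graph construction. Assuming each parsed Amazon record exposes its ISBN, its mean rating and its list of "Similar Products" ISBNs (field names below are illustrative), a minimal NetworkX sketch of the DGD construction could be:

```python
import networkx as nx

def build_dgd(books):
    """Build the Directed Graph of Documents from parsed Amazon records.

    `books` is an iterable of dicts such as
    {"isbn": ..., "mean_rating": ..., "similar_products": [isbn, ...]}."""
    dgd = nx.DiGraph()
    for book in books:
        dgd.add_node(book["isbn"], mean_rating=book.get("mean_rating"))
        for similar_isbn in book.get("similar_products", []):
            # A -> B means B is suggested as a "Similar Product" of A
            dgd.add_edge(book["isbn"], similar_isbn)
    return dgd
```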
Figure 4 shows the general architecture of our document retrieval system, with a two-level document search. In this system, the IR Engine finds all relevant documents for the user's query; the Graph Search module then selects the resulting documents returned by the Graph Analysis module. The Graph Structured Data is a network constructed from the Social Information Matrix and enriched by the Compute PageRank module. The Social Information Matrix is built by two extraction modules, "Ratings" and "Similar Products", from the Data Collection that contains the book descriptions in XML format. Finally, the Scoring Ranking module combines the scores of documents produced by the IR Engine and Graph Analysis modules and reranks them.

Figure 4: Architecture of document retrieval approach based on graph of documents

In the following, the collection of documents is denoted by C; each document d in C has a unique ID. The set of queries, called topics, is denoted by T. The set D_init ⊂ C refers to the documents returned by the initial retrieval model, and D_ti denotes the documents retrieved for topic t_i ∈ T. StartingNode identifies a document from D_init used as input to the graph processing algorithms in the DGD. The set of documents present in the graph is denoted by S.

5.1 Our Approach
The DGD network contains useful information about documents that can be exploited for document retrieval. Our approach relies first on the results of a traditional retrieval engine, and then on the DGD network to find new documents. The underlying assumption is that the suggestions given by Amazon can be relevant to the user's query.

Algorithm 1 takes as inputs: D_init, the list of documents returned for each topic by the retrieval techniques described in Section 4; the DGD network; and a parameter β, the number of top documents from D_init selected as starting nodes, denoted D_StartingNodes. We fixed β to 100 (10% of the returned list for each topic). The algorithm returns a list of recommendations for each topic, denoted D_final. It processes topic by topic, extracting the list of all neighbors of each StartingNode and computing mutual shortest paths between all selected starting nodes in the DGD. The two lists (neighbors and nodes on the computed shortest paths) are concatenated and duplicate nodes are removed; the resulting set of documents is denoted D_graph. A second concatenation is performed between the initial document list and D_graph (again removing duplicates), giving the final list of retrieved documents, D_final, which is reranked using different reranking schemes.

Algorithm 1 Retrieving based on DGD feedback
1: D_init ← retrieve documents for each t_i ∈ T
2: for each D_ti ∈ D_init do
3:   D_StartingNodes ← first β documents ∈ D_ti
4:   for each StartingNode in D_StartingNodes do
5:     D_graph ← D_graph + neighbors(StartingNode, DGD)
6:     D_SPnodes ← all D ∈ ShortestPath(StartingNode, D_StartingNodes, DGD)
7:     D_graph ← D_graph + D_SPnodes
8:   delete all duplicates from D_graph
9:   D_final ← D_final + (D_ti + D_graph)
10:  delete all duplicates from D_final
11:  rerank D_final

Figure 5 shows an illustration of the document retrieval approach based on DGD feedback.

Figure 5: Book retrieval approach based on DGD feedback. Numbers on the arrows refer to the instructions in Algorithm 1
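A compact NetworkX rendering of Algorithm 1 follows. It is a sketch, not the authors' implementation: it assumes the DGD built above, treats "neighbors" as the successors of a starting node (the paper does not spell out the direction), and leaves the reranking of line 11 to Section 6.2.

```python
import networkx as nx

def dgd_feedback(dgd, d_init, beta=100):
    """Expand each topic's result list with DGD neighbors and with nodes
    on shortest paths between starting nodes (Algorithm 1 sketch).

    `d_init` maps each topic id to its ranked list of ISBNs."""
    d_final = {}
    for topic, docs in d_init.items():
        starting_nodes = [d for d in docs[:beta] if d in dgd]
        d_graph = []
        for node in starting_nodes:
            d_graph.extend(dgd.successors(node))  # neighbors in the DGD
            for other in starting_nodes:
                if other == node:
                    continue
                try:
                    # nodes lying on the path between two starting nodes
                    d_graph.extend(nx.shortest_path(dgd, node, other))
                except nx.NetworkXNoPath:
                    pass
        # concatenate and drop duplicates, keeping the initial ranking first
        seen, merged = set(), []
        for d in docs + d_graph:
            if d not in seen:
                seen.add(d)
                merged.append(d)
        d_final[topic] = merged  # to be reranked (Section 6.2)
    return d_final
```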
6. EXPERIMENTS AND RESULTS
In this section, we describe the experimental setup and present the different reranking schemes used in the approaches defined above. We then discuss the results achieved with the InL2 retrieval model, with its combination with the SDM model, and with the retrieval system based on the DGD network.

6.1 Experiments setup
For our experiments, we used different tools that implement the retrieval models and handle the graph processing. First, we used Terrier (TERabyte RetrIEveR)6, the Information Retrieval framework developed at the University of Glasgow [21–23]. Terrier is a modular platform for the rapid development of large-scale IR applications, providing indexing and retrieval functionalities. It is based on the DFR framework, and we used it to deploy the InL2 model described in Section 4.1. Further information about Terrier can be found at http://ir.dcs.gla.ac.uk/terrier.

A preprocessing step converted the INEX SBS corpus into the TREC Collection Format7. Considering the content of all tags in each XML file to be important for indexing, each XML file was transformed into one document identified by its ISBN; we thus need only two tags instead of all the tags in the XML: the ISBN and the whole content (named text).

Secondly, Indri8, the Lemur toolkit for language modeling and information retrieval, was used to run the language model (SDM) described in Section 4.2. Indri is a framework that provides state-of-the-art text search methods and a rich structured query language for large collections (up to 50 million documents). It is part of the Lemur project, developed by researchers from UMass and Carnegie Mellon University. We used the Porter stemmer and performed Bayesian smoothing with Dirichlet priors (Dirichlet prior µ = 1500).

In Section 5.1, we described our approach based on the DGD, which includes graph processing. We used the NetworkX9 Python library to perform shortest path computation, neighborhood extraction and PageRank calculation.

To evaluate the results of the retrieval systems, several measures are used in the SBS task: normalized Discounted Cumulative Gain (nDCG), the most popular measure in IR [11]; Mean Average Precision (MAP), the mean of the average precisions over a set of queries; and two further measures, Reciprocal Rank (Recip Rank) and Precision at rank 10 (P@10).

6 http://terrier.org/
7 http://lab.hypotheses.org/1129
8 http://www.lemurproject.org/indri/
9 https://networkx.github.io/

6.2 Reranking Schemes
Two approaches were proposed. The first one (see Section 4.3) merges the results of two different information retrieval models, the language model (SDM) and the DFR model (InL2). For each topic t_i, each model returns 1000 documents, and each retrieved document has an associated score. The linear combination method uses the following formula to calculate the final score of each document d retrieved by the SDM and InL2 models:

    S_final(d, t_i) = α · S_InL2(d, t_i) + (1 − α) · S_SDM(d, t_i)

where S_InL2(d, t_i) and S_SDM(d, t_i) are normalized scores, and α is the interpolation parameter, set to 0.8 after several tests on the 2014 topics.

The second approach (described in Section 5.1) uses the DGD constructed from the "Similar Products" information. The document set returned by the retrieval model is fused with the documents in the neighbors set and the shortest path results. We tested several reranking methods that combine the retrieval model scores with other scores based on social information. For each document in the resulting list, we calculated the following scores (illustrated in the sketch after this list):

• PageRank, computed using the NetworkX tool. PageRank is a well-known algorithm that exploits link structure to score the importance of nodes in a graph; it has usually been applied to hyperlink graphs such as the Web [24]. The PageRank values are given by the following formula:

    PR(A) = (1 − d) + d · (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))

where document A has documents T_1...T_n pointing to it (i.e., A is among their Similar Products), the parameter d is a damping factor set between 0 and 1 (0.85 in our case), and C(A) is the number of links going out of page A.

• Likeliness, computed from information generated by users (reviews and ratings). It is based on the idea that the more reviews and good ratings a book has, the more interesting it is (it may not be a good or popular book, but a book that has a high impact):

    Likeliness(D) = log(#reviews(D)) × (Σ_{r∈R_D} r) / #reviews(D)

where #reviews(D) is the number of reviews attributed to D and R_D is the set of reviews of D.
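A minimal sketch of these two scores and of one possible way to fold PageRank back into the ranking is given below. The max-normalization and the damping factor d = 0.85 follow the text; the mixing weight w and the exact combination used in each submitted run are a judgment call here, since the paper only describes the general scheme (and notes the danger of over-favoring high-PageRank documents).

```python
import math
import networkx as nx

def likeliness(ratings):
    """Likeliness(D) = log(#reviews(D)) * average rating of D."""
    n = len(ratings)
    return math.log(n) * (sum(ratings) / n) if n > 0 else 0.0

def rerank_with_pagerank(docs, dgd, retrieval_scores, w=0.8):
    """Mix max-normalized retrieval scores with max-normalized PageRank."""
    pr = nx.pagerank(dgd, alpha=0.85)  # damping factor d = 0.85
    max_rs = max((retrieval_scores.get(d, 0.0) for d in docs), default=0.0) or 1.0
    max_pr = max((pr.get(d, 0.0) for d in docs), default=0.0) or 1.0
    def combined(d):
        return (w * retrieval_scores.get(d, 0.0) / max_rs
                + (1 - w) * pr.get(d, 0.0) / max_pr)
    return sorted(docs, key=combined, reverse=True)
```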
The computed scores were normalized using the formula normalized_score = old_score / max_score. To combine the results of the retrieval systems with each of the normalized scores, an intuitive solution is to weight the retrieval model scores by the scores described above (normalized PageRank and Likeliness). However, this would favor documents with high PageRank and Likeliness scores even when their content is much less related to the topic.

6.3 Results
We used the topic set provided by the INEX SBS task in 2014 (680 topics). The systems retrieve 1000 documents per topic. We assessed the narrative field of each topic and automatically classified the topic set into two genres: Analogue topics (261), in which users cite books they have already read (generally titles and authors) in order to obtain similar books, and Non-Analogue topics (356), in which users describe their needs by defining a theme, a field of interest, an event, etc., without citing other books. Note that 63 topics were ignored because of their ambiguity.

In order to evaluate the IR methodologies described in Sections 4.3 and 5, we performed retrieval for each topic genre individually. The experimental results, describing the performance of the different retrieval systems on the Amazon/LibraryThing document collection, are shown in Table 1.

As illustrated in Table 1, the system that combines the probabilistic model InL2 and the language model SDM (InL2_SDM) achieves a significant improvement over the InL2 baseline on each topic set, but the improvement is largest on the Non-Analogue topic set, where the content of the queries is more explicit than in the other topic set. This improvement is mainly due to the increase in the number of relevant documents retrieved by both systems.

The results of the run InL2_DGD_PR on the Analogue topic set confirm that exploiting linked documents and reranking with PageRank significantly improves performance; in contrast, it lowers the baseline performance on the Non-Analogue topic set. This can be explained by the fact that Analogue topics contain examples of books (Figure 6), which the graph can use to extract similar connected books.

Figure 6: Examples of narratives in Analogue topics

Using the Likeliness scores to rerank the retrieved documents (run InL2_DGD_LK) significantly decreases the baseline effectiveness on both topic sets. This means that the ratings given by users do not provide any improvement to the reranking performance.

Figure 7 compares the numbers of topics with improved, deteriorated and unchanged results between the baseline (InL2) and the proposed retrieval systems, in terms of the MAP measure. The proposed systems based on the DGD graph yield the highest number of improved topics, compared with the combination of IR systems. More precisely, using PageRank to rerank documents produces better results in terms of improved topics.

Figure 7: Histograms that compare the numbers of topics with improved, deteriorated and unchanged results using the proposed approaches, for the MAP measure. (Baseline: InL2)
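The per-topic comparison behind Figure 7 amounts to the following count; `baseline_map` and `run_map` are assumed dicts from topic id to the topic's average precision.

```python
def compare_runs(baseline_map, run_map, eps=1e-9):
    """Count topics improved, deteriorated or unchanged in (M)AP
    with respect to the baseline, as plotted in Figure 7."""
    improved = deteriorated = same = 0
    for topic, base in baseline_map.items():
        diff = run_map.get(topic, 0.0) - base
        if diff > eps:
            improved += 1
        elif diff < -eps:
            deteriorated += 1
        else:
            same += 1
    return improved, deteriorated, same
```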
These results prove the positive impact of linked structure on document retrieval systems for book recommendation. They also confirm that we start from a competitive baseline, suggesting that the improvements contributed by combining retrieval system outputs with social link analysis are indeed meaningful.

Table 1: Experimental results. The runs are ranked according to nDCG@10. (∗) denotes significance according to the Wilcoxon test [9]. In all cases, our tests produced two-sided p-values, α = 0.05.

                                    Analogue topics                                       Non-Analogue topics
Run          nDCG@10        Recip Rank     MAP            P@10          | nDCG@10        Recip Rank      MAP            P@10
InL2         0.1099         0.267          0.072          0.078         | 0.138          0.207           0.117          0.0579
InL2_SDM     0.1115 (+1%∗)  0.271 (+1%∗)   0.073 (+0.6%)  0.079 (+1%∗)  | 0.147 (+6%∗)   0.222 (+7%∗)    0.124 (+5%∗)   0.0630 (+8%∗)
InL2_DGD_PR  0.1111 (+1%∗)  0.277 (+3%∗)   0.068 (−5%∗)   0.082 (+12%)  | 0.127 (−7%∗)   0.206 (−0.6%∗)  0.102 (−12%∗)  0.0570 (−1%∗)
InL2_DGD_LK  0.1043 (−5%)   0.275 (+2%)    0.064 (−11%∗)  0.082 (+5%)   | 0.130 (−5%)    0.214 (+3%∗)    0.100 (−14%∗)  0.0676 (+16%)

7. HUMANITIES AND SOCIAL SCIENCES COLLECTION: GRAPH MODELING AND RECOMMENDATION
We tested the proposed recommendation approach based on linked documents on the Revues.org10 collection. Revues.org is one of the four platforms of the OpenEdition11 portal dedicated to electronic resources in the humanities and social sciences (books, journals, research blogs, and academic announcements). Revues.org was founded in 1999 and today hosts over 400 online journals, i.e. 149,000 articles, proceedings and editorials.

We built a network of documents from the ASp12 journal, which publishes research articles, publication listings and reviews related to the field of English for Specific Purposes (ESP), for both teaching and research. The network contains 500 documents and 833 relationships representing bibliographic citations. Each relationship is constructed using BILBO [12], a reference parsing software trained on annotated corpora of Digital Humanities articles from the OpenEdition Revues.org platform. BILBO automatically annotates the bibliographic references in the bibliography section of each document and obtains the corresponding DOI (Digital Object Identifier) via the CrossRef13 API when such an identifier exists.

Each node in the citation network has a set of properties: an ID, which is its URL; a type, which can be article, editorial, book review, etc.; and the number of readers' clicks, which we call popularity. The recommender system applied to this network takes as input a user query, generally a small set of short keywords, and performs a retrieval step using the Solr14 search engine. The system extends the returned results with documents from the citation network, using graph algorithms (neighborhood search and shortest path computation) as described in Section 5.1. After that, we rerank the documents according to the popularity property of each document.

We tested the system manually on a small set of user queries and found that, for most queries, the results were satisfying.

10 http://www.revues.org/
11 http://www.openedition.org
12 http://www.openedition.org/6457
13 http://www.crossref.org/
14 http://lucene.apache.org/solr/
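A sketch of this pipeline, under the same assumptions as the Algorithm 1 sketch above; `solr_search` is a stand-in for the Solr retrieval step (the paper does not specify the client or its parameters), and `popularity` is the node property described in the text.

```python
import networkx as nx

def recommend(query, citation_graph, solr_search, top_k=20):
    """Retrieve with Solr, expand through the citation network
    (neighbors + shortest paths), then rerank by popularity."""
    seeds = [d for d in solr_search(query) if d in citation_graph]
    expanded = list(seeds)
    for s in seeds:
        expanded.extend(citation_graph.successors(s))  # cited documents
        for t in seeds:
            if s != t:
                try:
                    expanded.extend(nx.shortest_path(citation_graph, s, t))
                except nx.NetworkXNoPath:
                    pass
    unique = list(dict.fromkeys(expanded))  # drop duplicates, keep order
    unique.sort(key=lambda d: citation_graph.nodes[d].get("popularity", 0),
                reverse=True)
    return unique[:top_k]
```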
8. CONCLUSION AND FUTURE WORK
In this paper, we proposed and evaluated approaches to document retrieval in the context of book recommendation. We used the test collection of the CLEF Labs Social Book Search track and the topics proposed in 2014, divided into the two classes Analogue and Non-Analogue.

We presented a first approach that combines the outputs of a probabilistic model (InL2) and a language model (SDM), using a linear interpolation after normalizing the scores of each retrieval system. We have shown a significant improvement over the baseline results using this combination.

A novel approach was then proposed, based on the Directed Graph of Documents (DGD) constructed from social relationships. It exploits link structure to enrich the document list returned by a traditional retrieval model (InL2). We applied a reranking method using the PageRank and Likeliness of each retrieved document.

In the future, we would like to construct an evaluation corpus from the Revues.org collection and develop an evaluation process similar to that of the INEX SBS task. Another interesting extension of our work would be to use learning-to-rank techniques to automatically adjust the settings of the reranking parameters.

9. ACKNOWLEDGMENT
This work was supported by the French program Investissements d'Avenir FSN and the French Région PACA under the projects InterTextes and Agoraweb.

10. REFERENCES
[1] M. Abolhassani and N. Fuhr. Applying the divergence from randomness approach for content-only search in XML documents. Pages 409–419, 2004.
[2] G. Amati and C. J. Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389, Oct. 2002.
[3] G. Amati and C. J. van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389, October 2002.
[4] T. Beckers, N. Fuhr, N. Pharo, R. Nordlie, and K. N. Fachry. Overview and results of the INEX 2009 interactive track. In Research and Advanced Technology for Digital Libraries, 14th European Conference, ECDL 2010, Glasgow, UK, September 6-10, 2010, Proceedings, pages 409–412, 2010.
[5] N. J. Belkin, P. B. Kantor, E. A. Fox, and J. A. Shaw. Combining the evidence of multiple query representations for information retrieval. Inf. Process. Manage., 31(3):431–448, 1995.
[6] C. Benkoussas, H. Hamdan, S. Albitar, A. Ollagnier, and P. Bellot. Collaborative filtering for book recommendation. In Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014, pages 501–507, 2014.
[7] L. Bonnefoy, R. Deveaud, and P. Bellot. Do social information help book search? In P. Forner, J. Karlgren, and C. Womser-Hacker, editors, CLEF (Online Working Notes/Labs/Workshop), 2012.
[8] H. Bouchard and J.-Y. Nie. Modèles de langue appliqués à la recherche d'information contextuelle. In CORIA, pages 213–224. Université de Lyon, 2006.
[9] W. B. Croft. Organizing and searching large files of document descriptions. PhD thesis, Cambridge University, 1978.
[10] R. Guillén. GIR with language modeling and DFR using Terrier. In C. Peters, T. Deselaers, N. Ferro, J. Gonzalo, G. Jones, M. Kurimo, T. Mandl, A. Peñas, and V. Petras, editors, Evaluating Systems for Multilingual and Multimodal Information Access, volume 5706 of Lecture Notes in Computer Science, pages 822–829. Springer Berlin Heidelberg, 2009.
[11] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In E. Yannakoudakis, N. Belkin, P. Ingwersen, and M.-K. Leong, editors, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), pages 41–48, New York, NY, USA, 2000. ACM.
[12] Y.-M. Kim, P. Bellot, E. Faath, and M. Dacos. Automatic annotation of bibliographical references in digital humanities books, articles and blogs. In G. Kazai, C. Eickhoff, and P. Brusilovsky, editors, BooksOnline, pages 41–48. ACM, 2011.
[13] M. Koolen, T. Bogers, J. Kamps, G. Kazai, and M. Preminger. Overview of the INEX 2014 social book search track. In Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014, pages 462–479, 2014.
[14] O. Kurland and L. Lee. PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proceedings of SIGIR, pages 306–313, 2005.
[15] J. H. Lee. Combining multiple evidence from different properties of weighting schemes. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '95, pages 180–188, New York, NY, USA, 1995. ACM.
[16] J. Lin. PageRank without hyperlinks: Reranking with PubMed related article networks for biomedical text retrieval. BMC Bioinformatics, 9(1), 2008.
[17] J. Lin. PageRank without hyperlinks: Reranking with PubMed related article networks for biomedical text retrieval. BMC Bioinformatics, 9(1), 2008.
[18] D. Metzler and W. B. Croft. Combining the language model and inference network approaches to retrieval. Inf. Process. Manage., 40(5):735–750, 2004.
[19] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pages 472–479, New York, NY, USA, 2005. ACM.
[20] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In R. A. Baeza-Yates, N. Ziviani, G. Marchionini, A. Moffat, and J. Tait, editors, SIGIR, pages 472–479. ACM, 2005.
[21] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A high performance and scalable information retrieval platform. In Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), 2006.
[22] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and D. Johnson. Terrier information retrieval platform. In Proceedings of the 27th European Conference on IR Research (ECIR 2005), volume 3408 of Lecture Notes in Computer Science, pages 517–519. Springer, 2005.
[23] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier: a search engine for advanced retrieval on the web. Novatica/UPGRADE Special Issue on Web Information Access, 2007.
[24] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, pages 161–172, Brisbane, Australia, 1998.
[25] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number: SIDL-WP-1999-0120.
[26] V. Plachouras, B. He, and I. Ounis. University of Glasgow at TREC 2004: Experiments in web, robust, and terabyte tracks with Terrier. In E. M. Voorhees and L. P. Buckland, editors, TREC, volume Special Publication 500-261. National Institute of Standards and Technology (NIST), 2004.
[27] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proc. SIGIR, 1998.
[28] S. E. Robertson, C. J. van Rijsbergen, and M. F. Porter. Probabilistic models of indexing and searching. In SIGIR, pages 35–56, 1980.
[29] F. Song and W. Croft. A general language model for information retrieval. In Proceedings of the SIGIR Conference on Information Retrieval, 1999.
[30] T. Tao, X. Wang, Q. Mei, and C. Zhai. Language model information retrieval with document expansion. In R. C. Moore, J. A. Bilmes, J. Chu-Carroll, and M. Sanderson, editors, HLT-NAACL. The Association for Computational Linguistics, 2006.
[31] C. Zhai. Statistical Language Models for Information Retrieval. Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers, 2008.